Feat: Add cloud `CodeChunker` + tests by chonknick · Pull Request #157 · chonkie-inc/chonkie · GitHub

Feat: Add cloud CodeChunker + tests #157

Merged
merged 9 commits into from
May 23, 2025
2 changes: 2 additions & 0 deletions src/chonkie/cloud/__init__.py
@@ -2,6 +2,7 @@

from .chunker import (
    CloudChunker,
    CodeChunker,
    LateChunker,
    NeuralChunker,
    RecursiveChunker,
@@ -20,6 +21,7 @@
    "SentenceChunker",
    "LateChunker",
    "SDPMChunker",
    "CodeChunker",
    "NeuralChunker",
    "SlumberChunker",
]
2 changes: 2 additions & 0 deletions src/chonkie/cloud/chunker/__init__.py
@@ -1,6 +1,7 @@
"""Module for Chonkie Cloud Chunkers."""

from .base import CloudChunker
from .code import CodeChunker
from .late import LateChunker
from .neural import NeuralChunker
from .recursive import RecursiveChunker
@@ -18,6 +19,7 @@
    "SentenceChunker",
    "LateChunker",
    "SDPMChunker",
    "CodeChunker",
Contributor

style: Consider moving 'CodeChunker' between 'CloudChunker' and 'LateChunker' to maintain alphabetical ordering in __all__.

    "NeuralChunker",
    "SlumberChunker",
]
114 changes: 114 additions & 0 deletions src/chonkie/cloud/chunker/code.py
@@ -0,0 +1,114 @@
"""Code Chunking for Chonkie API."""

import os
from typing import Dict, List, Literal, Optional, Union, cast

import requests

from .base import CloudChunker


class CodeChunker(CloudChunker):
    """Code Chunking for Chonkie API."""

    BASE_URL = "https://api.chonkie.ai"
    VERSION = "v1"

    def __init__(
        self,
        tokenizer_or_token_counter: str = "gpt2",
Contributor

medium

The type hint Union[str, List] for tokenizer_or_token_counter is a bit ambiguous in the context of this cloud client.

  1. If List is intended to allow, for example, a list of pre-computed token IDs (List[int]) or a list of tokenizer names for some advanced API feature, the payload ("tokenizer_or_token_counter": self.tokenizer_or_token_counter) would send this list. Does the Chonkie API endpoint /v1/chunk/code actually support receiving a list for this parameter? The current tests only demonstrate usage with string values (e.g., "gpt2").
  2. If the API endpoint for this cloud chunker only accepts a string for tokenizer_or_token_counter, then including List in the Union might be misleading for users of this specific class.

Could you clarify the intended use and API support for List here? If it's only ever a string for this cloud chunker, str might be a more precise type hint. If List is supported, adding a test case for it would be beneficial.
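
If the endpoint turns out to accept only a tokenizer name, the narrower `str` hint could be paired with a runtime check so that a list fails early on the client instead of at the API. A minimal sketch, assuming the API expects a non-empty string; `validate_tokenizer_name` is a hypothetical helper and not part of Chonkie:

```python
def validate_tokenizer_name(tokenizer: object) -> str:
    """Return the tokenizer name unchanged, or raise if it is not a non-empty string."""
    # Assumption: the cloud endpoint accepts only a tokenizer name such as "gpt2".
    if not isinstance(tokenizer, str) or not tokenizer:
        raise ValueError("tokenizer_or_token_counter must be a non-empty string.")
    return tokenizer
```

Calling it from `__init__` before assigning the attribute would keep the type hint and the runtime behavior in agreement.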

        chunk_size: int = 512,
        language: Union[Literal["auto"], str] = "auto",
        return_type: Literal["texts", "chunks"] = "chunks",
        api_key: Optional[str] = None,
    ) -> None:
        """Initialize the Cloud CodeChunker.

        Args:
            tokenizer_or_token_counter: The tokenizer or token counter to use.
            chunk_size: The size of the chunks to create.
            language: The language of the code to parse. Accepts any of the languages supported by tree-sitter-language-pack.
            return_type: The type of the return value.
            api_key: The API key for the Chonkie API.

        Raises:
            ValueError: If the API key is not provided or if parameters are invalid.

        """
        # If no API key is provided, use the environment variable
        self.api_key = api_key or os.getenv("CHONKIE_API_KEY")
        if not self.api_key:
            raise ValueError(
                "No API key provided. Please set the CHONKIE_API_KEY environment variable"
                + " or pass an API key to the CodeChunker constructor."
            )

        # Validate parameters
        if chunk_size <= 0:
            raise ValueError("Chunk size must be greater than 0.")
        if return_type not in ["texts", "chunks"]:
            raise ValueError("Return type must be either 'texts' or 'chunks'.")

        # Assign all the attributes to the instance
        self.tokenizer_or_token_counter = tokenizer_or_token_counter
        self.chunk_size = chunk_size
        self.language = language
        self.return_type = return_type

        # Check if the API is up right now
        response = requests.get(f"{self.BASE_URL}/")
        if response.status_code != 200:
            raise ValueError(
                "Oh no! You caught Chonkie at a bad time. It seems to be down right now."
                + " Please try again in a short while."
                + " If the issue persists, please contact support at support@chonkie.ai or raise an issue on GitHub."
            )
Comment on lines +58 to +65
Contributor

style: API health check in constructor could cause initialization failures in offline environments or when API is temporarily down. Consider moving this to a separate method that can be called explicitly.

Comment on lines +59 to +65
Contributor

medium

The API health check (requests.get(f"{self.BASE_URL}/")) performed in the __init__ method could introduce some potential issues:

  • Performance Impact: It adds network latency each time a CodeChunker object is instantiated.
  • Usability Concern: If the API's base endpoint (/) is temporarily unresponsive, CodeChunker objects cannot be created. This could be problematic even if the user only intends to configure the object without immediately calling chunk().
  • Resource Usage: Frequent instantiation could lead to unnecessary network traffic to the health check endpoint.

Have you considered alternatives, such as:

  • Performing this check lazily, just before the first actual API call in the chunk() method?
  • Making it an optional, explicit health check method that users can invoke if they need to verify connectivity?

This change could improve instantiation performance and make the class more resilient to transient network or API issues.
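
The lazy alternative might look roughly like the sketch below. It is not the PR's code: `LazyHealthCheck` and its `probe` argument are hypothetical names, and the probe is injected as a callable so the example runs without network access. In the real class the probe would presumably wrap the existing `requests.get(f"{self.BASE_URL}/")` call.

```python
from typing import Callable


class LazyHealthCheck:
    """Run a connectivity probe on first use instead of at construction time."""

    def __init__(self, probe: Callable[[], bool]) -> None:
        # `probe` is a hypothetical callable that returns True when the API is
        # reachable; constructing this object performs no network traffic.
        self._probe = probe
        self._verified = False

    def ensure_up(self) -> None:
        """Probe until one check succeeds; raise ValueError while the API looks down."""
        if self._verified:
            return  # A previous probe already succeeded, so skip the round trip.
        if not self._probe():
            raise ValueError("The API appears to be down. Please try again later.")
        self._verified = True
```

Calling `ensure_up()` at the top of `chunk()` would keep instantiation free of network latency, and because only a successful probe is cached, a transient outage does not permanently disable the object.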


    def chunk(self, text: Union[str, List[str]]) -> List[Dict]:
Contributor

high

The current return type annotation -> List[Dict] appears to be incorrect when the input text parameter is a List[str] (for batch processing).

Your tests (specifically test_cloud_code_chunker_batch) correctly assert that for a list of input texts, the result is a list of lists of chunks (List[List[Dict]]). However, this method's signature and the cast on line 105 (result: List[Dict] = cast(List[Dict], response.json())) do not reflect this batch behavior.

This discrepancy can lead to type errors and confusion for users of the CodeChunker.

To address this, you could:

  1. Change the return type annotation to Union[List[Dict], List[List[Dict]]].
  2. Adjust the logic around line 105 to correctly cast or type response.json() based on whether the input text was a single string or a list of strings.
  3. For a more type-safe API, consider using typing.overload to define distinct signatures for single string input and list-of-strings input.

At a minimum, the return type annotation should be updated to reflect the possible List[List[Dict]] structure.
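
For reference, option 3 could be sketched as follows. `ChunkerSketch` and `_request` are hypothetical stand-ins: the fake `_request` only mimics the response shape the batch test describes (a flat list for one string, a list of lists for a list of strings) and does not call the API.

```python
from typing import Dict, List, Union, overload


class ChunkerSketch:
    """Hypothetical stand-in for CodeChunker, used only to show the overloads."""

    def _request(self, text: Union[str, List[str]]):
        # Placeholder for the HTTP call: returns one fake chunk dict per input text.
        if isinstance(text, str):
            return [{"text": text}]
        return [[{"text": item}] for item in text]

    @overload
    def chunk(self, text: str) -> List[Dict]: ...

    @overload
    def chunk(self, text: List[str]) -> List[List[Dict]]: ...

    def chunk(
        self, text: Union[str, List[str]]
    ) -> Union[List[Dict], List[List[Dict]]]:
        # Single implementation; the overloads above only guide type checkers.
        return self._request(text)
```

With these signatures a type checker infers `List[Dict]` for `chunk("...")` and `List[List[Dict]]` for `chunk(["...", "..."])`, so callers do not need to narrow a `Union` themselves.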

Suggested change
-    def chunk(self, text: Union[str, List[str]]) -> List[Dict]:
+    def chunk(self, text: Union[str, List[str]]) -> Union[List[Dict], List[List[Dict]]]:

        """Chunk the code into a list of chunks.

        Args:
            text: The code text(s) to chunk.

        Returns:
            A list of chunk dictionaries containing the chunked code.

        Raises:
            ValueError: If the API request fails or returns invalid data.

        """
        # Define the payload for the request
        payload = {
            "text": text,
            "tokenizer_or_token_counter": self.tokenizer_or_token_counter,
            "chunk_size": self.chunk_size,
            "language": self.language,
            "include_nodes": False,  # API doesn't support tree-sitter nodes
            "return_type": self.return_type,
        }

        # Make the request to the Chonkie API
        response = requests.post(
            f"{self.BASE_URL}/{self.VERSION}/chunk/code",
            json=payload,
            headers={"Authorization": f"Bearer {self.api_key}"},
        )

        # Check if the response is successful
        if response.status_code != 200:
            raise ValueError(
                f"Error from the Chonkie API: {response.status_code} {response.text}"
            )

        # Parse the response
        try:
            result: List[Dict] = cast(List[Dict], response.json())
        except Exception as error:
            raise ValueError(f"Error parsing the response: {error}") from error
Comment on lines +106 to +107
Contributor

medium

Catching a generic Exception when parsing the JSON response is quite broad. This can make debugging more difficult as it might catch and obscure unrelated errors that are not specific to JSON decoding.

Would it be possible to catch a more specific exception here? For instance, requests.exceptions.JSONDecodeError (if response.json() from the requests library is used and can raise this) or json.JSONDecodeError would be more targeted to issues during the parsing of the JSON response.

Suggested change
-        except Exception as error:
-            raise ValueError(f"Error parsing the response: {error}") from error
+        except requests.exceptions.JSONDecodeError as error:
+            raise ValueError(f"Error parsing the response: {error}") from error
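
The narrower handler can be illustrated without the network by using the standard-library `json` module in place of `response.json()`. `parse_chunks` is a hypothetical helper written for this sketch. In the PR itself the exception would be `requests.exceptions.JSONDecodeError`, which is available in recent versions of requests (2.27 and later, to the best of my knowledge).

```python
import json
from typing import Dict, List, cast


def parse_chunks(body: str) -> List[Dict]:
    """Decode a JSON response body, converting only decode failures to ValueError."""
    try:
        return cast(List[Dict], json.loads(body))
    except json.JSONDecodeError as error:
        # Only malformed JSON is caught here; unrelated errors propagate untouched.
        raise ValueError(f"Error parsing the response: {error}") from error
```

Because the handler names the decode error explicitly, a bug elsewhere in the `try` body (for example an `AttributeError`) would surface with its original traceback rather than being reported as a parsing problem.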


        # Return the result
        return result

    def __call__(self, text: Union[str, List[str]]) -> List[Dict]:
        """Call the chunker."""
        return self.chunk(text)