Feat: Add cloud CodeChunker
+ tests
#157
Changes from all commits
73d0588
8fd99e9
1168dcf
50c72a3
46f1951
e6fd513
d35089b
7b89fb8
98d122d
@@ -0,0 +1,114 @@
"""Code Chunking for Chonkie API."""

import os
from typing import Dict, List, Literal, Optional, Union, cast

import requests

from .base import CloudChunker


class CodeChunker(CloudChunker):
    """Code Chunking for Chonkie API."""

    BASE_URL = "https://api.chonkie.ai"
    VERSION = "v1"

    def __init__(
        self,
        tokenizer_or_token_counter: str = "gpt2",
The type hint
Could you clarify the intended use and API support for
        chunk_size: int = 512,
        language: Union[Literal["auto"], str] = "auto",
        return_type: Literal["texts", "chunks"] = "chunks",
        api_key: Optional[str] = None,
    ) -> None:
        """Initialize the Cloud CodeChunker.

        Args:
            tokenizer_or_token_counter: The tokenizer or token counter to use.
            chunk_size: The size of the chunks to create.
            language: The language of the code to parse. Accepts any of the languages supported by tree-sitter-language-pack.
            return_type: The type of the return value.
            api_key: The API key for the Chonkie API.

        Raises:
            ValueError: If the API key is not provided or if parameters are invalid.

        """
        # If no API key is provided, use the environment variable
        self.api_key = api_key or os.getenv("CHONKIE_API_KEY")
        if not self.api_key:
            raise ValueError(
                "No API key provided. Please set the CHONKIE_API_KEY environment variable"
                + " or pass an API key to the CodeChunker constructor."
            )

        # Validate parameters
        if chunk_size <= 0:
            raise ValueError("Chunk size must be greater than 0.")
        if return_type not in ["texts", "chunks"]:
            raise ValueError("Return type must be either 'texts' or 'chunks'.")

        # Assign all the attributes to the instance
        self.tokenizer_or_token_counter = tokenizer_or_token_counter
        self.chunk_size = chunk_size
        self.language = language
        self.return_type = return_type

        # Check if the API is up right now
        response = requests.get(f"{self.BASE_URL}/")
        if response.status_code != 200:
            raise ValueError(
                "Oh no! You caught Chonkie at a bad time. It seems to be down right now."
                + " Please try again in a short while."
                + " If the issue persists, please contact support at support@chonkie.ai or raise an issue on GitHub."
            )
Comment on lines +58 to +65
style: API health check in constructor could cause initialization failures in offline environments or when the API is temporarily down. Consider moving this to a separate method that can be called explicitly.
Comment on lines +59 to +65
The API health check (
Have you considered alternatives, such as:
This change could improve instantiation performance and make the class more resilient to transient network or API issues.
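A minimal sketch of the "separate method" idea raised in the two comments above, not the PR's code: the check runs only when `is_available()` is called, and the result is cached. `LazyHealthCheck`, `is_available`, and the injectable `get_status` callable are hypothetical names, and the sketch uses the standard library's `urllib` rather than `requests` so that it stays self-contained.

```python
import urllib.error
import urllib.request
from typing import Callable, Optional


def _default_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for a GET request to `url`."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # non-2xx responses still carry a status code


class LazyHealthCheck:
    """Hypothetical helper: checks the API on demand, not in __init__."""

    def __init__(
        self,
        base_url: str,
        get_status: Callable[[str], int] = _default_status,
    ) -> None:
        self.base_url = base_url
        self._get_status = get_status  # injectable so tests need no network
        self._healthy: Optional[bool] = None  # None means "not checked yet"

    def is_available(self, refresh: bool = False) -> bool:
        """Return True if the API root answers 200; cache the result."""
        if self._healthy is None or refresh:
            try:
                self._healthy = self._get_status(f"{self.base_url}/") == 200
            except OSError:
                # URLError, timeouts, and connection failures are OSError subclasses
                self._healthy = False
        return self._healthy
```

With this shape, constructing the object never touches the network, so offline instantiation cannot fail; a caller that wants the old behavior can call `is_available()` right after construction.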

    def chunk(self, text: Union[str, List[str]]) -> List[Dict]:
The current return type annotation Your tests (specifically This discrepancy can lead to type errors and confusion for users of the To address this, you could:
At a minimum, the return type annotation should be updated to reflect the possible
Suggested change
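The reviewer's suggested change is not preserved here. As an illustration only, one way to annotate a method whose payload depends on `return_type` is a union alias plus a runtime check. `ChunkResult` and `validate_result` are hypothetical names, and the sketch assumes that `"texts"` yields a flat list of strings and `"chunks"` a flat list of dictionaries; batched input may well be shaped differently.

```python
from typing import Any, Dict, List, Union

# Hypothetical alias covering both assumed response shapes.
ChunkResult = Union[List[str], List[Dict[str, Any]]]


def validate_result(data: Any, return_type: str) -> ChunkResult:
    """Check that `data` matches the shape assumed for `return_type`."""
    if not isinstance(data, list):
        raise ValueError(f"Expected a list, got {type(data).__name__}.")
    # Assumption: "texts" -> strings, anything else ("chunks") -> dictionaries.
    expected = str if return_type == "texts" else dict
    if not all(isinstance(item, expected) for item in data):
        raise ValueError(
            f"Expected every item to be a {expected.__name__} "
            f"for return_type={return_type!r}."
        )
    return data
```

Checking the shape at runtime keeps the annotation honest: a mismatch surfaces as a `ValueError` at the API boundary instead of as a type error further downstream.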
        """Chunk the code into a list of chunks.

        Args:
            text: The code text(s) to chunk.

        Returns:
            A list of chunk dictionaries containing the chunked code.

        Raises:
            ValueError: If the API request fails or returns invalid data.

        """
        # Define the payload for the request
        payload = {
            "text": text,
            "tokenizer_or_token_counter": self.tokenizer_or_token_counter,
            "chunk_size": self.chunk_size,
            "language": self.language,
            "include_nodes": False,  # API doesn't support tree-sitter nodes
            "return_type": self.return_type,
        }

        # Make the request to the Chonkie API
        response = requests.post(
            f"{self.BASE_URL}/{self.VERSION}/chunk/code",
            json=payload,
            headers={"Authorization": f"Bearer {self.api_key}"},
        )

        # Check if the response is successful
        if response.status_code != 200:
            raise ValueError(
                f"Error from the Chonkie API: {response.status_code} {response.text}"
            )

        # Parse the response
        try:
            result: List[Dict] = cast(List[Dict], response.json())
        except Exception as error:
            raise ValueError(f"Error parsing the response: {error}") from error
Comment on lines +106 to +107
Catching a generic Would it be possible to catch a more specific exception here? For instance,
Suggested change
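The suggestion itself is cut off above. As a sketch of narrower exception handling, assuming the body is decoded with the standard library's `json` module: `json.JSONDecodeError` is a subclass of `ValueError`, so callers that already catch `ValueError` keep working. `parse_chunk_response` is a hypothetical helper name, not part of the PR.

```python
import json
from typing import Any, Dict, List


def parse_chunk_response(body: str) -> List[Dict[str, Any]]:
    """Decode a JSON response body, failing only on decode or shape errors."""
    try:
        data = json.loads(body)
    except json.JSONDecodeError as error:
        # Only malformed JSON is translated; unrelated bugs still propagate.
        raise ValueError(f"Error parsing the response: {error}") from error
    if not isinstance(data, list):
        raise ValueError(f"Expected a JSON array, got {type(data).__name__}.")
    return data
```

Catching only the decode error means a programming mistake elsewhere in the `try` block is not silently relabeled as a parsing failure.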

        # Return the result
        return result

    def __call__(self, text: Union[str, List[str]]) -> List[Dict]:
        """Call the chunker."""
        return self.chunk(text)
style: Consider moving 'CodeChunker' between 'CloudChunker' and 'LateChunker' to maintain alphabetical ordering in __all__.