Feat: Add cloud CodeChunker + tests #157
- Introduced CodeChunker to the cloud chunker module.
- Updated __init__.py files to include CodeChunker in the exports for both the cloud and chunker modules.
Hello @chonknick, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
Hello! I'm gemini-code-assist, here to provide a summary of this pull request. This PR introduces a new CodeChunker class to the Chonkie Cloud API client library. The primary goal is to provide a dedicated chunker specifically designed for processing code, leveraging the Chonkie API's capabilities for code-aware splitting. The changes involve adding the new class implementation, making it accessible through the module's public API, and adding a comprehensive suite of tests to ensure its functionality and robustness.
Highlights
- New Feature: CodeChunker: Adds the CodeChunker class (src/chonkie/cloud/chunker/code.py), which interacts with the Chonkie Cloud API to perform code-specific chunking. It supports configuration for tokenizer, chunk size, language (including auto-detection), and return type ('texts' or 'chunks'). It includes validation for input parameters and handles API key management (via constructor or environment variable).
- Module Exports Updated: Updates src/chonkie/cloud/__init__.py and src/chonkie/cloud/chunker/__init__.py to import and export the new CodeChunker class, making it part of the public API for the chonkie.cloud and chonkie.cloud.chunker modules.
- Comprehensive Test Suite: Adds a new test file (tests/cloud/test_cloud_code_chunker.py) with extensive unit tests for the CodeChunker. Tests cover initialization validation, chunking for simple and complex code (Python, JavaScript), auto language detection, batch processing, handling empty/whitespace input, adherence to chunk size limits, continuity of chunk indices, and compatibility with different tokenizers. Tests are marked to be skipped if the CHONKIE_API_KEY environment variable is not set.
Changelog
- src/chonkie/cloud/__init__.py
  - Added CodeChunker to the import list from .chunker (line 5).
  - Added CodeChunker to the __all__ export list (line 24).
- src/chonkie/cloud/chunker/__init__.py
  - Added import for CodeChunker from .code (line 4).
  - Added CodeChunker to the __all__ export list (line 22).
- src/chonkie/cloud/chunker/code.py
  - Added the CodeChunker class definition.
  - Implemented the __init__ method to handle API key, validate chunk_size and return_type, set instance attributes, and check API reachability.
  - Implemented the chunk method to construct the API request payload, make a POST request to the /v1/chunk/code endpoint, handle API errors, and parse the JSON response.
  - Implemented the __call__ method as an alias for the chunk method.
- tests/cloud/test_cloud_code_chunker.py
  - Added a new test file for CodeChunker.
  - Added python_code and js_code pytest fixtures.
  - Added test_cloud_code_chunker_initialization to test constructor validation.
  - Added test_cloud_code_chunker_simple for basic chunking.
  - Added test_cloud_code_chunker_python_complex for more complex Python code, including reconstruction check.
  - Added test_cloud_code_chunker_javascript for JavaScript code, including reconstruction check.
  - Added test_cloud_code_chunker_auto_language to test auto-detection.
  - Added test_cloud_code_chunker_no_nodes_support to confirm node output is not expected (due to API).
  - Added test_cloud_code_chunker_batch to test processing a list of texts.
  - Added test_cloud_code_chunker_return_type_texts to test the 'texts' output format.
  - Added test_cloud_code_chunker_empty_text and test_cloud_code_chunker_whitespace_text for edge cases.
  - Added test_cloud_code_chunker_chunk_size_adherence to check token count limits.
  - Added test_cloud_code_chunker_indices_continuity to verify start/end indices.
  - Added test_cloud_code_chunker_different_tokenizers to test tokenizer options.
Code flows like a stream,
Chunked neatly, a programmer's dream.
Tests confirm the split,
Ensuring it is fit,
For models, a useful scheme.
PR Summary
Added a new CodeChunker class to the Chonkie Cloud API for intelligent code chunking with API integration and extensive test coverage.
- Implemented src/chonkie/cloud/chunker/code.py with API key validation, error handling, and support for multiple languages and tokenizers
- Added comprehensive test suite in tests/cloud/test_cloud_code_chunker.py covering edge cases, batch processing, and chunk size adherence
- Included API health check in constructor to provide immediate feedback if service is unavailable
- Added proper type hints and docstrings throughout the implementation
- Implemented chunk reconstruction validation to ensure lossless code splitting
4 file(s) reviewed, 4 comment(s)
@@ -18,6 +19,7 @@
    "SentenceChunker",
    "LateChunker",
    "SDPMChunker",
    "CodeChunker",
style: Consider moving 'CodeChunker' between 'CloudChunker' and 'LateChunker' to maintain alphabetical ordering in __all__.
# Check if the API is up right now
response = requests.get(f"{self.BASE_URL}/")
if response.status_code != 200:
    raise ValueError(
        "Oh no! You caught Chonkie at a bad time. It seems to be down right now."
        + " Please try again in a short while."
        + " If the issue persists, please contact support at support@chonkie.ai or raise an issue on GitHub."
    )
style: API health check in constructor could cause initialization failures in offline environments or when API is temporarily down. Consider moving this to a separate method that can be called explicitly.
result = code_chunker(" \n\t\n ")
# Should return empty list or minimal chunks
assert isinstance(result, list)
logic: Missing assertion for expected behavior with whitespace-only input. Should explicitly assert empty list or specific chunk count.
- result = code_chunker(" \n\t\n ")
- # Should return empty list or minimal chunks
- assert isinstance(result, list)
+ result = code_chunker(" \n\t\n ")
+ # Whitespace-only input should return either empty list or single chunk
+ assert isinstance(result, list)
+ assert len(result) <= 1, "Whitespace-only input should produce at most one chunk"
Code Review
This pull request introduces the CodeChunker for the Chonkie Cloud API, along with comprehensive unit tests. The implementation is generally solid, and the test coverage is good, addressing various languages, batching, and edge cases.
I have a few suggestions focusing on improving type safety for batch operations, the behavior of the API health check in the constructor, exception handling, and test assertion strength. Addressing the high-severity type mismatch issue is particularly important before merging.
Well done on the detailed test suite!
Summary of Findings
- Return Type Mismatch for Batch Processing: The chunk method in CodeChunker has a return type annotation List[Dict], but for batch input (List[str]), it actually returns List[List[Dict]]. This mismatch affects type safety and clarity. (File: src/chonkie/cloud/chunker/code.py, Lines: 67, 105)
- API Health Check in Constructor: Performing an API health check within the __init__ method of CodeChunker can lead to performance issues during instantiation and make the class unusable if the API is temporarily down. (File: src/chonkie/cloud/chunker/code.py, Lines: 59-65)
- Generic Exception Handling: The chunk method catches a broad Exception when parsing the API response, which can hide specific errors. It's recommended to catch more specific exceptions like requests.exceptions.JSONDecodeError. (File: src/chonkie/cloud/chunker/code.py, Line: 106)
- Ambiguous Type Hint for tokenizer_or_token_counter: The type hint Union[str, List] for tokenizer_or_token_counter in CodeChunker is unclear regarding how a List would be used or if the API supports it for this parameter. (File: src/chonkie/cloud/chunker/code.py, Line: 19)
- Weak Assertion in Whitespace Text Test: The test test_cloud_code_chunker_whitespace_text has a weak assertion (isinstance(result, list)). It should be strengthened to verify the expected output, likely an empty list. (File: tests/cloud/test_cloud_code_chunker.py, Line: 291-293)
- Missing API Error Handling Tests: The test suite for CodeChunker lacks tests that mock API failure scenarios (e.g., 4xx/5xx errors) to ensure proper exception handling by the client. This was not commented on directly due to review settings. (File: tests/cloud/test_cloud_code_chunker.py)
- Minor PEP 8/Stylistic Issues: Minor stylistic issues like missing newlines at the end of files and use of string concatenation instead of f-strings were observed. These were not commented on directly due to review settings. (Files: src/chonkie/cloud/chunker/code.py, tests/cloud/test_cloud_code_chunker.py)
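The missing API error handling tests mentioned in the findings could look roughly like the sketch below. It is only an illustration under stated assumptions: `chunk_via_api` is a hypothetical stand-in for the client's request logic, the `post` argument stands in for `requests.post`, and the payload shape is invented. Only the `/v1/chunk/code` endpoint name comes from the PR summary.

```python
from unittest.mock import Mock


def chunk_via_api(post, text):
    """Hypothetical client function; `post` stands in for requests.post."""
    # The payload shape here is an assumption made for illustration.
    response = post("/v1/chunk/code", json={"text": text})
    if response.status_code != 200:
        raise ValueError(f"API request failed with status {response.status_code}")
    return response.json()


def test_api_error_raises():
    # Simulate a 5xx response without touching the network.
    fake_post = Mock(return_value=Mock(status_code=500))
    try:
        chunk_via_api(fake_post, "x = 1")
    except ValueError as error:
        assert "500" in str(error)
    else:
        raise AssertionError("expected ValueError for a 500 response")
    fake_post.assert_called_once()
```

Injecting the HTTP call keeps the test independent of CHONKIE_API_KEY and of the service being reachable; with the real client, `unittest.mock.patch` on the module's `requests.post` would probably play the same role.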
Merge Readiness
The pull request is a significant contribution, but there is a high-severity issue regarding the return type of the chunk method when handling batch inputs that should be addressed before merging. Additionally, considering the medium-severity feedback would further improve the robustness and usability of the CodeChunker.
Once the high-severity issue is resolved and other feedback points are considered, this PR will be in a much stronger position for merging. As a reviewer, I am not authorized to approve pull requests; please ensure further review and approval from authorized maintainers.
            + " If the issue persists, please contact support at support@chonkie.ai or raise an issue on GitHub."
        )

def chunk(self, text: Union[str, List[str]]) -> List[Dict]:
The current return type annotation -> List[Dict] appears to be incorrect when the input text parameter is a List[str] (for batch processing).
Your tests (specifically test_cloud_code_chunker_batch) correctly assert that for a list of input texts, the result is a list of lists of chunks (List[List[Dict]]). However, this method's signature and the cast on line 105 (result: List[Dict] = cast(List[Dict], response.json())) do not reflect this batch behavior.
This discrepancy can lead to type errors and confusion for users of the CodeChunker.
To address this, you could:
- Change the return type annotation to Union[List[Dict], List[List[Dict]]].
- Adjust the logic around line 105 to correctly cast or type response.json() based on whether the input text was a single string or a list of strings.
- For a more type-safe API, consider using typing.overload to define distinct signatures for single string input and list-of-strings input.
At a minimum, the return type annotation should be updated to reflect the possible List[List[Dict]] structure.
- def chunk(self, text: Union[str, List[str]]) -> List[Dict]:
+ def chunk(self, text: Union[str, List[str]]) -> Union[List[Dict], List[List[Dict]]]:
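The typing.overload option from the comment above might be sketched as follows. `CodeChunkerSketch` and `_chunk_one` are hypothetical names, the chunk dictionary keys are invented for illustration, and the API call is replaced by a local placeholder so the example runs offline.

```python
from typing import Dict, List, Union, overload


class CodeChunkerSketch:
    """Hypothetical stand-in for the cloud CodeChunker; makes no network calls."""

    @overload
    def chunk(self, text: str) -> List[Dict]: ...

    @overload
    def chunk(self, text: List[str]) -> List[List[Dict]]: ...

    def chunk(self, text: Union[str, List[str]]) -> Union[List[Dict], List[List[Dict]]]:
        # Single string in, flat list out; list of strings in, list of lists out.
        if isinstance(text, str):
            return self._chunk_one(text)
        return [self._chunk_one(item) for item in text]

    def _chunk_one(self, text: str) -> List[Dict]:
        # Placeholder for the API request: one chunk per non-blank input.
        # The dictionary keys are assumptions, not the API's actual schema.
        if not text.strip():
            return []
        return [{"text": text, "start_index": 0, "end_index": len(text)}]
```

With the two overloads, a type checker can infer List[Dict] for a call with a single string and List[List[Dict]] for a call with a list, so callers need no cast of their own.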
def __init__(
    self,
    tokenizer_or_token_counter: str = "gpt2",
The type hint Union[str, List] for tokenizer_or_token_counter is a bit ambiguous in the context of this cloud client.
- If List is intended to allow, for example, a list of pre-computed token IDs (List[int]) or a list of tokenizer names for some advanced API feature, the payload ("tokenizer_or_token_counter": self.tokenizer_or_token_counter) would send this list. Does the Chonkie API endpoint /v1/chunk/code actually support receiving a list for this parameter? The current tests only demonstrate usage with string values (e.g., "gpt2").
- If the API endpoint for this cloud chunker only accepts a string for tokenizer_or_token_counter, then including List in the Union might be misleading for users of this specific class.
Could you clarify the intended use and API support for List here? If it's only ever a string for this cloud chunker, str might be a more precise type hint. If List is supported, adding a test case for it would be beneficial.
response = requests.get(f"{self.BASE_URL}/")
if response.status_code != 200:
    raise ValueError(
        "Oh no! You caught Chonkie at a bad time. It seems to be down right now."
        + " Please try again in a short while."
        + " If the issue persists, please contact support at support@chonkie.ai or raise an issue on GitHub."
    )
The API health check (requests.get(f"{self.BASE_URL}/")) performed in the __init__ method could introduce some potential issues:
- Performance Impact: It adds network latency each time a CodeChunker object is instantiated.
- Usability Concern: If the API's base endpoint (/) is temporarily unresponsive, CodeChunker objects cannot be created. This could be problematic even if the user only intends to configure the object without immediately calling chunk().
- Resource Usage: Frequent instantiation could lead to unnecessary network traffic to the health check endpoint.
Have you considered alternatives, such as:
- Performing this check lazily, just before the first actual API call in the chunk() method?
- Making it an optional, explicit health check method that users can invoke if they need to verify connectivity?
This change could improve instantiation performance and make the class more resilient to transient network or API issues.
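Both alternatives raised above (a lazy check before the first request, and an explicit method) can be combined, as in this minimal sketch. `LazyHealthCheckClient`, `get_status`, and `check_health` are hypothetical names; `get_status` is an assumed hook that, in a real client, might wrap `requests.get(...).status_code`.

```python
from typing import Callable, List, Optional


class LazyHealthCheckClient:
    """Hypothetical client showing a deferred, explicit health check."""

    def __init__(self, get_status: Callable[[], int]) -> None:
        # No network traffic at construction time.
        self._get_status = get_status
        self._healthy: Optional[bool] = None  # unknown until first checked

    def check_health(self) -> bool:
        """Explicit check that callers can invoke whenever they want."""
        self._healthy = self._get_status() == 200
        return self._healthy

    def chunk(self, text: str) -> List[str]:
        # Lazy check: skipped once the API has been seen healthy,
        # retried on every call while it is unknown or down.
        if not self._healthy and not self.check_health():
            raise ValueError("API appears to be down; please try again shortly.")
        return [text]  # placeholder for the actual API call
```

Constructing the client is then free of latency and cannot fail because of a transient outage; the cost moves to the first chunk() call, which is where a network error would surface anyway.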
except Exception as error:
    raise ValueError(f"Error parsing the response: {error}") from error
Catching a generic Exception when parsing the JSON response is quite broad. This can make debugging more difficult as it might catch and obscure unrelated errors that are not specific to JSON decoding.
Would it be possible to catch a more specific exception here? For instance, requests.exceptions.JSONDecodeError (if response.json() from the requests library is used and can raise this) or json.JSONDecodeError would be more targeted to issues during the parsing of the JSON response.
- except Exception as error:
-     raise ValueError(f"Error parsing the response: {error}") from error
+ except requests.exceptions.JSONDecodeError as error:
+     raise ValueError(f"Error parsing the response: {error}") from error
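The narrower handler can be shown in isolation with the standard library variant named in the comment, json.JSONDecodeError. `parse_json_body` is a hypothetical helper, not part of the PR; it parses a raw body string so the sketch runs without a network connection.

```python
import json
from typing import Any


def parse_json_body(body: str) -> Any:
    """Parse a response body, translating only JSON decoding failures."""
    try:
        return json.loads(body)
    except json.JSONDecodeError as error:
        # Other exceptions (for example a TypeError from a bad argument)
        # propagate unchanged instead of being relabeled as parse errors.
        raise ValueError(f"Error parsing the response: {error}") from error
```

Because the original error is chained with `from error`, the decoding position and message remain available on the `__cause__` attribute of the raised ValueError.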
❌ 276 Tests Failed
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
- Introduced a new test suite for the JSONPorter class, covering initialization, export to JSON and JSONL formats, and handling of empty chunk lists.
- Validated chunk serialization, context inclusion, and indentation in exported files.
- Implemented tests for large chunk lists and Unicode content handling.
- Ensured proper error handling for file permission issues and support for Path objects.

… annotations
- Added return type annotations to all test functions for improved clarity and type checking.
- Updated the temp_dir fixture to specify its return type as a Generator.

- Restructured test cases into classes for better organization and clarity.
- Added tests for Model2Vec and SentenceTransformer embeddings, including actual embedding generation.
- Implemented provider prefix tests for OpenAI, Cohere, VoyageAI, and Jina embeddings.
- Enhanced error handling tests for invalid provider prefixes and model identifiers.
- Included tests for handling existing embeddings instances and custom embeddings objects.

- Restructured test cases into classes for improved organization and clarity.
- Added tests for initialization with explicit and environment API keys, including error handling for missing keys.
- Implemented tests for custom model initialization and tokenizer handling.
- Enhanced tests for embedding methods, including single and batch embeddings with mocked API responses.
- Validated similarity calculations and error handling for various edge cases.

- Restructured test cases into classes for improved organization and clarity.
- Added tests for initialization with default and custom models, including error handling for invalid models and missing API keys.
- Enhanced tests for embedding methods, including synchronous and asynchronous embedding with mocked API responses.
- Implemented tests for token counting, dimension properties, and similarity checks between embeddings.
- Validated handling of edge cases and error scenarios, including empty inputs and API errors.
This pull request introduces the CodeChunker class to the Chonkie Cloud API for chunking code, along with corresponding updates to the module's initialization files and a comprehensive set of tests. The most important changes include the implementation of the CodeChunker class, updates to module imports and exports, and the addition of extensive unit tests to ensure functionality.
New Feature: CodeChunker Implementation
- Added the CodeChunker class in src/chonkie/cloud/chunker/code.py, which provides functionality for chunking code using the Chonkie API. It supports configurable chunk sizes, tokenizers, languages, and return types. The class validates input parameters, handles API requests, and processes responses.
Updates to Module Initialization
- Updated src/chonkie/cloud/__init__.py and src/chonkie/cloud/chunker/__init__.py to include the CodeChunker class in imports and exports, ensuring it is accessible as part of the module's public API.
Unit Tests for CodeChunker
- Added tests/cloud/test_cloud_code_chunker.py to validate the functionality of the CodeChunker class. Tests cover initialization, parameter validation, chunking behavior for various programming languages, batch processing, adherence to chunk size limits, and continuity of chunk indices. Additionally, tests ensure compatibility with different tokenizers and handle edge cases like empty or whitespace-only inputs.
- Introduced CodeChunker to the cloud chunker module.