Feat: Create Cython functions for `split` and `merge` basic ops for chunking! #163
… if we can speed it up!
…utilize it - Introduced a new Cython module for optimized text splitting. - Updated `RecursiveChunker`, `SemanticChunker`, `SentenceChunker`, and `SlumberChunker` to use the new `split_text` function when available. - Enhanced fallback mechanisms for text splitting in case Cython is not available. - Added `.temp` directory to `.gitignore` and included `CLAUDE.md` file.
- Introduced a new Cython module for merging text splits, improving performance by approximately 50%. - Updated `RecursiveChunker` to utilize the optimized merge function when available, with a Python fallback. - Enhanced documentation for the merging process and added error handling for input validation.
…unker - Introduced `find_merge_indices` in the Cython module to enhance performance for merging token counts. - Updated `SentenceChunker` to utilize the new Cython function when available, with a fallback to the existing Python implementation. - Improved handling of cumulative token counts for better efficiency in chunking operations.
- Updated setup.py to include documentation and removed the token_chunker extension, which is no longer in use. - Simplified docstring formatting in RecursiveChunker for clarity.
- Deleted `test_cython_token_chunker.py`, which contained tests for the Cython token chunking functionality that is no longer in use. - This cleanup helps streamline the codebase by removing obsolete tests.
…nd EmbeddingsRegistry. Introduced provider alias support, improved error handling, and streamlined model registration methods for better clarity and maintainability.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…k for default SentenceTransformerEmbeddings in case of registry lookup failure.
…_model for jina-embeddings-v2 types, enhancing consistency in model registration.
…anging 'type' to 'type_alias' for better readability in the embedding registration process.
…ppress type checking errors
…the imports and __all__ list for improved functionality.
…various chunkers. Removed API_KEY from class variable and enhanced exception messages for better clarity. Added comprehensive tests for LateChunker, NeuralChunker, SDPMChunker, and SlumberChunker to ensure robust functionality and error handling.
…for API connectivity issues and improving error messages for better user guidance. Ensure clarity in API key requirements and response handling.
…r, SentenceChunker, and TokenChunker for improved security and consistency across chunker implementations.
…ity and consistency across chunker implementations.
… API connectivity issues and invalid responses, enhancing user experience and support contact information.
- Introduced tests for edge cases, including empty text, special characters, and whitespace handling. - Implemented error handling tests for invalid tokens in both character and word tokenizers. - Verified consistency across encoding, decoding, and token counting operations. - Added tests for batch operations and error propagation. - Enhanced coverage for tokenizer initialization and backend detection accuracy.
- Introduced a new test suite for the SlumberChunker, covering initialization, chunking functionality, and edge cases. - Implemented tests for various text splitting methods, including whitespace and delimiter-based approaches. - Validated chunk properties and ensured proper handling of different input scenarios, including empty text and large documents. - Enhanced test coverage for prompt generation and genie interaction, ensuring robust functionality of the SlumberChunker.
- Renamed `is_available` method to `_is_available` in all embeddings and refinery classes for consistency and to indicate that these methods are intended for internal use. - Updated corresponding calls in the implementations to reflect the new method name. - Adjusted tests to verify the functionality of the renamed method.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
- Introduced a new test suite for the JSONPorter class, covering initialization, export to JSON and JSONL formats, and handling of empty chunk lists. - Validated chunk serialization, context inclusion, and indentation in exported files. - Implemented tests for large chunk lists and Unicode content handling. - Ensured proper error handling for file permission issues and support for Path objects.
… annotations - Added return type annotations to all test functions for improved clarity and type checking. - Updated the temp_dir fixture to specify its return type as a Generator.
- Restructured test cases into classes for better organization and clarity. - Added tests for Model2Vec and SentenceTransformer embeddings, including actual embedding generation. - Implemented provider prefix tests for OpenAI, Cohere, VoyageAI, and Jina embeddings. - Enhanced error handling tests for invalid provider prefixes and model identifiers. - Included tests for handling existing embeddings instances and custom embeddings objects.
- Restructured test cases into classes for improved organization and clarity. - Added tests for initialization with explicit and environment API keys, including error handling for missing keys. - Implemented tests for custom model initialization and tokenizer handling. - Enhanced tests for embedding methods, including single and batch embeddings with mocked API responses. - Validated similarity calculations and error handling for various edge cases.
- Restructured test cases into classes for improved organization and clarity. - Added tests for initialization with default and custom models, including error handling for invalid models and missing API keys. - Enhanced tests for embedding methods, including synchronous and asynchronous embedding with mocked API responses. - Implemented tests for token counting, dimension properties, and similarity checks between embeddings. - Validated handling of edge cases and error scenarios, including empty inputs and API errors.
- Introduced mocking for API responses to enhance test reliability and avoid external dependencies. - Updated test cases for CodeChunker, LateChunker, RecursiveChunker, SDPMChunker, SemanticChunker, and SlumberChunker to include API key handling. - Added comprehensive tests for various scenarios, including single and batch text processing, empty inputs, and return type validations. - Improved error handling tests for invalid configurations and ensured consistent behavior across chunkers. - Enhanced readability and organization of test cases for better maintainability.
- Introduced mocking for Cohere API dependencies to avoid real API calls during tests. - Updated test cases to use a test API key, ensuring consistent behavior without requiring environment variables. - Added a new test for real API integration, marked as disabled for CI, to validate functionality when the API key is available. - Improved assertions in similarity tests to accommodate a broader range of expected values.
- Removed unused imports from test files to enhance clarity and maintainability. - Updated error handling in tokenizer tests to ensure proper exception raising for invalid model names. - Streamlined import statements in genie tests for better organization.
- Added .temp/* to the .gitignore file to prevent temporary files from being tracked in the repository.
- Simplified import handling in tokenizer, embeddings, and handshake modules by removing try-except blocks for optional imports. - Updated type hints in various classes to improve code clarity and maintainability. - Ensured consistent use of type annotations in method signatures for better type checking.
- Updated type hints in `mock_tokenizer` and `mock_process_batch` functions to improve code clarity and type checking. - Ensured consistent use of type annotations for better maintainability in test cases.
- Introduced GeminiEmbeddings to the embeddings module. - Updated import statements and __all__ exports to include GeminiEmbeddings. - Registered GeminiEmbeddings in the EmbeddingsRegistry with associated patterns and models for enhanced functionality.
- Added Gemini embedding model to the documentation, including installation instructions and usage examples. - Updated README to reflect the addition of Gemini as a supported embedding provider. - Enhanced code examples to demonstrate the use of the new `GeminiEmbeddings` class and its methods.
- Included a new section in the README for using Google's Gemini embedding models with Chonkie's RecursiveChunker. - Provided a link to a tutorial demonstrating high-quality text embeddings and similarity analysis with Gemini embeddings.
- Updated type hints for mock API response and test functions to improve code clarity and type checking. - Ensured consistent use of type annotations across all test functions for better maintainability.
…okenization and token count operations. Added methods for caching results to reduce redundant calculations, enhancing efficiency during text refinement. Updated tests to validate recursive overlap functionality and ensure context handling is robust across various scenarios.
…pdated logic to compute context size dynamically based on chunk token counts, ensuring accurate results across different chunk sets. Added tests to verify correctness of context size calculations when reusing the refinery with varying input.
…ter across methods. Updated methods to accept effective_context_size for improved flexibility in chunk processing. Enhanced tests to ensure preservation of float context size during refinement with varying chunk sets.
…okenization and token count operations. Added methods to manage cache, including cache_info and clear_cache, to optimize repeated processing of similar text. Updated docstrings for clarity on caching behavior.
Hello @chonknick, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
This pull request introduces significant enhancements to the Chonkie library, primarily focusing on performance improvements for core chunking operations through Cython extensions and expanding embedding model support to include Google's Gemini models. It also includes various documentation updates, new examples, and minor repository maintenance.
Highlights
- Performance Optimization (Cython): Introduced Cython implementations for the fundamental `split` and `merge` operations used in chunking, aiming for substantial performance gains (estimated ~50% faster for merge). This involves adding Cython build configuration and integrating the compiled extensions into the `RecursiveChunker`, `SentenceChunker`, and `SlumberChunker` with Python fallbacks.
- Gemini Embeddings Integration: Added a new `GeminiEmbeddings` class to support Google's Gemini embedding models via the GenAI API. The `AutoEmbeddings` class has been updated to recognize Gemini models through direct names and a new `gemini://` provider syntax, enhancing flexibility in model loading.
- Documentation & Examples: Updated `README.md` and `DOCS.md` to reflect Gemini support and clarify `AutoEmbeddings` usage with provider syntax. A new cookbook example (`cookbook/examples/gemini_embeddings_with_recursive_chunking.py`) demonstrates using Gemini embeddings with the `RecursiveChunker`.
- Cloud Chunker Expansion: Added new Cloud Chunker classes (`CodeChunker`, `NeuralChunker`, `SlumberChunker`) to interface with the Chonkie API for these specific chunking strategies.
- Code & Type Hinting Improvements: Refactored type hints across several files (e.g., `src/chonkie/chunker/token.py`, `src/chonkie/types/base.py`) for better clarity and correctness. Removed unnecessary `try...except ImportError` blocks within `TYPE_CHECKING` guards.
- Refinery Enhancements: Improved the `OverlapRefinery` with LRU caching for tokenization/token counting and fixed index handling when merging context. Also addressed recursive depth issues.
- Embeddings Registry Refactor: Overhauled the `EmbeddingsRegistry` to use separate dictionaries for models, providers, patterns, and types, improving the flexibility and clarity of how embeddings are registered and looked up, especially for `AutoEmbeddings`.
- Testing Updates: Added comprehensive test suites for the new GeminiEmbeddings, GeminiGenie, Cloud Chunkers, Embeddings Registry, Overlap Refinery, and various tokenizer components. Updated existing tests to use mocking where appropriate to reduce external dependencies in CI.
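The "compiled extension with a Python fallback" pattern from the first highlight can be sketched roughly as below. This is an illustration under stated assumptions, not the PR's code: the import path is inferred from the changelog entry for `merge.pyx`, and `find_merge_indices_fallback` (name, signature, and return convention) is a hypothetical pure-Python stand-in that uses a binary search over cumulative token counts.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import List, Sequence

try:
    # Assumed import path, inferred from src/chonkie/chunker/c_extensions/merge.pyx.
    from chonkie.chunker.c_extensions.merge import find_merge_indices as _cython_find
    CYTHON_AVAILABLE = True
except ImportError:
    _cython_find = None
    CYTHON_AVAILABLE = False


def find_merge_indices_fallback(token_counts: Sequence[int], chunk_size: int) -> List[int]:
    """Hypothetical fallback: return the exclusive end index of each merged group.

    Splits are grouped greedily so that each group's token total stays within
    chunk_size; a single split larger than chunk_size forms its own group.
    """
    cumulative = [0] + list(accumulate(token_counts))
    n = len(token_counts)
    indices: List[int] = []
    start = 0
    while start < n:
        target = cumulative[start] + chunk_size
        # Largest end with cumulative[end] <= target, searched in (start, n].
        end = bisect_right(cumulative, target, start + 1, n + 1) - 1
        if end <= start:
            end = start + 1  # always make progress, even for oversized splits
        indices.append(end)
        start = end
    return indices


def find_merge_indices(token_counts: Sequence[int], chunk_size: int) -> List[int]:
    """Dispatch to the compiled function when present (its signature is assumed)."""
    if CYTHON_AVAILABLE:
        return list(_cython_find(token_counts, chunk_size))
    return find_merge_indices_fallback(token_counts, chunk_size)
```

With token counts `[3, 4, 5, 2, 6]` and a chunk size of 8, the fallback groups the splits as `[3, 4]`, `[5, 2]`, `[6]`. Keeping the fallback as a separate function makes it easy to test the pure-Python path even on machines where the extension compiled successfully.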
Changelog
- .gitignore
- Added ignore rules for compiled Cython files (`*.so`, `*.c`)
- Added ignore rules for temporary directories (`/.temp/*`, `.temp/*`) and `CLAUDE.md`
- CONTRIBUTING.md
- Updated repository clone URL to the official `chonkie-inc/chonkie.git`
- DOCS.md
- Added `gemini` to the table of optional installation features (line 80)
- Added `GeminiEmbeddings` to the list of available embedding models (line 740)
- Updated `AutoEmbeddings` examples to use `get_embedding` and demonstrate provider syntax (`model2vec://`, `st://`) (lines 750-756)
- Updated `__call__` method signature description for `BaseEmbeddings` (line 771)
- Updated `OpenAIGenie` documentation formatting with installation instructions and class definition details (lines 916-947)
- README.md
- Updated the count of supported embedding models from 6+ to 7+ (line 150)
- Added `GeminiEmbeddings` to the table of supported embedding models (line 160)
- cookbook/README.md
- Added a link to the new Gemini Embeddings with Recursive Chunking example (line 16)
- Added RAGHub to the list of community integrations (line 46)
- cookbook/examples/gemini_embeddings_with_recursive_chunking.py
- Added a new example script demonstrating the use of Gemini embeddings with RecursiveChunker, including initialization, chunking, embedding, similarity computation, and token analysis (lines 1-190)
- pyproject.toml
- Added `cython>=3.0.0` to `build-system.requires` (line 2)
- Bumped project version to `1.0.8` (line 7)
- Changed `license` field to a string format (line 11)
- Removed duplicate `License :: OSI Approved :: MIT License` classifier (line 34)
- Added `cython>=3.0.0` to `project.optional-dependencies.dev` (line 115)
- Added `[tool.setuptools.package-data]` configuration for Cython files (`*.pyx`, `*.pxd`) (lines 140-141)
- Added `[tool.setuptools.dynamic]` configuration for version attribute (lines 143-144)
- setup.py
- Added a new setup script to configure and build Cython extensions (`split.pyx`, `merge.pyx`) using `setuptools` and `Cython.Build.cythonize` (lines 1-22)
- src/chonkie/__init__.py
- Imported `GeminiEmbeddings` (line 19)
- Bumped `__version__` to `1.0.8` (line 68)
- Added `GeminiEmbeddings` to the `__all__` list (line 122)
- src/chonkie/chomp/pipeline.py
- Added return type hint `-> None` to `__init__` method (line 6)
- src/chonkie/chunker/c_extensions/merge.pyx
- Added Cython implementation for `_merge_splits` and `find_merge_indices` functions, including C array usage and inline binary search for performance optimization (lines 1-250)
- src/chonkie/chunker/c_extensions/split.pyx
- Added Cython implementation for `split_text` function, handling delimiters, whitespace, and merging short segments (lines 1-141)
- src/chonkie/chunker/code.py
- Removed `try...except ImportError` block for `tree_sitter` and `tree_sitter_language_pack` within `TYPE_CHECKING` (lines 17-24)
- Set `_use_multiprocessing` attribute to `False` (line 83)
- src/chonkie/chunker/recursive.py
- Imported Cython `split_text` and `_merge_splits` with fallback logic (lines 19-30)
- Modified `_split_text` to use Cython `split_text` when available for delimiter-based splitting (lines 141-150)
- Renamed original `_merge_splits` to `_merge_splits_fallback` (line 252)
- Updated `_merge_splits` to use Cython `_merge_splits_cython` when available, falling back to `_merge_splits_fallback` (lines 236-250)
- src/chonkie/chunker/semantic.py
- Imported Cython `split_text` with fallback logic (lines 16-20)
- Modified `_split_sentences` to use Cython `split_text` when available (lines 237-246)
- src/chonkie/chunker/sentence.py
- Imported Cython `find_merge_indices` and `split_text` with fallback logic (lines 21-32)
- Modified `_split_text` to use Cython `split_text` when available (lines 180-189)
- Removed comment about adding 1 token for spaces in `chunk` method (line 315)
- Modified `chunk` method to use Cython `find_merge_indices` when available for finding split points (lines 328-339)
- src/chonkie/chunker/slumber.py
- Imported Cython `split_text` with fallback logic (lines 14-18)
- Modified `_split_text` to use Cython `split_text` when available (lines 111-136)
- Added `_split_text_fallback` method containing the original Python splitting logic (lines 141-168)
- src/chonkie/chunker/token.py
- Updated type hint for `chunk_texts` parameter in `_create_chunks` from `List[str]` to `Sequence[str]` (line 66)
- Updated type hint for `tokens` parameter in `_token_group_generator` from `List[int]` to `Sequence[int]` (line 106)
- Converted token slice to list in `_token_group_generator` yield statement (line 111)
- Added type hint `list` to `result` variable in `_process_batch` (line 153)
- Added type hint `list` to `chunks` variable in `chunk_batch` (line 199)
- src/chonkie/cloud/__init__.py
- Imported `CodeChunker`, `NeuralChunker`, and `SlumberChunker` (lines 5, 7, 12)
- Added `CodeChunker`, `NeuralChunker`, and `SlumberChunker` to the `__all__` list (lines 24, 25, 26)
- src/chonkie/cloud/chunker/__init__.py
- Imported `CodeChunker`, `NeuralChunker`, and `SlumberChunker` (lines 4, 6, 11)
- Added `CodeChunker`, `NeuralChunker`, and `SlumberChunker` to the `__all__` list (lines 22, 23, 24)
- src/chonkie/cloud/chunker/code.py
- Added a new `CodeChunker` class for interacting with the Chonkie Cloud API for code chunking (lines 1-114)
- src/chonkie/cloud/chunker/neural.py
- Added a new `NeuralChunker` class for interacting with the Chonkie Cloud API for neural chunking (lines 1-102)
- src/chonkie/cloud/chunker/recursive.py
- Removed `API_KEY` class attribute (line 18)
- src/chonkie/cloud/chunker/semantic.py
- Removed `API_KEY` class attribute (line 16)
- src/chonkie/cloud/chunker/sentence.py
- Removed `API_KEY` class attribute (line 16)
- src/chonkie/cloud/chunker/slumber.py
- Added a new `SlumberChunker` class for interacting with the Chonkie Cloud API for slumber chunking (lines 1-154)
- src/chonkie/cloud/chunker/token.py
- Removed `API_KEY` class attribute (line 16)
- src/chonkie/embeddings/__init__.py
- Imported `GeminiEmbeddings` (line 6)
- Added `GeminiEmbeddings` to the `__all__` list (line 20)
- src/chonkie/embeddings/auto.py
- Refactored `get_embeddings` method to prioritize `provider://model` syntax lookup, then registry match, then fallback to SentenceTransformer (lines 69-107)
- Improved warning messages during fallback attempts (lines 87-98)
- src/chonkie/embeddings/base.py
- Renamed `is_available` method to `_is_available` to indicate it's an internal method (line 112)
- Updated `_import_dependencies` to call `self._is_available()` (line 72)
- src/chonkie/embeddings/cohere.py
- Renamed `is_available` class method to `_is_available` (line 231)
- Updated `_import_dependencies` to call `cls._is_available()` (line 243)
- src/chonkie/embeddings/gemini.py
- Added a new `GeminiEmbeddings` class for integrating with Google's GenAI API for embeddings (lines 1-238)
- src/chonkie/embeddings/jina.py
- Removed `try...except ImportError` blocks within `TYPE_CHECKING` guards (lines 11-16)
- src/chonkie/embeddings/model2vec.py
- Removed `try...except ImportError` blocks within `TYPE_CHECKING` guards (lines 9-16)
- Renamed `is_available` class method to `_is_available` (line 72)
- Updated `_import_dependencies` to call `cls._is_available()` (line 83)
- Updated `__repr__` to use `model` attribute instead of `model_name_or_path` (line 94)
- src/chonkie/embeddings/openai.py
- Removed `try...except ImportError` blocks within `TYPE_CHECKING` guards (lines 11-16)
- Renamed `is_available` method to `_is_available` (line 191)
- Updated `_import_dependencies` to call `self._is_available()` (line 206)
- src/chonkie/embeddings/registry.py
- Refactored the registry structure using separate dictionaries for models, providers, patterns, and types (lines 20-23)
- Renamed `register` method to `register_model` (line 26)
- Added `register_provider`, `register_pattern`, and `register_types` methods (lines 45-91)
- Added `get_provider` method (lines 95-97)
- Updated `match` logic to prioritize provider prefix, then model name, then pattern (lines 120-137)
- Updated `wrap` method to use the type registry (lines 161-164)
- Updated existing embedding registrations to use the new registration methods (lines 171-227)
- Added registration for Gemini embeddings (lines 229-233)
- src/chonkie/embeddings/sentence_transformer.py
- Renamed `is_available` class method to `_is_available` (line 149)
- Updated `_import_dependencies` to call `cls._is_available()` (line 163)
- src/chonkie/embeddings/voyageai.py
- Removed `try...except ImportError` blocks within `TYPE_CHECKING` guards (lines 12-20)
- Renamed `is_available` method to `_is_available` (line 277)
- Updated `_import_dependencies` to call `self._is_available()` (line 285)
- src/chonkie/friends/handshakes/chroma.py
- Removed `try...except ImportError` blocks within `TYPE_CHECKING` guards (lines 15-20)
- src/chonkie/friends/handshakes/qdrant.py
- Removed `try...except ImportError` blocks within `TYPE_CHECKING` guards (lines 24-29)
- src/chonkie/friends/handshakes/turbopuffer.py
- Removed `try...except ImportError` blocks within `TYPE_CHECKING` guards (lines 15-18)
- src/chonkie/genie/gemini.py
- Added a new `GeminiGenie` class for integrating with Google's GenAI API for text and JSON generation (lines 1-238)
- src/chonkie/genie/openai.py
- Removed `try...except ImportError` blocks within `TYPE_CHECKING` guards (lines 9-17)
- src/chonkie/refinery/base.py
- Renamed `is_available` abstract method to `_is_available` (line 13)
- src/chonkie/refinery/embedding.py
- Updated `_is_available` method to call `self.embedding_model._is_available()` (line 41)
- src/chonkie/refinery/overlap.py
- Added LRU caching (`lru_cache`) for `_get_tokens_impl` and `_count_tokens_impl` methods (lines 78-82)
- Added `_get_tokens_impl` and `_count_tokens_impl` methods for caching (lines 89-95)
- Added `clear_cache` and `cache_info` methods (lines 97-111)
- Modified `_split_text` to accept `effective_context_size` and use it for token-based splitting (lines 113, 133-134)
- Added `_get_token_counts_cached` method to use cached token counting (lines 141-143)
- Modified `_group_splits` to accept `effective_context_size` and use it in the token count check (lines 145, 160)
- Modified `_prefix_overlap_token` to accept `effective_context_size` and use it for token slicing (lines 168, 185, 189)
- Modified `_recursive_overlap` to accept `effective_context_size` and pass it recursively (lines 191, 215, 228)
- Modified `_prefix_overlap_recursive` to accept `effective_context_size` and pass it (lines 238, 252)
- Modified `_get_prefix_overlap_context` to accept `effective_context_size` and pass it to helper methods (lines 254, 264, 266)
- Modified `_refine_prefix` to accept `effective_context_size` and pass it to `_get_prefix_overlap_context` (lines 270, 287)
- Removed index adjustment logic when merging context in `_refine_prefix` (lines 295-298)
- Updated token count update in `_refine_prefix` to use cached token counting (lines 300-302)
- Modified `_suffix_overlap_token` to accept `effective_context_size` and use it for token slicing (lines 307, 324, 328)
- Modified `_suffix_overlap_recursive` to accept `effective_context_size` and pass it (lines 330, 344)
- Modified `_get_suffix_overlap_context` to accept `effective_context_size` and pass it to helper methods (lines 346, 356, 358)
- Modified `_refine_suffix` to accept `effective_context_size` and pass it to `_get_suffix_overlap_context` (lines 362, 379)
- Removed index adjustment logic when merging context in `_refine_suffix` (lines 387-390)
- Updated token count update in `_refine_suffix` to use cached token counting (lines 391-393)
- Modified `_get_overlap_context_size` to calculate and return the effective size without modifying `self.context_size` (lines 405-408)
- Modified `refine` to get `effective_context_size` and pass it to `_refine_prefix` or `_refine_suffix` (lines 435, 439, 441)
- src/chonkie/tokenizer.py
- Removed `try...except ImportError` blocks within `TYPE_CHECKING` guards (lines 11-26)
- Added `defaulttoken2id` method to `BaseTokenizer` for pickling compatibility (lines 28-34)
- src/chonkie/types/base.py
- Added return type hint `-> None` to `__post_init__` method (line 24)
- Added return type hint `-> int` to `__len__` method (line 41)
- Added return type hint `-> Iterator[str]` to `__iter__` method (line 104)
- Added return type hint `-> str` to `__getitem__` method (line 108)
- src/chonkie/types/code.py
- Removed `try...except ImportError` block within `TYPE_CHECKING` guards (lines 9-12)
- src/chonkie/types/sentence.py
- Added return type hint `-> None` to `__post_init__` method (line 26)
- tests/chunkers/test_neural_chunker.py
- Added a new test file with comprehensive tests for the NeuralChunker class (lines 1-444)
- tests/chunkers/test_sdpm_chunker.py
- Updated test file with more comprehensive tests for SDPMChunker, including initialization, basic functionality, internal methods, edge cases, representation, parameter variations, recipe feature, and batch processing (lines 1-529)
- Added fixtures for multi-topic text and short text (lines 23-43)
- Removed tests requiring specific API keys (OpenAI, Cohere) by focusing on mocked embeddings (lines 3-8, 35-60, 124-143)
- tests/chunkers/test_semantic_chunker.py
- Added comprehensive tests for SemanticChunker, including parameter validation, mode configuration, threshold types, internal methods, threshold calculation, and edge cases (lines 399-850)
- tests/chunkers/test_slumber_chunker.py
- Added a new test file with comprehensive tests for the SlumberChunker class, including initialization, internal methods, chunking, edge cases, prompt generation, representation, and integration (lines 1-627)
- tests/cloud/test_cloud_code_chunker.py
- Added a new test file with comprehensive tests for the Cloud Code Chunker (lines 1-517)
- tests/cloud/test_cloud_late_chunker.py
- Added a new test file with comprehensive tests for the Cloud Late Chunker (lines 1-225)
- tests/cloud/test_cloud_neural_chunker.py
- Added a new test file with comprehensive tests for the Cloud Neural Chunker (lines 1-222)
- tests/cloud/test_cloud_recursive_chunker.py
- Updated test file with tests for Cloud Recursive Chunker using mocking (lines 12-57, 93-160)
- tests/cloud/test_cloud_sdpm_chunker.py
- Updated test file with tests for Cloud SDPM Chunker using mocking (lines 11-41, 129-210)
- tests/cloud/test_cloud_slumber_chunker.py
- Added a new test file with comprehensive tests for the Cloud Slumber Chunker (lines 1-333)
- tests/embeddings/test_auto_embeddings.py
- Added more comprehensive tests for AutoEmbeddings, including provider prefixes, different input types, and error handling (lines 18-166)
- tests/embeddings/test_cohere_embeddings.py
- Added mocking for Cohere API calls and tokenizer download to enable tests without API key (lines 14-46)
- Added more comprehensive tests for CohereEmbeddings (lines 71-163)
- tests/embeddings/test_embeddings_registry.py
- Added a new test file with comprehensive tests for the EmbeddingsRegistry class (lines 1-339)
- tests/embeddings/test_gemini_embeddings.py
- Added a new test file with comprehensive tests for GeminiEmbeddings, including mocking and real API tests (if key available) (lines 1-344)
- tests/embeddings/test_jina_embeddings.py
- Added comprehensive tests for JinaEmbeddings, including mocking and real API tests (if key available) (lines 15-606)
- tests/embeddings/test_model2vec_embeddings.py
- Updated test to use `_is_available` method (line 84)
- tests/embeddings/test_openai_embeddings.py
- Updated test to use `_is_available` method (line 129)
- tests/embeddings/test_sentence_transformer_embeddings.py
- Updated test to use `_is_available` method (line 112)
- tests/embeddings/test_voyageai_embeddings.py
- Added comprehensive tests for VoyageAIEmbeddings, including mocking and real API tests (if key available) (lines 14-594)
- tests/genie/test_base_genie.py
- Added a new test file with comprehensive tests for the BaseGenie abstract class and its default batch implementations (lines 1-371)
- tests/genie/test_gemini_genie.py
- Added a new test file with comprehensive tests for the GeminiGenie class (lines 1-246)
- tests/genie/test_openai_genie.py
- Added a new test file with comprehensive tests for the OpenAIGenie class (lines 1-201)
- tests/refinery/__init__.py
- Added an empty `__init__.py` file to the `tests/refinery` directory (line 1)
- tests/refinery/test_embedding_refinery.py
- Added a new test file with comprehensive tests for the EmbeddingsRefinery, including mocking (lines 1-350)
- tests/refinery/test_overlap_refinery.py
- Added a new test file with comprehensive tests for the OverlapRefinery, including caching, recursive mode, and edge cases (lines 1-785)
- tests/test_tokenizer.py
- Added comprehensive tests for CharacterTokenizer, WordTokenizer, and the unified Tokenizer class, including edge cases, error handling, and backend-specific paths (lines 512-1081)
- tests/test_viz.py
- Added a new test file with comprehensive tests for the Visualizer class, including print, save, theme handling, and edge cases (lines 1-737)
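The LRU caching listed for `src/chonkie/refinery/overlap.py` can be illustrated with a small standalone sketch. This is not the `OverlapRefinery` code: `CachedTokenCounter` is a hypothetical helper, and only the method names `_get_tokens_impl`, `_count_tokens_impl`, `clear_cache`, and `cache_info` are borrowed from the changelog; their bodies and signatures here are assumptions.

```python
from functools import lru_cache
from typing import Any, Callable, Dict, Sequence, Tuple


class CachedTokenCounter:
    """Hypothetical helper that memoizes tokenization and token counts."""

    def __init__(self, tokenize: Callable[[str], Sequence[Any]], maxsize: int = 8192) -> None:
        self._tokenize = tokenize
        # Wrap the bound methods so every instance owns its own caches.
        self._get_tokens = lru_cache(maxsize=maxsize)(self._get_tokens_impl)
        self._count_tokens = lru_cache(maxsize=maxsize)(self._count_tokens_impl)

    def _get_tokens_impl(self, text: str) -> Tuple[Any, ...]:
        # Return a tuple so callers cannot mutate the cached value.
        return tuple(self._tokenize(text))

    def _count_tokens_impl(self, text: str) -> int:
        return len(self._get_tokens(text))

    def count(self, text: str) -> int:
        return self._count_tokens(text)

    def clear_cache(self) -> None:
        self._get_tokens.cache_clear()
        self._count_tokens.cache_clear()

    def cache_info(self) -> Dict[str, Any]:
        return {
            "tokens": self._get_tokens.cache_info(),
            "counts": self._count_tokens.cache_info(),
        }


counter = CachedTokenCounter(str.split)  # whitespace tokenizer as a stand-in
first = counter.count("split and merge fast")
second = counter.count("split and merge fast")  # served from the cache
info = counter.cache_info()
```

Creating the caches per instance, rather than decorating the methods at class level, keeps results from different tokenizers apart and lets `clear_cache` reset one object without affecting others.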
Code split and merged fast,
Cython speeds the task at last.
Gemini joins the crew,
Chunking, embedding anew,
A hippo's work, built to last.
Footnotes
- Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.
Code Review
This pull request introduces significant enhancements, primarily by adding Cython-based performance improvements for chunking operations (`split` and `merge`) and integrating support for Google's Gemini embedding models. The refactoring of `AutoEmbeddings` and `EmbeddingsRegistry` improves model loading logic and clarity. The documentation has been commendably updated to reflect these new features and API changes. The addition of new example files and comprehensive test suites is also a great step towards ensuring quality and usability.
Overall, the changes are well-implemented and address the goals outlined. There are a few areas where clarification or minor adjustments could be beneficial, detailed in the specific comments.
Summary of Findings
- Cython Performance Enhancements: The introduction of Cython extensions for `split` and `merge` operations is a key highlight and should provide significant performance benefits for chunking.
- Gemini Embeddings Support: Support for Google's Gemini embedding models has been added, expanding the range of available embedding options.
- Embeddings Loading Refactor: The `AutoEmbeddings` and `EmbeddingsRegistry` classes have been substantially refactored, improving the clarity and flexibility of loading and managing embedding models, including new provider alias syntax.
- API Changes: There are API changes, notably `AutoEmbeddings.get_embedding()` instead of direct instantiation, which should be clearly communicated to users.
- OverlapRefinery Improvements: The `OverlapRefinery` has been improved with LRU caching for tokenization and more robust handling of float-based `context_size`.
- Behavioral Change in CodeChunker: The default for `_use_multiprocessing` in `CodeChunker` was changed to `False`. The impact and reasoning for this should be clear.
- Test Coverage: A significant number of new tests have been added, which is excellent for maintainability and reliability.
- Documentation Updates: `DOCS.md` and `README.md` have been updated to reflect new features and API changes, which is crucial for users.
- Minor: .gitignore entry: The `.gitignore` file includes a very specific compiled file path (`src/chonkie/chunker/c_extensions/token_chunker.cpython-310-darwin.so`). While `*.so` covers this, specific paths for local build artifacts are usually not needed if generic patterns suffice. This is a low severity finding and was not added as a comment due to review settings.
- Minor: Cookbook Example Date: The new Gemini example file (`cookbook/examples/gemini_embeddings_with_recursive_chunking.py`) has a date of `2025`. This is a low severity finding and was not added as a comment due to review settings.
- Minor: pyproject.toml License Classifier: In `pyproject.toml`, the `license` field was changed to `"MIT"` (which is fine), but the corresponding classifier `"License :: OSI Approved :: MIT License"` was removed. Usually, these are kept together. This is a low severity finding and was not added as a comment due to review settings.
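To illustrate the LRU caching noted above for `OverlapRefinery`, here is a minimal sketch using `functools.lru_cache`. The tokenizer function `count_tokens_uncached` is a hypothetical stand-in and not the PR's code, and `maxsize=8192` is the figure reported in the PR summary elsewhere in this thread.

```python
from functools import lru_cache


def count_tokens_uncached(text: str) -> int:
    """Hypothetical stand-in tokenizer: counts whitespace-separated words."""
    return len(text.split())


@lru_cache(maxsize=8192)  # size as reported in the PR summary; otherwise an assumption
def count_tokens(text: str) -> int:
    """Cached wrapper: repeated inputs skip re-tokenization."""
    return count_tokens_uncached(text)


# Overlapping chunks often repeat the same text, which is where caching helps.
chunks = ["the quick brown fox", "jumps over", "the quick brown fox"]
counts = [count_tokens(chunk) for chunk in chunks]
info = count_tokens.cache_info()
print(counts, info.hits, info.misses)
```

The cache key is the string itself, so this only pays off when identical text is tokenized more than once, and the arguments must be hashable.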
Merge Readiness
This pull request introduces substantial improvements and new features. The Cython extensions are a great step for performance, and Gemini support is a valuable addition. The refactoring in the embeddings system also enhances maintainability.
Before merging, I recommend addressing the `medium` severity comments, particularly clarifying the `_use_multiprocessing` change in `CodeChunker`, confirming the resolution of `SlumberChunker` issues, and considering the Gemini API batching question for potential future optimization. The removal of `EmbeddingsRegistry.list_available` might also be worth a quick discussion on whether an alternative is needed.
Once these points are clarified or addressed, the PR should be in good shape. As I am an AI, I am not authorized to approve pull requests; please ensure other reviewers approve this code before merging.
❌ 36 Tests Failed:
PR Summary
This PR introduces significant performance optimizations through Cython extensions for text chunking operations, along with support for Google's Gemini embedding models.
- Added Cython extensions `split.pyx` and `merge.pyx` for optimized text chunking operations, claiming 48-50% performance improvement
- Implemented `GeminiEmbeddings` class with comprehensive retry logic, token counting, and support for the latest Gemini embedding models
- Added LRU caching (maxsize=8192) in `OverlapRefinery` for tokenization operations to improve performance
- Introduced provider-based syntax for loading embeddings (e.g., `gemini://`) in `AutoEmbeddings` with improved error handling
- Comprehensive test coverage added for new features including mocked API responses and real integration tests
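As a rough illustration of how a provider-based identifier such as `gemini://...` could be separated into a provider and a model name, here is a minimal sketch. The function `parse_model_identifier` and the model names are hypothetical; this is not the `AutoEmbeddings` implementation, whose actual parsing rules are not shown in this thread.

```python
from typing import Optional, Tuple


def parse_model_identifier(identifier: str) -> Tuple[Optional[str], str]:
    """Split 'provider://model' into (provider, model).

    Hypothetical helper: returns (None, identifier) when no provider
    prefix is present, so plain model names pass through unchanged.
    """
    provider, sep, model = identifier.partition("://")
    if sep and provider and model:
        return provider.lower(), model
    return None, identifier


# Placeholder model names, used only for illustration.
with_prefix = parse_model_identifier("gemini://example-embedding-model")
without_prefix = parse_model_identifier("example-embedding-model")
print(with_prefix, without_prefix)
```

A real loader would then look the provider up in a registry and raise a clear error for unknown providers, which is presumably where the improved error handling mentioned above comes in.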
81 file(s) reviewed, 4 comment(s)
.gitignore
Outdated
notebooks/*
src/chonkie/chunker/c_extensions/token_chunker.cpython-310-darwin.so
CLAUDE.md
.temp/*
style: Duplicate entries for `.temp/*` - remove one of them to avoid confusion
Before:
/.temp/*
notebooks/*
src/chonkie/chunker/c_extensions/token_chunker.cpython-310-darwin.so
CLAUDE.md
.temp/*
After:
/.temp/*
notebooks/*
src/chonkie/chunker/c_extensions/token_chunker.cpython-310-darwin.so
CLAUDE.md
.gitignore
Outdated
/.temp/*
notebooks/*
src/chonkie/chunker/c_extensions/token_chunker.cpython-310-darwin.so
style: This platform-specific pattern should be replaced with a more generic `*.cpython-*.so`
@@ -1,5 +1,5 @@
 [build-system]
-requires = ["setuptools>=45", "wheel"]
+requires = ["setuptools>=45", "wheel", "cython>=3.0.0"]
style: Consider pinning Cython to a specific version range (e.g. `cython>=3.0.0,<4.0.0`) to prevent future compatibility issues
if delim is None:
    if whitespace_mode:
        # Split on whitespace - for word-level splitting
        splits = text.split(" ")  # Split on spaces specifically, not all whitespace
logic: Splitting on the space character only may miss other whitespace characters such as tabs and newlines. Consider using `str.split()` without arguments to split on all whitespace.
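The difference between the two calls can be seen in a short standalone sketch (the sample string is made up for illustration and is not taken from the PR):

```python
# Sample text containing a tab, a newline, and a double space.
text = "alpha beta\tgamma\ndelta  epsilon"

# Splitting on a literal space keeps tabs/newlines inside tokens and
# yields an empty string for the doubled space.
space_only = text.split(" ")

# Splitting with no argument treats any run of whitespace as one separator
# and never yields empty strings.
all_whitespace = text.split()

print(space_only)
print(all_whitespace)

# Trade-off: the space-only split is lossless, the no-argument split is not.
round_trip = " ".join(space_only) == text
```

One caveat when weighing the suggestion: `text.split(" ")` can be joined back into the exact original string, while `text.split()` discards which whitespace characters were present. If the chunker needs to reconstruct the source text from its splits, that may be why the code comment says spaces are used "specifically".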
This pull request introduces several updates to enhance functionality, improve documentation, and add support for new features in the Chonkie library. The most significant changes include adding support for Google's Gemini embedding models, updating documentation to reflect new capabilities, and introducing Cython extensions for performance improvements.
New Features and Enhancements:
- Added `GeminiEmbeddings` to integrate Google's Gemini embedding models. Updated `AutoEmbeddings` to support Gemini models with multiple loading options (`gemini://` syntax, direct model name, etc.). [1] [2]
- Added Cython extensions (`split` and `merge`) for improved performance in chunking operations. Added configuration for Cython in `setup.py` and `pyproject.toml`. [1] [2] [3]
Documentation Updates:
- Updated `DOCS.md` and `README.md` to include Gemini embeddings in the list of supported models and their usage examples. [1] [2]
- Added `cookbook/examples/gemini_embeddings_with_recursive_chunking.py`, demonstrating how to use Gemini embeddings with the RecursiveChunker. [1] [2]
- Updated `DOCS.md` to reflect changes in `AutoEmbeddings` and embedding methods (e.g., `embed` and `embed_batch`).
Miscellaneous Changes:
- Updated `CONTRIBUTING.md` to use the official repository URL.
- Set the version to `1.0.8` in `pyproject.toml` and changed the license field to a string format.
These changes collectively enhance Chonkie's capabilities, improve user experience, and optimize performance for advanced use cases.