Feat: Create Cython functions for `split` and `merge` basic ops for chunking! #163

chonknick · 2025-05-25T22:45:36Z

This pull request introduces several updates to enhance functionality, improve documentation, and add support for new features in the Chonkie library. The most significant changes include adding support for Google's Gemini embedding models, updating documentation to reflect new capabilities, and introducing Cython extensions for performance improvements.

New Features and Enhancements:

Gemini Embeddings Support: Added GeminiEmbeddings to integrate Google's Gemini embedding models. Updated AutoEmbeddings to support Gemini models with multiple loading options (gemini:// syntax, direct model name, etc.). [1] [2]
Cython Extensions: Introduced Cython-based extensions (split and merge) for improved performance in chunking operations. Added configuration for Cython in setup.py and pyproject.toml. [1] [2] [3]

Documentation Updates:

Embedding Models: Updated DOCS.md and README.md to include Gemini embeddings in the list of supported models and their usage examples. [1] [2]
Cookbook Examples: Added a new example in cookbook/examples/gemini_embeddings_with_recursive_chunking.py demonstrating how to use Gemini embeddings with the RecursiveChunker. [1] [2]
API Changes: Updated method signatures and examples in DOCS.md to reflect changes in AutoEmbeddings and embedding methods (e.g., embed and embed_batch).

Miscellaneous Changes:

Repository Updates: Updated the clone URL in CONTRIBUTING.md to use the official repository URL.
Version Bump: Incremented the library version to 1.0.8 in pyproject.toml and changed the license field to a string format.

These changes collectively enhance Chonkie's capabilities, improve user experience, and optimize performance for advanced use cases.

… if we can speed it up!

…utilize it
- Introduced a new Cython module for optimized text splitting.
- Updated `RecursiveChunker`, `SemanticChunker`, `SentenceChunker`, and `SlumberChunker` to use the new `split_text` function when available.
- Enhanced fallback mechanisms for text splitting in case Cython is not available.
- Added `.temp` directory to `.gitignore` and included `CLAUDE.md` file.
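For reviewers unfamiliar with the "compiled when available, Python otherwise" pattern, here is a minimal sketch of how such a fallback could be wired. The module name `my_pkg_c_extensions.split` and the `(text, delimiters)` signature are assumptions for illustration only; the actual extension in this PR lives in `src/chonkie/chunker/c_extensions/split.pyx` and its signature is not shown here.

```python
import importlib
from typing import Callable, List, Optional


def _load_compiled_split() -> Optional[Callable[..., List[str]]]:
    """Return a compiled split function if the extension imports, else None."""
    try:
        # Hypothetical module name, used only for this sketch.
        module = importlib.import_module("my_pkg_c_extensions.split")
    except ImportError:
        return None
    return getattr(module, "split_text", None)


def split_text_fallback(text: str, delimiters: List[str]) -> List[str]:
    """Pure-Python splitter: cut after each delimiter, keeping the
    delimiter attached to the piece that precedes it."""
    pieces: List[str] = []
    start = 0
    i = 0
    while i < len(text):
        # First delimiter (in list order) that matches at position i, if any.
        match = next((d for d in delimiters if d and text.startswith(d, i)), None)
        if match is None:
            i += 1
            continue
        end = i + len(match)
        pieces.append(text[start:end])
        start = i = end
    if start < len(text):
        pieces.append(text[start:])  # trailing text after the last delimiter
    return pieces


_compiled_split = _load_compiled_split()


def split_text(text: str, delimiters: List[str]) -> List[str]:
    """Use the compiled implementation when present, otherwise the fallback."""
    if _compiled_split is not None:
        return _compiled_split(text, delimiters)
    return split_text_fallback(text, delimiters)
```

Because the fallback keeps delimiters attached, joining the pieces reproduces the input, which makes it easy to compare both code paths in tests.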

- Introduced a new Cython module for merging text splits, improving performance by approximately 50%.
- Updated `RecursiveChunker` to utilize the optimized merge function when available, with a Python fallback.
- Enhanced documentation for the merging process and added error handling for input validation.
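To make the merge step concrete, here is a rough pure-Python sketch of greedy merging with input validation. The function name `merge_splits_sketch`, its signature, and the assumption that token counts are additive across concatenated pieces are illustrative; this is not the code from `merge.pyx` or `RecursiveChunker`.

```python
from typing import List, Sequence, Tuple


def merge_splits_sketch(
    splits: Sequence[str],
    token_counts: Sequence[int],
    chunk_size: int,
) -> Tuple[List[str], List[int]]:
    """Greedily join consecutive splits while the running token total
    stays within chunk_size. Assumes token counts simply add up."""
    if len(splits) != len(token_counts):
        raise ValueError("splits and token_counts must have the same length")
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")

    merged: List[str] = []
    merged_counts: List[int] = []
    current: List[str] = []
    current_count = 0

    for piece, count in zip(splits, token_counts):
        # Flush the running group before it would exceed the budget.
        if current and current_count + count > chunk_size:
            merged.append("".join(current))
            merged_counts.append(current_count)
            current, current_count = [], 0
        current.append(piece)
        current_count += count

    if current:
        merged.append("".join(current))
        merged_counts.append(current_count)
    return merged, merged_counts
```

A single split that is larger than `chunk_size` is emitted on its own in this sketch rather than being dropped or split further.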

…unker
- Introduced `find_merge_indices` in the Cython module to enhance performance for merging token counts.
- Updated `SentenceChunker` to utilize the new Cython function when available, with a fallback to the existing Python implementation.
- Improved handling of cumulative token counts for better efficiency in chunking operations.
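The idea of locating merge points from cumulative token counts can be sketched with a prefix sum and a binary search. The name `find_merge_indices_py`, its signature, and the convention of returning exclusive end indices are assumptions made for this example, not the signature of the Cython function.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import List, Sequence


def find_merge_indices_py(token_counts: Sequence[int], chunk_size: int) -> List[int]:
    """Return exclusive end indices of consecutive groups whose token
    totals fit in chunk_size. Assumes non-negative token counts."""
    # cumulative[i] is the total token count of the first i items.
    cumulative = [0, *accumulate(token_counts)]
    n = len(token_counts)
    indices: List[int] = []
    start = 0
    while start < n:
        target = cumulative[start] + chunk_size
        # Largest end such that cumulative[end] <= target.
        end = bisect_right(cumulative, target, lo=start + 1) - 1
        # Always take at least one item so oversized items still make progress.
        end = max(end, start + 1)
        indices.append(end)
        start = end
    return indices
```

With counts `[2, 2, 3, 1, 5, 1]` and a budget of 4, the groups are `[2, 2]`, `[3, 1]`, `[5]`, and `[1]`, so the returned indices are `[2, 4, 5, 6]`.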

- Updated setup.py to include documentation and removed the token_chunker extension, which is no longer in use.
- Simplified docstring formatting in RecursiveChunker for clarity.

- Deleted `test_cython_token_chunker.py`, which contained tests for the Cython token chunking functionality that is no longer in use.
- This cleanup helps streamline the codebase by removing obsolete tests.

…nd EmbeddingsRegistry. Introduced provider alias support, improved error handling, and streamlined model registration methods for better clarity and maintainability.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

…k for default SentenceTransformerEmbeddings in case of registry lookup failure.

…_model for jina-embeddings-v2 types, enhancing consistency in model registration.

…anging 'type' to 'type_alias' for better readability in the embedding registration process.

…ppress type checking errors

… clarity

…the imports and __all__ list for improved functionality.

…various chunkers. Removed API_KEY from class variable and enhanced exception messages for better clarity. Added comprehensive tests for LateChunker, NeuralChunker, SDPMChunker, and SlumberChunker to ensure robust functionality and error handling.

…for API connectivity issues and improving error messages for better user guidance. Ensure clarity in API key requirements and response handling.

…r, SentenceChunker, and TokenChunker for improved security and consistency across chunker implementations.

…ity and consistency across chunker implementations.

… API connectivity issues and invalid responses, enhancing user experience and support contact information.

…f lambda

- Introduced tests for edge cases, including empty text, special characters, and whitespace handling. - Implemented error handling tests for invalid tokens in both character and word tokenizers. - Verified consistency across encoding, decoding, and token counting operations. - Added tests for batch operations and error propagation. - Enhanced coverage for tokenizer initialization and backend detection accuracy.

- Introduced a new test suite for the SlumberChunker, covering initialization, chunking functionality, and edge cases. - Implemented tests for various text splitting methods, including whitespace and delimiter-based approaches. - Validated chunk properties and ensured proper handling of different input scenarios, including empty text and large documents. - Enhanced test coverage for prompt generation and genie interaction, ensuring robust functionality of the SlumberChunker.

- Renamed `is_available` method to `_is_available` in all embeddings and refinery classes for consistency and to indicate that these methods are intended for internal use. - Updated corresponding calls in the implementations to reflect the new method name. - Adjusted tests to verify the functionality of the renamed method.

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

- Introduced a new test suite for the JSONPorter class, covering initialization, export to JSON and JSONL formats, and handling of empty chunk lists. - Validated chunk serialization, context inclusion, and indentation in exported files. - Implemented tests for large chunk lists and Unicode content handling. - Ensured proper error handling for file permission issues and support for Path objects.

… annotations - Added return type annotations to all test functions for improved clarity and type checking. - Updated the temp_dir fixture to specify its return type as a Generator.

- Restructured test cases into classes for better organization and clarity. - Added tests for Model2Vec and SentenceTransformer embeddings, including actual embedding generation. - Implemented provider prefix tests for OpenAI, Cohere, VoyageAI, and Jina embeddings. - Enhanced error handling tests for invalid provider prefixes and model identifiers. - Included tests for handling existing embeddings instances and custom embeddings objects.

- Restructured test cases into classes for improved organization and clarity. - Added tests for initialization with explicit and environment API keys, including error handling for missing keys. - Implemented tests for custom model initialization and tokenizer handling. - Enhanced tests for embedding methods, including single and batch embeddings with mocked API responses. - Validated similarity calculations and error handling for various edge cases.

- Restructured test cases into classes for improved organization and clarity. - Added tests for initialization with default and custom models, including error handling for invalid models and missing API keys. - Enhanced tests for embedding methods, including synchronous and asynchronous embedding with mocked API responses. - Implemented tests for token counting, dimension properties, and similarity checks between embeddings. - Validated handling of edge cases and error scenarios, including empty inputs and API errors.

…tability

- Introduced mocking for API responses to enhance test reliability and avoid external dependencies. - Updated test cases for CodeChunker, LateChunker, RecursiveChunker, SDPMChunker, SemanticChunker, and SlumberChunker to include API key handling. - Added comprehensive tests for various scenarios, including single and batch text processing, empty inputs, and return type validations. - Improved error handling tests for invalid configurations and ensured consistent behavior across chunkers. - Enhanced readability and organization of test cases for better maintainability.

- Introduced mocking for Cohere API dependencies to avoid real API calls during tests. - Updated test cases to use a test API key, ensuring consistent behavior without requiring environment variables. - Added a new test for real API integration, marked as disabled for CI, to validate functionality when the API key is available. - Improved assertions in similarity tests to accommodate a broader range of expected values.
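As an illustration of the mocking approach described in these commits, here is a small sketch using `unittest.mock`. `RemoteEmbedder`, its `client.embed(...)` call, and the response shape are hypothetical stand-ins; they do not mirror the Cohere client or the test files in this PR.

```python
from typing import Any, List
from unittest.mock import MagicMock


class RemoteEmbedder:
    """Hypothetical wrapper around a remote embeddings client."""

    def __init__(self, client: Any, model: str = "demo-model") -> None:
        self.client = client
        self.model = model

    def embed(self, text: str) -> List[float]:
        # One network call per text in this toy version.
        response = self.client.embed(texts=[text], model=self.model)
        return response["embeddings"][0]


def test_embed_uses_mocked_client() -> None:
    # The mock replaces the network client, so no API key or network is needed.
    client = MagicMock()
    client.embed.return_value = {"embeddings": [[0.1, 0.2, 0.3]]}

    embedder = RemoteEmbedder(client)

    assert embedder.embed("hello") == [0.1, 0.2, 0.3]
    client.embed.assert_called_once_with(texts=["hello"], model="demo-model")
```

Injecting the client through the constructor keeps the test free of patching and environment variables.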

- Removed unused imports from test files to enhance clarity and maintainability. - Updated error handling in tokenizer tests to ensure proper exception raising for invalid model names. - Streamlined import statements in genie tests for better organization.

- Added .temp/* to the .gitignore file to prevent temporary files from being tracked in the repository.

- Simplified import handling in tokenizer, embeddings, and handshake modules by removing try-except blocks for optional imports. - Updated type hints in various classes to improve code clarity and maintainability. - Ensured consistent use of type annotations in method signatures for better type checking.

- Updated type hints in `mock_tokenizer` and `mock_process_batch` functions to improve code clarity and type checking. - Ensured consistent use of type annotations for better maintainability in test cases.

- Introduced GeminiEmbeddings to the embeddings module. - Updated import statements and __all__ exports to include GeminiEmbeddings. - Registered GeminiEmbeddings in the EmbeddingsRegistry with associated patterns and models for enhanced functionality.

- Added Gemini embedding model to the documentation, including installation instructions and usage examples. - Updated README to reflect the addition of Gemini as a supported embedding provider. - Enhanced code examples to demonstrate the use of the new `GeminiEmbeddings` class and its methods.

- Included a new section in the README for using Google's Gemini embedding models with Chonkie's RecursiveChunker. - Provided a link to a tutorial demonstrating high-quality text embeddings and similarity analysis with Gemini embeddings.

- Updated type hints for mock API response and test functions to improve code clarity and type checking. - Ensured consistent use of type annotations across all test functions for better maintainability.

…okenization and token count operations. Added methods for caching results to reduce redundant calculations, enhancing efficiency during text refinement. Updated tests to validate recursive overlap functionality and ensure context handling is robust across various scenarios.

…pdated logic to compute context size dynamically based on chunk token counts, ensuring accurate results across different chunk sets. Added tests to verify correctness of context size calculations when reusing the refinery with varying input.

…ter across methods. Updated methods to accept effective_context_size for improved flexibility in chunk processing. Enhanced tests to ensure preservation of float context size during refinement with varying chunk sets.
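The fix described in these commits (computing the context size per call instead of overwriting the configured value) might look roughly like the sketch below. The class name and the rule "a float is a fraction of the largest chunk's token count" are assumptions for illustration, not a statement of how `OverlapRefinery` computes it.

```python
from typing import List, Union


class OverlapSketch:
    """Toy refinery that keeps the configured context_size untouched."""

    def __init__(self, context_size: Union[int, float] = 0.25) -> None:
        self.context_size = context_size

    def effective_context_size(self, token_counts: List[int]) -> int:
        """Compute the size for this call only; never mutate self.context_size."""
        if isinstance(self.context_size, float):
            # Assumption: a float means "fraction of the largest chunk".
            largest = max(token_counts, default=0)
            return max(1, int(self.context_size * largest))
        return self.context_size


refinery = OverlapSketch(context_size=0.25)
first = refinery.effective_context_size([100, 40])   # based on the first chunk set
second = refinery.effective_context_size([8, 4])     # recomputed for a new chunk set
```

Since the float is preserved on the instance, reusing the same refinery on a different chunk set yields a size derived from that set rather than a stale value from the first call.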

…okenization and token count operations. Added methods to manage cache, including cache_info and clear_cache, to optimize repeated processing of similar text. Updated docstrings for clarity on caching behavior.
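A minimal sketch of per-instance LRU caching with `cache_info` and `clear_cache` helpers follows. The class `CachedCounter` and the injected `count_tokens` callable are hypothetical; only `functools.lru_cache` and its `cache_info()` / `cache_clear()` methods are standard-library behavior.

```python
from functools import lru_cache
from typing import Callable, Dict, Optional


class CachedCounter:
    """Caches token counts for repeated strings."""

    def __init__(self, count_tokens: Callable[[str], int], maxsize: int = 8192) -> None:
        # Wrap the callable per instance so the cache key is just the text
        # and each instance owns (and can clear) its own cache.
        self._count_cached = lru_cache(maxsize=maxsize)(count_tokens)

    def count(self, text: str) -> int:
        return self._count_cached(text)

    def cache_info(self) -> Dict[str, Optional[int]]:
        # hits, misses, maxsize, currsize
        return self._count_cached.cache_info()._asdict()

    def clear_cache(self) -> None:
        self._count_cached.cache_clear()


calls = []


def whitespace_count(text: str) -> int:
    calls.append(text)  # record real (uncached) invocations
    return len(text.split())


counter = CachedCounter(whitespace_count)
counter.count("a b c")
counter.count("a b c")  # served from the cache; whitespace_count is not called again
```

The trade-off is memory: a bounded `maxsize` keeps the cache from growing without limit when many distinct strings are processed.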

gemini-code-assist

Hello @chonknick, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

This pull request introduces significant enhancements to the Chonkie library, primarily focusing on performance improvements for core chunking operations through Cython extensions and expanding embedding model support to include Google's Gemini models. It also includes various documentation updates, new examples, and minor repository maintenance.

Highlights

Performance Optimization (Cython): Introduced Cython implementations for the fundamental split and merge operations used in chunking, aiming for substantial performance gains (estimated ~50% faster for merge). This involves adding Cython build configuration and integrating the compiled extensions into the RecursiveChunker, SentenceChunker, and SlumberChunker with Python fallbacks.
Gemini Embeddings Integration: Added a new GeminiEmbeddings class to support Google's Gemini embedding models via the GenAI API. The AutoEmbeddings class has been updated to recognize Gemini models through direct names and a new gemini:// provider syntax, enhancing flexibility in model loading.
Documentation & Examples: Updated README.md and DOCS.md to reflect Gemini support and clarify AutoEmbeddings usage with provider syntax. A new cookbook example (cookbook/examples/gemini_embeddings_with_recursive_chunking.py) demonstrates using Gemini embeddings with the RecursiveChunker.
Cloud Chunker Expansion: Added new Cloud Chunker classes (CodeChunker, NeuralChunker, SlumberChunker) to interface with the Chonkie API for these specific chunking strategies.
Code & Type Hinting Improvements: Refactored type hints across several files (e.g., src/chonkie/chunker/token.py, src/chonkie/types/base.py) for better clarity and correctness. Removed unnecessary try...except ImportError blocks within TYPE_CHECKING guards.
Refinery Enhancements: Improved the OverlapRefinery with LRU caching for tokenization/token counting and fixed index handling when merging context. Also addressed recursive depth issues.
Embeddings Registry Refactor: Overhauled the EmbeddingsRegistry to use separate dictionaries for models, providers, patterns, and types, improving the flexibility and clarity of how embeddings are registered and looked up, especially for AutoEmbeddings.
Testing Updates: Added comprehensive test suites for the new GeminiEmbeddings, GeminiGenie, Cloud Chunkers, Embeddings Registry, Overlap Refinery, and various tokenizer components. Updated existing tests to use mocking where appropriate to reduce external dependencies in CI.
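To illustrate the lookup order described for the registry refactor (provider prefix, then exact model name, then pattern), here is a simplified sketch. The method names follow the changelog (`register_model`, `register_provider`, `register_pattern`, `match`), but the class `MiniRegistry`, the use of plain strings as backends, and the behavior for unknown prefixes are assumptions.

```python
import re
from typing import Dict, Optional, Pattern


class MiniRegistry:
    """Toy registry with separate tables for models, providers, and patterns."""

    def __init__(self) -> None:
        self.models: Dict[str, str] = {}
        self.providers: Dict[str, str] = {}
        self.patterns: Dict[Pattern[str], str] = {}

    def register_model(self, name: str, backend: str) -> None:
        self.models[name] = backend

    def register_provider(self, alias: str, backend: str) -> None:
        self.providers[alias] = backend

    def register_pattern(self, pattern: str, backend: str) -> None:
        self.patterns[re.compile(pattern)] = backend

    def match(self, identifier: str) -> Optional[str]:
        # 1. provider://model syntax takes priority.
        if "://" in identifier:
            provider, _, _model = identifier.partition("://")
            if provider in self.providers:
                return self.providers[provider]
        # 2. Exact model name.
        if identifier in self.models:
            return self.models[identifier]
        # 3. First registered pattern that matches from the start.
        for pattern, backend in self.patterns.items():
            if pattern.match(identifier):
                return backend
        return None


registry = MiniRegistry()
registry.register_provider("gemini", "GeminiEmbeddings")
registry.register_model("all-MiniLM-L6-v2", "SentenceTransformerEmbeddings")
registry.register_pattern(r"^text-embedding-", "OpenAIEmbeddings")
```

In this sketch an unmatched identifier returns `None`, leaving the caller free to apply its own default.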

Changelog

.gitignore
- Added ignore rules for compiled Cython files (*.so, *.c)
- Added ignore rules for temporary directories (/.temp/*, .temp/*) and CLAUDE.md
CONTRIBUTING.md
- Updated repository clone URL to the official chonkie-inc/chonkie.git
DOCS.md
- Added gemini to the table of optional installation features (line 80)
- Added GeminiEmbeddings to the list of available embedding models (line 740)
- Updated AutoEmbeddings examples to use get_embedding and demonstrate provider syntax (model2vec://, st://) (lines 750-756)
- Updated __call__ method signature description for BaseEmbeddings (line 771)
- Updated OpenAIGenie documentation formatting with installation instructions and class definition details (lines 916-947)
README.md
- Updated the count of supported embedding models from 6+ to 7+ (line 150)
- Added GeminiEmbeddings to the table of supported embedding models (line 160)
cookbook/README.md
- Added a link to the new Gemini Embeddings with Recursive Chunking example (line 16)
- Added RAGHub to the list of community integrations (line 46)
cookbook/examples/gemini_embeddings_with_recursive_chunking.py
- Added a new example script demonstrating the use of Gemini embeddings with RecursiveChunker, including initialization, chunking, embedding, similarity computation, and token analysis (lines 1-190)
pyproject.toml
- Added cython>=3.0.0 to build-system.requires (line 2)
- Bumped project version to 1.0.8 (line 7)
- Changed license field to a string format (line 11)
- Removed duplicate License :: OSI Approved :: MIT License classifier (line 34)
- Added cython>=3.0.0 to project.optional-dependencies.dev (line 115)
- Added [tool.setuptools.package-data] configuration for Cython files (*.pyx, *.pxd) (lines 140-141)
- Added [tool.setuptools.dynamic] configuration for version attribute (lines 143-144)
setup.py
- Added a new setup script to configure and build Cython extensions (split.pyx, merge.pyx) using setuptools and Cython.Build.cythonize (lines 1-22)
src/chonkie/__init__.py
- Imported GeminiEmbeddings (line 19)
- Bumped __version__ to 1.0.8 (line 68)
- Added GeminiEmbeddings to the __all__ list (line 122)
src/chonkie/chomp/pipeline.py
- Added return type hint -> None to __init__ method (line 6)
src/chonkie/chunker/c_extensions/merge.pyx
- Added Cython implementation for _merge_splits and find_merge_indices functions, including C array usage and inline binary search for performance optimization (lines 1-250)
src/chonkie/chunker/c_extensions/split.pyx
- Added Cython implementation for split_text function, handling delimiters, whitespace, and merging short segments (lines 1-141)
src/chonkie/chunker/code.py
- Removed try...except ImportError block for tree_sitter and tree_sitter_language_pack within TYPE_CHECKING (lines 17-24)
- Set _use_multiprocessing attribute to False (line 83)
src/chonkie/chunker/recursive.py
- Imported Cython split_text and _merge_splits with fallback logic (lines 19-30)
- Modified _split_text to use Cython split_text when available for delimiter-based splitting (lines 141-150)
- Renamed original _merge_splits to _merge_splits_fallback (line 252)
- Updated _merge_splits to use Cython _merge_splits_cython when available, falling back to _merge_splits_fallback (lines 236-250)
src/chonkie/chunker/semantic.py
- Imported Cython split_text with fallback logic (lines 16-20)
- Modified _split_sentences to use Cython split_text when available (lines 237-246)
src/chonkie/chunker/sentence.py
- Imported Cython find_merge_indices and split_text with fallback logic (lines 21-32)
- Modified _split_text to use Cython split_text when available (lines 180-189)
- Removed comment about adding 1 token for spaces in chunk method (line 315)
- Modified chunk method to use Cython find_merge_indices when available for finding split points (lines 328-339)
src/chonkie/chunker/slumber.py
- Imported Cython split_text with fallback logic (lines 14-18)
- Modified _split_text to use Cython split_text when available (lines 111-136)
- Added _split_text_fallback method containing the original Python splitting logic (lines 141-168)
src/chonkie/chunker/token.py
- Updated type hint for chunk_texts parameter in _create_chunks from List[str] to Sequence[str] (line 66)
- Updated type hint for tokens parameter in _token_group_generator from List[int] to Sequence[int] (line 106)
- Converted token slice to list in _token_group_generator yield statement (line 111)
- Added type hint list to result variable in _process_batch (line 153)
- Added type hint list to chunks variable in chunk_batch (line 199)
src/chonkie/cloud/__init__.py
- Imported CodeChunker, NeuralChunker, and SlumberChunker (lines 5, 7, 12)
- Added CodeChunker, NeuralChunker, and SlumberChunker to the __all__ list (lines 24, 25, 26)
src/chonkie/cloud/chunker/__init__.py
- Imported CodeChunker, NeuralChunker, and SlumberChunker (lines 4, 6, 11)
- Added CodeChunker, NeuralChunker, and SlumberChunker to the __all__ list (lines 22, 23, 24)
src/chonkie/cloud/chunker/code.py
- Added a new CodeChunker class for interacting with the Chonkie Cloud API for code chunking (lines 1-114)
src/chonkie/cloud/chunker/neural.py
- Added a new NeuralChunker class for interacting with the Chonkie Cloud API for neural chunking (lines 1-102)
src/chonkie/cloud/chunker/recursive.py
- Removed API_KEY class attribute (line 18)
src/chonkie/cloud/chunker/semantic.py
- Removed API_KEY class attribute (line 16)
src/chonkie/cloud/chunker/sentence.py
- Removed API_KEY class attribute (line 16)
src/chonkie/cloud/chunker/slumber.py
- Added a new SlumberChunker class for interacting with the Chonkie Cloud API for slumber chunking (lines 1-154)
src/chonkie/cloud/chunker/token.py
- Removed API_KEY class attribute (line 16)
src/chonkie/embeddings/__init__.py
- Imported GeminiEmbeddings (line 6)
- Added GeminiEmbeddings to the __all__ list (line 20)
src/chonkie/embeddings/auto.py
- Refactored get_embeddings method to prioritize provider://model syntax lookup, then registry match, then fallback to SentenceTransformer (lines 69-107)
- Improved warning messages during fallback attempts (lines 87-98)
src/chonkie/embeddings/base.py
- Renamed is_available method to _is_available to indicate it's an internal method (line 112)
- Updated _import_dependencies to call self._is_available() (line 72)
src/chonkie/embeddings/cohere.py
- Renamed is_available class method to _is_available (line 231)
- Updated _import_dependencies to call cls._is_available() (line 243)
src/chonkie/embeddings/gemini.py
- Added a new GeminiEmbeddings class for integrating with Google's GenAI API for embeddings (lines 1-238)
src/chonkie/embeddings/jina.py
- Removed try...except ImportError blocks within TYPE_CHECKING guards (lines 11-16)
src/chonkie/embeddings/model2vec.py
- Removed try...except ImportError blocks within TYPE_CHECKING guards (lines 9-16)
- Renamed is_available class method to _is_available (line 72)
- Updated _import_dependencies to call cls._is_available() (line 83)
- Updated __repr__ to use model attribute instead of model_name_or_path (line 94)
src/chonkie/embeddings/openai.py
- Removed try...except ImportError blocks within TYPE_CHECKING guards (lines 11-16)
- Renamed is_available method to _is_available (line 191)
- Updated _import_dependencies to call self._is_available() (line 206)
src/chonkie/embeddings/registry.py
- Refactored the registry structure using separate dictionaries for models, providers, patterns, and types (lines 20-23)
- Renamed register method to register_model (line 26)
- Added register_provider, register_pattern, and register_types methods (lines 45-91)
- Added get_provider method (lines 95-97)
- Updated match logic to prioritize provider prefix, then model name, then pattern (lines 120-137)
- Updated wrap method to use the type registry (lines 161-164)
- Updated existing embedding registrations to use the new registration methods (lines 171-227)
- Added registration for Gemini embeddings (lines 229-233)
src/chonkie/embeddings/sentence_transformer.py
- Renamed is_available class method to _is_available (line 149)
- Updated _import_dependencies to call cls._is_available() (line 163)
src/chonkie/embeddings/voyageai.py
- Removed try...except ImportError blocks within TYPE_CHECKING guards (lines 12-20)
- Renamed is_available method to _is_available (line 277)
- Updated _import_dependencies to call self._is_available() (line 285)
src/chonkie/friends/handshakes/chroma.py
- Removed try...except ImportError blocks within TYPE_CHECKING guards (lines 15-20)
src/chonkie/friends/handshakes/qdrant.py
- Removed try...except ImportError blocks within TYPE_CHECKING guards (lines 24-29)
src/chonkie/friends/handshakes/turbopuffer.py
- Removed try...except ImportError blocks within TYPE_CHECKING guards (lines 15-18)
src/chonkie/genie/gemini.py
- Added a new GeminiGenie class for integrating with Google's GenAI API for text and JSON generation (lines 1-238)
src/chonkie/genie/openai.py
- Removed try...except ImportError blocks within TYPE_CHECKING guards (lines 9-17)
src/chonkie/refinery/base.py
- Renamed is_available abstract method to _is_available (line 13)
src/chonkie/refinery/embedding.py
- Updated _is_available method to call self.embedding_model._is_available() (line 41)
src/chonkie/refinery/overlap.py
- Added LRU caching (lru_cache) for _get_tokens_impl and _count_tokens_impl methods (lines 78-82)
- Added _get_tokens_impl and _count_tokens_impl methods for caching (lines 89-95)
- Added clear_cache and cache_info methods (lines 97-111)
- Modified _split_text to accept effective_context_size and use it for token-based splitting (lines 113, 133-134)
- Added _get_token_counts_cached method to use cached token counting (lines 141-143)
- Modified _group_splits to accept effective_context_size and use it in the token count check (lines 145, 160)
- Modified _prefix_overlap_token to accept effective_context_size and use it for token slicing (lines 168, 185, 189)
- Modified _recursive_overlap to accept effective_context_size and pass it recursively (lines 191, 215, 228)
- Modified _prefix_overlap_recursive to accept effective_context_size and pass it (lines 238, 252)
- Modified _get_prefix_overlap_context to accept effective_context_size and pass it to helper methods (lines 254, 264, 266)
- Modified _refine_prefix to accept effective_context_size and pass it to _get_prefix_overlap_context (lines 270, 287)
- Removed index adjustment logic when merging context in _refine_prefix (lines 295-298)
- Updated token count update in _refine_prefix to use cached token counting (lines 300-302)
- Modified _suffix_overlap_token to accept effective_context_size and use it for token slicing (lines 307, 324, 328)
- Modified _suffix_overlap_recursive to accept effective_context_size and pass it (lines 330, 344)
- Modified _get_suffix_overlap_context to accept effective_context_size and pass it to helper methods (lines 346, 356, 358)
- Modified _refine_suffix to accept effective_context_size and pass it to _get_suffix_overlap_context (lines 362, 379)
- Removed index adjustment logic when merging context in _refine_suffix (lines 387-390)
- Updated token count update in _refine_suffix to use cached token counting (lines 391-393)
- Modified _get_overlap_context_size to calculate and return the effective size without modifying self.context_size (lines 405-408)
- Modified refine to get effective_context_size and pass it to _refine_prefix or _refine_suffix (lines 435, 439, 441)
src/chonkie/tokenizer.py
- Removed try...except ImportError blocks within TYPE_CHECKING guards (lines 11-26)
- Added defaulttoken2id method to BaseTokenizer for pickling compatibility (lines 28-34)
src/chonkie/types/base.py
- Added return type hint -> None to __post_init__ method (line 24)
- Added return type hint -> int to __len__ method (line 41)
- Added return type hint -> Iterator[str] to __iter__ method (line 104)
- Added return type hint -> str to __getitem__ method (line 108)
src/chonkie/types/code.py
- Removed try...except ImportError block within TYPE_CHECKING guards (lines 9-12)
src/chonkie/types/sentence.py
- Added return type hint -> None to __post_init__ method (line 26)
tests/chunkers/test_neural_chunker.py
- Added a new test file with comprehensive tests for the NeuralChunker class (lines 1-444)
tests/chunkers/test_sdpm_chunker.py
- Updated test file with more comprehensive tests for SDPMChunker, including initialization, basic functionality, internal methods, edge cases, representation, parameter variations, recipe feature, and batch processing (lines 1-529)
- Added fixtures for multi-topic text and short text (lines 23-43)
- Removed tests requiring specific API keys (OpenAI, Cohere) by focusing on mocked embeddings (lines 3-8, 35-60, 124-143)
tests/chunkers/test_semantic_chunker.py
- Added comprehensive tests for SemanticChunker, including parameter validation, mode configuration, threshold types, internal methods, threshold calculation, and edge cases (lines 399-850)
tests/chunkers/test_slumber_chunker.py
- Added a new test file with comprehensive tests for the SlumberChunker class, including initialization, internal methods, chunking, edge cases, prompt generation, representation, and integration (lines 1-627)
tests/cloud/test_cloud_code_chunker.py
- Added a new test file with comprehensive tests for the Cloud Code Chunker (lines 1-517)
tests/cloud/test_cloud_late_chunker.py
- Added a new test file with comprehensive tests for the Cloud Late Chunker (lines 1-225)
tests/cloud/test_cloud_neural_chunker.py
- Added a new test file with comprehensive tests for the Cloud Neural Chunker (lines 1-222)
tests/cloud/test_cloud_recursive_chunker.py
- Updated test file with tests for Cloud Recursive Chunker using mocking (lines 12-57, 93-160)
tests/cloud/test_cloud_sdpm_chunker.py
- Updated test file with tests for Cloud SDPM Chunker using mocking (lines 11-41, 129-210)
tests/cloud/test_cloud_slumber_chunker.py
- Added a new test file with comprehensive tests for the Cloud Slumber Chunker (lines 1-333)
tests/embeddings/test_auto_embeddings.py
- Added more comprehensive tests for AutoEmbeddings, including provider prefixes, different input types, and error handling (lines 18-166)
tests/embeddings/test_cohere_embeddings.py
- Added mocking for Cohere API calls and tokenizer download to enable tests without API key (lines 14-46)
- Added more comprehensive tests for CohereEmbeddings (lines 71-163)
tests/embeddings/test_embeddings_registry.py
- Added a new test file with comprehensive tests for the EmbeddingsRegistry class (lines 1-339)
tests/embeddings/test_gemini_embeddings.py
- Added a new test file with comprehensive tests for GeminiEmbeddings, including mocking and real API tests (if key available) (lines 1-344)
tests/embeddings/test_jina_embeddings.py
- Added comprehensive tests for JinaEmbeddings, including mocking and real API tests (if key available) (lines 15-606)
tests/embeddings/test_model2vec_embeddings.py
- Updated test to use _is_available method (line 84)
tests/embeddings/test_openai_embeddings.py
- Updated test to use _is_available method (line 129)
tests/embeddings/test_sentence_transformer_embeddings.py
- Updated test to use _is_available method (line 112)
tests/embeddings/test_voyageai_embeddings.py
- Added comprehensive tests for VoyageAIEmbeddings, including mocking and real API tests (if key available) (lines 14-594)
tests/genie/test_base_genie.py
- Added a new test file with comprehensive tests for the BaseGenie abstract class and its default batch implementations (lines 1-371)
tests/genie/test_gemini_genie.py
- Added a new test file with comprehensive tests for the GeminiGenie class (lines 1-246)
tests/genie/test_openai_genie.py
- Added a new test file with comprehensive tests for the OpenAIGenie class (lines 1-201)
tests/refinery/__init__.py
- Added an empty __init__.py file to the tests/refinery directory (line 1)
tests/refinery/test_embedding_refinery.py
- Added a new test file with comprehensive tests for the EmbeddingsRefinery, including mocking (lines 1-350)
tests/refinery/test_overlap_refinery.py
- Added a new test file with comprehensive tests for the OverlapRefinery, including caching, recursive mode, and edge cases (lines 1-785)
tests/test_tokenizer.py
- Added comprehensive tests for CharacterTokenizer, WordTokenizer, and the unified Tokenizer class, including edge cases, error handling, and backend-specific paths (lines 512-1081)
tests/test_viz.py
- Added a new test file with comprehensive tests for the Visualizer class, including print, save, theme handling, and edge cases (lines 1-737)
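Several changelog entries remove `try...except ImportError` blocks inside `TYPE_CHECKING` guards. A short sketch of why that is safe: `typing.TYPE_CHECKING` is `False` at runtime, so the guarded import never executes and cannot raise. The package name `some_optional_backend` and the function `make_client` are hypothetical.

```python
from typing import TYPE_CHECKING, Any

if TYPE_CHECKING:
    # Seen only by static type checkers; this branch is skipped at runtime,
    # so wrapping it in try/except ImportError has no effect.
    from some_optional_backend import Client  # hypothetical optional dependency


def make_client(**kwargs: Any) -> "Client":
    """Import the optional dependency lazily, where a failure can actually occur."""
    try:
        from some_optional_backend import Client
    except ImportError as exc:
        raise ImportError(
            "Install the optional backend to use make_client()."
        ) from exc
    return Client(**kwargs)
```

The runtime import inside the function is the place where error handling still matters, since that is where a missing package would surface.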

Code split and merged fast,
Cython speeds the task at last.
Gemini joins the crew,
Chunking, embedding anew,
A hippo's work, built to last.

gemini-code-assist

Code Review

This pull request introduces significant enhancements, primarily by adding Cython-based performance improvements for chunking operations (split and merge) and integrating support for Google's Gemini embedding models. The refactoring of AutoEmbeddings and EmbeddingsRegistry improves model loading logic and clarity. Documentation has been updated commendably to reflect these new features and API changes. The addition of new example files and comprehensive test suites is also a great step towards ensuring quality and usability.

Overall, the changes are well-implemented and address the goals outlined. There are a few areas where clarification or minor adjustments could be beneficial, detailed in the specific comments.

Summary of Findings

Cython Performance Enhancements: The introduction of Cython extensions for split and merge operations is a key highlight and should provide significant performance benefits for chunking.
Gemini Embeddings Support: Support for Google's Gemini embedding models has been added, expanding the range of available embedding options.
Embeddings Loading Refactor: The AutoEmbeddings and EmbeddingsRegistry classes have been substantially refactored, improving the clarity and flexibility of loading and managing embedding models, including new provider alias syntax.
API Changes: There are API changes, notably the use of AutoEmbeddings.get_embeddings() instead of direct instantiation, which should be clearly communicated to users.
OverlapRefinery Improvements: The OverlapRefinery has been improved with LRU caching for tokenization and more robust handling of float-based context_size.
Behavioral Change in CodeChunker: The default for _use_multiprocessing in CodeChunker was changed to False. The impact and reasoning for this should be clear.
Test Coverage: A significant number of new tests have been added, which is excellent for maintainability and reliability.
Documentation Updates: DOCS.md and README.md have been updated to reflect new features and API changes, which is crucial for users.
Minor: .gitignore entry: The .gitignore file includes a very specific compiled file path (src/chonkie/chunker/c_extensions/token_chunker.cpython-310-darwin.so). While *.so covers this, specific paths for local build artifacts are usually not needed if generic patterns suffice. This is a low severity finding and was not added as a comment due to review settings.
Minor: Cookbook Example Date: The new Gemini example file (cookbook/examples/gemini_embeddings_with_recursive_chunking.py) has a date of 2025. This is a low severity finding and was not added as a comment due to review settings.
Minor: pyproject.toml License Classifier: In pyproject.toml, the license field was changed to "MIT" (which is fine), but the corresponding classifier "License :: OSI Approved :: MIT License" was removed. Usually, these are kept together. This is a low severity finding and was not added as a comment due to review settings.
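
The provider alias syntax mentioned in the findings (for example `openai://text-embedding-ada-002`) can be illustrated with a small parser. This is a minimal sketch, not Chonkie's implementation: `split_provider_alias` is a hypothetical helper, and splitting on the first `://` only is a choice made here so that a model name containing `://` does not raise on unpacking.

```python
from typing import Optional, Tuple


def split_provider_alias(model: str) -> Tuple[Optional[str], str]:
    """Split 'provider://name' into (provider, name).

    Hypothetical helper for illustration. Returns (None, model) when the
    identifier carries no provider alias.
    """
    if "://" in model:
        # maxsplit=1 keeps any later '://' inside the model name.
        provider, name = model.split("://", 1)
        return provider, name
    return None, model
```

A caller could then look up the provider in a registry when the first element is not `None`, and fall back to pattern matching on the bare identifier otherwise.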

Merge Readiness

This pull request introduces substantial improvements and new features. The Cython extensions are a great step for performance, and Gemini support is a valuable addition. The refactoring in the embeddings system also enhances maintainability.

Before merging, I recommend addressing the medium severity comments, particularly clarifying the _use_multiprocessing change in CodeChunker, confirming the resolution of SlumberChunker issues, and considering the Gemini API batching question for potential future optimization. The removal of EmbeddingsRegistry.list_available might also be worth a quick discussion on whether an alternative is needed.

Once these points are clarified or addressed, the PR should be in good shape. As I am an AI, I am not authorized to approve pull requests; please ensure other reviewers approve this code before merging.

codecov · 2025-05-25T22:50:26Z

❌ 36 Tests Failed:

Tests completed	Failed	Passed	Skipped
2940	36	2904	208
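
The first trace below bottoms out in an HTTP 429 (Too Many Requests) from the Hugging Face Hub while downloading `config.json`, which points to rate limiting in CI rather than a defect in the code under test. One common mitigation is to retry the network call with exponential backoff. The following is a minimal sketch under that assumption; `retry_with_backoff` and its parameters are hypothetical names and are not part of this PR or of `huggingface_hub`.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fn: Callable[[], T],
    *,
    retries: int = 3,
    base_delay: float = 1.0,
    exceptions: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying on the given exceptions with exponential backoff.

    Hypothetical helper for illustration. The delay doubles after each
    failed attempt: base_delay, 2 * base_delay, 4 * base_delay, ...
    """
    for attempt in range(retries + 1):
        try:
            return fn()
        except exceptions:
            # Out of retries: re-raise the last error unchanged.
            if attempt == retries:
                raise
            sleep(base_delay * 2 ** attempt)
    # Only reachable if retries is negative, so fn was never called.
    raise RuntimeError("retry_with_backoff requires retries >= 0")
```

In a test suite this could wrap the model-loading call inside a fixture, with `exceptions` narrowed to the error type the loader actually raises. Caching the downloaded model between CI runs is an alternative that avoids the network call entirely.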

View the top 3 failed test(s) by shortest run time

tests.handshakes.test_qdrant_handshake::test_qdrant_handshake_write_single_chunk

Stack Traces | 0s run time

response = <Response [429]>, endpoint_name = None

    def hf_raise_for_status(response: Response, endpoint_name: Optional[str] = None) -> None:
        """
        Internal version of `response.raise_for_status()` that will refine a
        potential HTTPError. Raised exception will be an instance of `HfHubHTTPError`.
    
        This helper is meant to be the unique method to raise_for_status when making a call
        to the Hugging Face Hub.
    
    
        Example:
        ```py
            import requests
            from huggingface_hub.utils import get_session, hf_raise_for_status, HfHubHTTPError
    
            response = get_session().post(...)
            try:
                hf_raise_for_status(response)
            except HfHubHTTPError as e:
                print(str(e)) # formatted message
                e.request_id, e.server_message # details returned by server
    
                # Complete the error message with additional information once it's raised
                e.append_to_message("\n`create_commit` expects the repository to exist.")
                raise
        ```
    
        Args:
            response (`Response`):
                Response from the server.
            endpoint_name (`str`, *optional*):
                Name of the endpoint that has been called. If provided, the error message
                will be more complete.
    
        <Tip warning={true}>
    
        Raises when the request has failed:
    
            - [`~utils.RepositoryNotFoundError`]
                If the repository to download from cannot be found. This may be because it
                doesn't exist, because `repo_type` is not set correctly, or because the repo
                is `private` and you do not have access.
            - [`~utils.GatedRepoError`]
                If the repository exists but is gated and the user is not on the authorized
                list.
            - [`~utils.RevisionNotFoundError`]
                If the repository exists but the revision couldn't be find.
            - [`~utils.EntryNotFoundError`]
                If the repository exists but the entry (e.g. the requested file) couldn't be
                find.
            - [`~utils.BadRequestError`]
                If request failed with a HTTP 400 BadRequest error.
            - [`~utils.HfHubHTTPError`]
                If request failed for a reason not listed above.
    
        </Tip>
        """
        try:
>           response.raise_for_status()

.venv/lib/python3.9.../huggingface_hub/utils/_http.py:409: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <Response [429]>

    def raise_for_status(self):
        """Raises :class:`HTTPError`, if one occurred."""
    
        http_error_msg = ""
        if isinstance(self.reason, bytes):
            # We attempt to decode utf-8 first because some servers
            # choose to localize their reason strings. If the string
            # isn't utf-8, we fall back to iso-8859-1 for all other
            # encodings. (See PR #3538)
            try:
                reason = self.reason.decode("utf-8")
            except UnicodeDecodeError:
                reason = self.reason.decode("iso-8859-1")
        else:
            reason = self.reason
    
        if 400 <= self.status_code < 500:
            http_error_msg = (
                f"{self.status_code} Client Error: {reason} for url: {self.url}"
            )
    
        elif 500 <= self.status_code < 600:
            http_error_msg = (
                f"{self.status_code} Server Error: {reason} for url: {self.url}"
            )
    
        if http_error_msg:
>           raise HTTPError(http_error_msg, response=self)
E           requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://huggingface..../resolve/main/config.json

.venv/lib/python3.9.../site-packages/requests/models.py:1024: HTTPError

The above exception was the direct cause of the following exception:

    def _get_metadata_or_catch_error(
        *,
        repo_id: str,
        filename: str,
        repo_type: str,
        revision: str,
        endpoint: Optional[str],
        proxies: Optional[Dict],
        etag_timeout: Optional[float],
        headers: Dict[str, str],  # mutated inplace!
        token: Union[bool, str, None],
        local_files_only: bool,
        relative_filename: Optional[str] = None,  # only used to store `.no_exists` in cache
        storage_folder: Optional[str] = None,  # only used to store `.no_exists` in cache
    ) -> Union[
        # Either an exception is caught and returned
        Tuple[None, None, None, None, None, Exception],
        # Or the metadata is returned as
        # `(url_to_download, etag, commit_hash, expected_size, xet_file_data, None)`
        Tuple[str, str, str, int, Optional[XetFileData], None],
    ]:
        """Get metadata for a file on the Hub, safely handling network issues.
    
        Returns either the etag, commit_hash and expected size of the file, or the error
        raised while fetching the metadata.
    
        NOTE: This function mutates `headers` inplace! It removes the `authorization` header
              if the file is a LFS blob and the domain of the url is different from the
              domain of the location (typically an S3 bucket).
        """
        if local_files_only:
            return (
                None,
                None,
                None,
                None,
                None,
                OfflineModeIsEnabled(
                    f"Cannot access file since 'local_files_only=True' as been set. (repo_id: {repo_id}, repo_type: {repo_type}, revision: {revision}, filename: {filename})"
                ),
            )
    
        url = hf_hub_url(repo_id, filename, repo_type=repo_type, revision=revision, endpoint=endpoint)
        url_to_download: str = url
        etag: Optional[str] = None
        commit_hash: Optional[str] = None
        expected_size: Optional[int] = None
        head_error_call: Optional[Exception] = None
        xet_file_data: Optional[XetFileData] = None
    
        # Try to get metadata from the server.
        # Do not raise yet if the file is not found or not accessible.
        if not local_files_only:
            try:
                try:
>                   metadata = get_hf_file_metadata(
                        url=url, proxies=proxies, timeout=etag_timeout, headers=headers, token=token
                    )

.venv/lib/python3.9...................../site-packages/huggingface_hub/file_download.py:1484: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
.venv/lib/python3.9.../huggingface_hub/utils/_validators.py:114: in _inner_fn
    return fn(*args, **kwargs)
.venv/lib/python3.9...................../site-packages/huggingface_hub/file_download.py:1401: in get_hf_file_metadata
    r = _request_wrapper(
.venv/lib/python3.9...................../site-packages/huggingface_hub/file_download.py:285: in _request_wrapper
    response = _request_wrapper(
.venv/lib/python3.9...................../site-packages/huggingface_hub/file_download.py:309: in _request_wrapper
    hf_raise_for_status(response)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

response = <Response [429]>, endpoint_name = None

    def hf_raise_for_status(response: Response, endpoint_name: Optional[str] = None) -> None:
        """
        Internal version of `response.raise_for_status()` that will refine a
        potential HTTPError. Raised exception will be an instance of `HfHubHTTPError`.
    
        This helper is meant to be the unique method to raise_for_status when making a call
        to the Hugging Face Hub.
    
    
        Example:
        ```py
            import requests
            from huggingface_hub.utils import get_session, hf_raise_for_status, HfHubHTTPError
    
            response = get_session().post(...)
            try:
                hf_raise_for_status(response)
            except HfHubHTTPError as e:
                print(str(e)) # formatted message
                e.request_id, e.server_message # details returned by server
    
                # Complete the error message with additional information once it's raised
                e.append_to_message("\n`create_commit` expects the repository to exist.")
                raise
        ```
    
        Args:
            response (`Response`):
                Response from the server.
            endpoint_name (`str`, *optional*):
                Name of the endpoint that has been called. If provided, the error message
                will be more complete.
    
        <Tip warning={true}>
    
        Raises when the request has failed:
    
            - [`~utils.RepositoryNotFoundError`]
                If the repository to download from cannot be found. This may be because it
                doesn't exist, because `repo_type` is not set correctly, or because the repo
                is `private` and you do not have access.
            - [`~utils.GatedRepoError`]
                If the repository exists but is gated and the user is not on the authorized
                list.
            - [`~utils.RevisionNotFoundError`]
                If the repository exists but the revision couldn't be find.
            - [`~utils.EntryNotFoundError`]
                If the repository exists but the entry (e.g. the requested file) couldn't be
                find.
            - [`~utils.BadRequestError`]
                If request failed with a HTTP 400 BadRequest error.
            - [`~utils.HfHubHTTPError`]
                If request failed for a reason not listed above.
    
        </Tip>
        """
        try:
            response.raise_for_status()
        except HTTPError as e:
            error_code = response.headers.get("X-Error-Code")
            error_message = response.headers.get("X-Error-Message")
    
            if error_code == "RevisionNotFound":
                message = f"{response.status_code} Client Error." + "\n\n" + f"Revision Not Found for url: {response.url}."
                raise _format(RevisionNotFoundError, message, response) from e
    
            elif error_code == "EntryNotFound":
                message = f"{response.status_code} Client Error." + "\n\n" + f"Entry Not Found for url: {response.url}."
                raise _format(EntryNotFoundError, message, response) from e
    
            elif error_code == "GatedRepo":
                message = (
                    f"{response.status_code} Client Error." + "\n\n" + f"Cannot access gated repo for url {response.url}."
                )
                raise _format(GatedRepoError, message, response) from e
    
            elif error_message == "Access to this resource is disabled.":
                message = (
                    f"{response.status_code} Client Error."
                    + "\n\n"
                    + f"Cannot access repository for url {response.url}."
                    + "\n"
                    + "Access to this resource is disabled."
                )
                raise _format(DisabledRepoError, message, response) from e
    
            elif error_code == "RepoNotFound" or (
                response.status_code == 401
                and error_message != "Invalid credentials in Authorization header"
                and response.request is not None
                and response.request.url is not None
                and REPO_API_REGEX.search(response.request.url) is not None
            ):
                # 401 is misleading as it is returned for:
                #    - private and gated repos if user is not authenticated
                #    - missing repos
                # => for now, we process them as `RepoNotFound` anyway.
                # See https://gist.github.com/Wauplin/46c27ad266b15998ce56a6603796f0b9
                message = (
                    f"{response.status_code} Client Error."
                    + "\n\n"
                    + f"Repository Not Found for url: {response.url}."
                    + "\nPlease make sure you specified the correct `repo_id` and"
                    " `repo_type`.\nIf you are trying to access a private or gated repo,"
                    " make sure you are authenticated. For more details, see"
                    " https://huggingface..../docs/huggingface_hub/authentication"
                )
                raise _format(RepositoryNotFoundError, message, response) from e
    
            elif response.status_code == 400:
                message = (
                    f"\n\nBad request for {endpoint_name} endpoint:" if endpoint_name is not None else "\n\nBad request:"
                )
                raise _format(BadRequestError, message, response) from e
    
            elif response.status_code == 403:
                message = (
                    f"\n\n{response.status_code} Forbidden: {error_message}."
                    + f"\nCannot access content at: {response.url}."
                    + "\nMake sure your token has the correct permissions."
                )
                raise _format(HfHubHTTPError, message, response) from e
    
            elif response.status_code == 416:
                range_header = response.request.headers.get("Range")
                message = f"{e}. Requested range: {range_header}. Content-Range: {response.headers.get('Content-Range')}."
                raise _format(HfHubHTTPError, message, response) from e
    
            # Convert `HTTPError` into a `HfHubHTTPError` to display request information
            # as well (request id and/or server error message)
>           raise _format(HfHubHTTPError, str(e), response) from e
E           huggingface_hub.errors.HfHubHTTPError: 429 Client Error: Too Many Requests for url: https://huggingface..../resolve/main/config.json

.venv/lib/python3.9.../huggingface_hub/utils/_http.py:482: HfHubHTTPError

The above exception was the direct cause of the following exception:

path_or_repo_id = 'minishlab/potion-retrieval-32M', filenames = ['config.json']
cache_dir = '....../home/runner/.cache/huggingface/hub', force_download = False
resume_download = None, proxies = None, token = None, revision = None
local_files_only = False, subfolder = '', repo_type = None
user_agent = 'transformers/4.51.0; python/3.9.22; session_id/6622f2e5bfc24f4a9690223555789824; torch/2.6.0; file_type/config; from_auto_class/True'
_raise_exceptions_for_gated_repo = True
_raise_exceptions_for_missing_entries = True
_raise_exceptions_for_connection_errors = True, _commit_hash = None
deprecated_kwargs = {}, use_auth_token = None, full_filenames = ['config.json']
existing_files = [], filename = 'config.json', file_counter = 0

    def cached_files(
        path_or_repo_id: Union[str, os.PathLike],
        filenames: list[str],
        cache_dir: Optional[Union[str, os.PathLike]] = None,
        force_download: bool = False,
        resume_download: Optional[bool] = None,
        proxies: Optional[dict[str, str]] = None,
        token: Optional[Union[bool, str]] = None,
        revision: Optional[str] = None,
        local_files_only: bool = False,
        subfolder: str = "",
        repo_type: Optional[str] = None,
        user_agent: Optional[Union[str, dict[str, str]]] = None,
        _raise_exceptions_for_gated_repo: bool = True,
        _raise_exceptions_for_missing_entries: bool = True,
        _raise_exceptions_for_connection_errors: bool = True,
        _commit_hash: Optional[str] = None,
        **deprecated_kwargs,
    ) -> Optional[str]:
        """
        Tries to locate several files in a local folder and repo, downloads and cache them if necessary.
    
        Args:
            path_or_repo_id (`str` or `os.PathLike`):
                This can be either:
                - a string, the *model id* of a model repo on huggingface.co.
                - a path to a *directory* potentially containing the file.
            filenames (`List[str]`):
                The name of all the files to locate in `path_or_repo`.
            cache_dir (`str` or `os.PathLike`, *optional*):
                Path to a directory in which a downloaded pretrained model configuration should be cached if the standard
                cache should not be used.
            force_download (`bool`, *optional*, defaults to `False`):
                Whether or not to force to (re-)download the configuration files and override the cached versions if they
                exist.
            resume_download:
                Deprecated and ignored. All downloads are now resumed by default when possible.
                Will be removed in v5 of Transformers.
            proxies (`Dict[str, str]`, *optional*):
                A dictionary of proxy servers to use by protocol or endpoint, e.g., `{'http': 'foo.bar:3128',
                'http://hostname': 'foo.bar:4012'}.` The proxies are used on each request.
            token (`str` or *bool*, *optional*):
                The token to use as HTTP bearer authorization for remote files. If `True`, will use the token generated
                when running `huggingface-cli login` (stored in `~/.huggingface`).
            revision (`str`, *optional*, defaults to `"main"`):
                The specific model version to use. It can be a branch name, a tag name, or a commit id, since we use a
                git-based system for storing models and other artifacts on huggingface.co, so `revision` can be any
                identifier allowed by git.
            local_files_only (`bool`, *optional*, defaults to `False`):
                If `True`, will only try to load the tokenizer configuration from local files.
            subfolder (`str`, *optional*, defaults to `""`):
                In case the relevant files are located inside a subfolder of the model repo on huggingface.co, you can
                specify the folder name here.
            repo_type (`str`, *optional*):
                Specify the repo type (useful when downloading from a space for instance).
    
        Private args:
            _raise_exceptions_for_gated_repo (`bool`):
                if False, do not raise an exception for gated repo error but return None.
            _raise_exceptions_for_missing_entries (`bool`):
                if False, do not raise an exception for missing entries but return None.
            _raise_exceptions_for_connection_errors (`bool`):
                if False, do not raise an exception for connection errors but return None.
            _commit_hash (`str`, *optional*):
                passed when we are chaining several calls to various files (e.g. when loading a tokenizer or
                a pipeline). If files are cached for this commit hash, avoid calls to head and get from the cache.
    
        <Tip>
    
        Passing `token=True` is required when you want to use a private model.
    
        </Tip>
    
        Returns:
            `Optional[str]`: Returns the resolved file (to the cache folder if downloaded from a repo).
    
        Examples:
    
        ```python
        # Download a model weight from the Hub and cache it.
        model_weights_file = cached_file("google-bert/bert-base-uncased", "pytorch_model.bin")
        ```
        """
        use_auth_token = deprecated_kwargs.pop("use_auth_token", None)
        if use_auth_token is not None:
            warnings.warn(
                "The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.",
                FutureWarning,
            )
            if token is not None:
                raise ValueError("`token` and `use_auth_token` are both specified. Please set only the argument `token`.")
            token = use_auth_token
    
        if is_offline_mode() and not local_files_only:
            logger.info("Offline mode: forcing local_files_only=True")
            local_files_only = True
        if subfolder is None:
            subfolder = ""
    
        # Add folder to filenames
        full_filenames = [os.path.join(subfolder, file) for file in filenames]
    
        path_or_repo_id = str(path_or_repo_id)
        existing_files = []
        for filename in full_filenames:
            if os.path.isdir(path_or_repo_id):
                resolved_file = os.path.join(path_or_repo_id, filename)
                if not os.path.isfile(resolved_file):
                    if _raise_exceptions_for_missing_entries and filename != os.path.join(subfolder, "config.json"):
                        revision_ = "main" if revision is None else revision
                        raise OSError(
                            f"{path_or_repo_id} does not appear to have a file named {filename}. Checkout "
                            f"'https://huggingface.co/{path_or_repo_id}/tree/{revision_}' for available files."
                        )
                    else:
                        return None
                existing_files.append(resolved_file)
    
        # All files exist
        if len(existing_files) == len(full_filenames):
            return existing_files
    
        if cache_dir is None:
            cache_dir = TRANSFORMERS_CACHE
        if isinstance(cache_dir, Path):
            cache_dir = str(cache_dir)
    
        existing_files = []
        file_counter = 0
        if _commit_hash is not None and not force_download:
            for filename in full_filenames:
                # If the file is cached under that commit hash, we return it directly.
                resolved_file = try_to_load_from_cache(
                    path_or_repo_id, filename, cache_dir=cache_dir, revision=_commit_hash, repo_type=repo_type
                )
                if resolved_file is not None:
                    if resolved_file is not _CACHED_NO_EXIST:
                        file_counter += 1
                        existing_files.append(resolved_file)
                    elif not _raise_exceptions_for_missing_entries:
                        file_counter += 1
                    else:
                        raise OSError(f"Could not locate {filename} inside {path_or_repo_id}.")
    
        # Either all the files were found, or some were _CACHED_NO_EXIST but we do not raise for missing entries
        if file_counter == len(full_filenames):
            return existing_files if len(existing_files) > 0 else None
    
        user_agent = http_user_agent(user_agent)
        # download the files if needed
        try:
            if len(full_filenames) == 1:
                # This is slightly better for only 1 file
>               hf_hub_download(
                    path_or_repo_id,
                    filenames[0],
                    subfolder=None if len(subfolder) == 0 else subfolder,
                    repo_type=repo_type,
                    revision=revision,
                    cache_dir=cache_dir,
                    user_agent=user_agent,
                    force_download=force_download,
                    proxies=proxies,
                    resume_download=resume_download,
                    token=token,
                    local_files_only=local_files_only,
                )

.venv/lib/python3.9.../transformers/utils/hub.py:424: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
.venv/lib/python3.9.../huggingface_hub/utils/_validators.py:114: in _inner_fn
    return fn(*args, **kwargs)
.venv/lib/python3.9...................../site-packages/huggingface_hub/file_download.py:961: in hf_hub_download
    return _hf_hub_download_to_cache_dir(
.venv/lib/python3.9...................../site-packages/huggingface_hub/file_download.py:1068: in _hf_hub_download_to_cache_dir
    _raise_on_head_call_error(head_call_error, force_download, local_files_only)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

head_call_error = HfHubHTTPError('429 Client Error: Too Many Requests for url: https://huggingface..../resolve/main/config.json')
force_download = False, local_files_only = False

    def _raise_on_head_call_error(head_call_error: Exception, force_download: bool, local_files_only: bool) -> NoReturn:
        """Raise an appropriate error when the HEAD call failed and we cannot locate a local file."""
        # No head call => we cannot force download.
        if force_download:
            if local_files_only:
                raise ValueError("Cannot pass 'force_download=True' and 'local_files_only=True' at the same time.")
            elif isinstance(head_call_error, OfflineModeIsEnabled):
                raise ValueError("Cannot pass 'force_download=True' when offline mode is enabled.") from head_call_error
            else:
                raise ValueError("Force download failed due to the above error.") from head_call_error
    
        # No head call + couldn't find an appropriate file on disk => raise an error.
        if local_files_only:
            raise LocalEntryNotFoundError(
                "Cannot find the requested files in the disk cache and outgoing traffic has been disabled. To enable"
                " hf.co look-ups and downloads online, set 'local_files_only' to False."
            )
        elif isinstance(head_call_error, (RepositoryNotFoundError, GatedRepoError)) or (
            isinstance(head_call_error, HfHubHTTPError) and head_call_error.response.status_code == 401
        ):
            # Repo not found or gated => let's raise the actual error
            # Unauthorized => likely a token issue => let's raise the actual error
            raise head_call_error
        else:
            # Otherwise: most likely a connection issue or Hub downtime => let's warn the user
>           raise LocalEntryNotFoundError(
                "An error happened while trying to locate the file on the Hub and we cannot find the requested files"
                " in the local cache. Please check your connection and try again or make sure your Internet connection"
                " is on."
            ) from head_call_error
E           huggingface_hub.errors.LocalEntryNotFoundError: An error happened while trying to locate the file on the Hub and we cannot find the requested files in the local cache. Please check your connection and try again or make sure your Internet connection is on.

.venv/lib/python3.9...................../site-packages/huggingface_hub/file_download.py:1599: LocalEntryNotFoundError

The above exception was the direct cause of the following exception:

cls = <class 'chonkie.embeddings.auto.AutoEmbeddings'>
model = 'minishlab/potion-retrieval-32M', kwargs = {}
embeddings_instance = None, embeddings_cls = None
SentenceTransformerEmbeddings = <class 'chonkie.embeddings.sentence_transformer.SentenceTransformerEmbeddings'>

    @classmethod
    def get_embeddings(cls, model: Union[str, BaseEmbeddings, Any], **kwargs: Any) -> BaseEmbeddings:
        """Get embeddings instance based on identifier.
    
        Args:
            model: Identifier for the embeddings (name, path, URL, etc.)
            **kwargs: Additional arguments passed to the embeddings constructor
    
        Returns:
            Initialized embeddings instance
    
        Raises:
            ValueError: If no suitable embeddings implementation is found
    
        Examples:
            # Get sentence transformers embeddings
            embeddings = AutoEmbeddings.get_embeddings("sentence-transformers/all-MiniLM-L6-v2")
    
            # Get OpenAI embeddings
            embeddings = AutoEmbeddings.get_embeddings("openai://text-embedding-ada-002", api_key="...")
    
            # Get Anthropic embeddings
            embeddings = AutoEmbeddings.get_embeddings("anthropic://claude-v1", api_key="...")
    
            # Get Cohere embeddings
            embeddings = AutoEmbeddings.get_embeddings("cohere://embed-english-light-v3.0", api_key="...")
    
        """
        # Load embeddings instance if already provided
        if isinstance(model, BaseEmbeddings):
            return model
        elif isinstance(model, str):
            # Initializing the embedding instance
            embeddings_instance = None
    
            # Check if the user passed in a provider alias
            if "://" in model:
                provider, model_name = model.split("://")
                embeddings_cls = EmbeddingsRegistry.get_provider(provider)
                if embeddings_cls:
                    try:
                        return embeddings_cls(model_name, **kwargs)  # type: ignore
                    except Exception as error:
                        raise ValueError(f"Failed to load {model} with {embeddings_cls.__name__}, with error: {error}")
                else:
                    raise ValueError(f"No provider found for {provider}. Please check the provider name and try again.")
            else:
                # Try to find matching implementation via registry
                embeddings_cls = EmbeddingsRegistry.match(model)
                if embeddings_cls:
                        try:
                            # Try instantiating with the model identifier
                            embeddings_instance = embeddings_cls(model, **kwargs)  # type: ignore
                        except Exception as error:
                            warnings.warn(
                                f"Failed to load {model} with {embeddings_cls.__name__}: {error}\n"
                                f"Falling back to loading default provider model."
                            )
                            try:
                                # Try instantiating with the default provider model without the model identifier
                                embeddings_instance = embeddings_cls(**kwargs)
                            except Exception as error:
                                warnings.warn(
                                    f"Failed to load the default model for {embeddings_cls.__name__}: {error}\n"
                                    f"Falling back to SentenceTransformerEmbeddings."
                                )
    
            # If registry lookup and instantiation succeeded, return the instance
            if embeddings_instance:
                return embeddings_instance
    
            # If registry lookup and instantiation failed, return the default SentenceTransformerEmbeddings
            from .sentence_transformer import SentenceTransformerEmbeddings
            try:
>               return SentenceTransformerEmbeddings(model, **kwargs)

.../chonkie/embeddings/auto.py:107: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
.../chonkie/embeddings/sentence_transformer.py:49: in __init__
    self.model = SentenceTransformer(self.model_name_or_path, **kwargs)
.venv/lib/python3.9....../site-packages/sentence_transformers/SentenceTransformer.py:321: in __init__
    modules = self._load_auto_model(
.venv/lib/python3.9....../site-packages/sentence_transformers/SentenceTransformer.py:1600: in _load_auto_model
    transformer_model = Transformer(
.venv/lib/python3.9.../sentence_transformers/models/Transformer.py:80: in __init__
    config, is_peft_model = self._load_config(model_name_or_path, cache_dir, backend, config_args)
.venv/lib/python3.9.../sentence_transformers/models/Transformer.py:145: in _load_config
    return AutoConfig.from_pretrained(model_name_or_path, **config_args, cache_dir=cache_dir), False
.venv/lib/python3.9.../models/auto/configuration_auto.py:1112: in from_pretrained
    config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
.venv/lib/python3.9....../site-packages/transformers/configuration_utils.py:590: in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
.venv/lib/python3.9....../site-packages/transformers/configuration_utils.py:649: in _get_config_dict
    resolved_config_file = cached_file(
.venv/lib/python3.9.../transformers/utils/hub.py:266: in cached_file
    file = cached_files(path_or_repo_id=path_or_repo_id, filenames=[filename], **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

path_or_repo_id = 'minishlab/potion-retrieval-32M', filenames = ['config.json']
cache_dir = '....../home/runner/.cache/huggingface/hub', force_download = False
resume_download = None, proxies = None, token = None, revision = None
local_files_only = False, subfolder = '', repo_type = None
user_agent = 'transformers/4.51.0; python/3.9.22; session_id/6622f2e5bfc24f4a9690223555789824; torch/2.6.0; file_type/config; from_auto_class/True'
_raise_exceptions_for_gated_repo = True
_raise_exceptions_for_missing_entries = True
_raise_exceptions_for_connection_errors = True, _commit_hash = None
deprecated_kwargs = {}, use_auth_token = None, full_filenames = ['config.json']
existing_files = [], filename = 'config.json', file_counter = 0

    def cached_files(
        path_or_repo_id: Union[str, os.PathLike],
        filenames: list[str],
        cache_dir: Optional[Union[str, os.PathLike]] = None,
        force_download: bool = False,
        resume_download: Optional[bool] = None,
        proxies: Optional[dict[str, str]] = None,
        token: Optional[Union[bool, str]] = None,
        revision: Optional[str] = None,
        local_files_only: bool = False,
        subfolder: str = "",
        repo_type: Optional[str] = None,
        user_agent: Optional[Union[str, dict[str, str]]] = None,
        _raise_exceptions_for_gated_repo: bool = True,
        _raise_exceptions_for_missing_entries: bool = True,
        _raise_exceptions_for_connection_errors: bool = True,
        _commit_hash: Optional[str] = None,
        **deprecated_kwargs,
    ) -> Optional[str]:
        """
        Tries to locate several files in a local folder and repo, downloads and cache them if necessary.
    
        Args:
            path_or_repo_id (`str` or `os.PathLike`):
                This can be either:
                - a string, the *model id* of a model repo on huggingface.co.
                - a path to a *directory* potentially containing the file.
            filenames (`List[str]`):
                The name of all the files to locate in `path_or_repo`.
            cache_dir (`str` or `os.PathLike`, *optional*):
                Path to a directory in which a downloaded pretrained model configuration should be cached if the standard
                cache should not be used.
            force_download (`bool`, *optional*, defaults to `False`):
                Whether or not to force to (re-)download the configuration files and override the cached versions if they
                exist.
            resume_download:
                Deprecated and ignored. All downloads are now resumed by default when possible.
                Will be removed in v5 of Transformers.
            proxies (`Dict[str, str]`, *optional*):
                A dictionary of proxy servers to use by protocol or endpoint, e.g., `{'http': 'foo.bar:3128',
                'http://hostname': 'foo.bar:4012'}.` The proxies are used on each request.
            token (`str` or *bool*, *optional*):
                The token to use as HTTP bearer authorization for remote files. If `True`, will use the token generated
                when running `huggingface-cli login` (stored in `~/.huggingface`).
            revision (`str`, *optional*, defaults to `"main"`):
                The specific model version to use. It can be a branch name, a tag name, or a commit id, since we use a
                git-based system for storing models and other artifacts on huggingface.co, so `revision` can be any
                identifier allowed by git.
            local_files_only (`bool`, *optional*, defaults to `False`):
                If `True`, will only try to load the tokenizer configuration from local files.
            subfolder (`str`, *optional*, defaults to `""`):
                In case the relevant files are located inside a subfolder of the model repo on huggingface.co, you can
                specify the folder name here.
            repo_type (`str`, *optional*):
                Specify the repo type (useful when downloading from a space for instance).
    
        Private args:
            _raise_exceptions_for_gated_repo (`bool`):
                if False, do not raise an exception for gated repo error but return None.
            _raise_exceptions_for_missing_entries (`bool`):
                if False, do not raise an exception for missing entries but return None.
            _raise_exceptions_for_connection_errors (`bool`):
                if False, do not raise an exception for connection errors but return None.
            _commit_hash (`str`, *optional*):
                passed when we are chaining several calls to various files (e.g. when loading a tokenizer or
                a pipeline). If files are cached for this commit hash, avoid calls to head and get from the cache.
    
        <Tip>
    
        Passing `token=True` is required when you want to use a private model.
    
        </Tip>
    
        Returns:
            `Optional[str]`: Returns the resolved file (to the cache folder if downloaded from a repo).
    
        Examples:
    
        ```python
        # Download a model weight from the Hub and cache it.
        model_weights_file = cached_file("google-bert/bert-base-uncased", "pytorch_model.bin")
        ```
        """
        use_auth_token = deprecated_kwargs.pop("use_auth_token", None)
        if use_auth_token is not None:
            warnings.warn(
                "The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.",
                FutureWarning,
            )
            if token is not None:
                raise ValueError("`token` and `use_auth_token` are both specified. Please set only the argument `token`.")
            token = use_auth_token
    
        if is_offline_mode() and not local_files_only:
            logger.info("Offline mode: forcing local_files_only=True")
            local_files_only = True
        if subfolder is None:
            subfolder = ""
    
        # Add folder to filenames
        full_filenames = [os.path.join(subfolder, file) for file in filenames]
    
        path_or_repo_id = str(path_or_repo_id)
        existing_files = []
        for filename in full_filenames:
            if os.path.isdir(path_or_repo_id):
                resolved_file = os.path.join(path_or_repo_id, filename)
                if not os.path.isfile(resolved_file):
                    if _raise_exceptions_for_missing_entries and filename != os.path.join(subfolder, "config.json"):
                        revision_ = "main" if revision is None else revision
                        raise OSError(
                            f"{path_or_repo_id} does not appear to have a file named {filename}. Checkout "
                            f"'https://huggingface.co/{path_or_repo_id}/tree/{revision_}' for available files."
                        )
                    else:
                        return None
                existing_files.append(resolved_file)
    
        # All files exist
        if len(existing_files) == len(full_filenames):
            return existing_files
    
        if cache_dir is None:
            cache_dir = TRANSFORMERS_CACHE
        if isinstance(cache_dir, Path):
            cache_dir = str(cache_dir)
    
        existing_files = []
        file_counter = 0
        if _commit_hash is not None and not force_download:
            for filename in full_filenames:
                # If the file is cached under that commit hash, we return it directly.
                resolved_file = try_to_load_from_cache(
                    path_or_repo_id, filename, cache_dir=cache_dir, revision=_commit_hash, repo_type=repo_type
                )
                if resolved_file is not None:
                    if resolved_file is not _CACHED_NO_EXIST:
                        file_counter += 1
                        existing_files.append(resolved_file)
                    elif not _raise_exceptions_for_missing_entries:
                        file_counter += 1
                    else:
                        raise OSError(f"Could not locate {filename} inside {path_or_repo_id}.")
    
        # Either all the files were found, or some were _CACHED_NO_EXIST but we do not raise for missing entries
        if file_counter == len(full_filenames):
            return existing_files if len(existing_files) > 0 else None
    
        user_agent = http_user_agent(user_agent)
        # download the files if needed
        try:
            if len(full_filenames) == 1:
                # This is slightly better for only 1 file
                hf_hub_download(
                    path_or_repo_id,
                    filenames[0],
                    subfolder=None if len(subfolder) == 0 else subfolder,
                    repo_type=repo_type,
                    revision=revision,
                    cache_dir=cache_dir,
                    user_agent=user_agent,
                    force_download=force_download,
                    proxies=proxies,
                    resume_download=resume_download,
                    token=token,
                    local_files_only=local_files_only,
                )
            else:
                snapshot_download(
                    path_or_repo_id,
                    allow_patterns=full_filenames,
                    repo_type=repo_type,
                    revision=revision,
                    cache_dir=cache_dir,
                    user_agent=user_agent,
                    force_download=force_download,
                    proxies=proxies,
                    resume_download=resume_download,
                    token=token,
                    local_files_only=local_files_only,
                )
    
        except Exception as e:
            # We cannot recover from them
            if isinstance(e, RepositoryNotFoundError) and not isinstance(e, GatedRepoError):
                raise OSError(
                    f"{path_or_repo_id} is not a local folder and is not a valid model identifier "
                    "listed on 'https://huggingface.co/models'\nIf this is a private repository, make sure to pass a token "
                    "having permission to this repo either by logging in with `huggingface-cli login` or by passing "
                    "`token=<your_token>`"
                ) from e
            elif isinstance(e, RevisionNotFoundError):
                raise OSError(
                    f"{revision} is not a valid git identifier (branch name, tag name or commit id) that exists "
                    "for this model name. Check the model page at "
                    f"'https://huggingface.co/{path_or_repo_id}' for available revisions."
                ) from e
    
            # Now we try to recover if we can find all files correctly in the cache
            resolved_files = [
                _get_cache_file_to_return(path_or_repo_id, filename, cache_dir, revision) for filename in full_filenames
            ]
            if all(file is not None for file in resolved_files):
                return resolved_files
    
            # Raise based on the flags. Note that we will raise for missing entries at the very end, even when
            # not entering this Except block, as it may also happen when `snapshot_download` does not raise
            if isinstance(e, GatedRepoError):
                if not _raise_exceptions_for_gated_repo:
                    return None
                raise OSError(
                    "You are trying to access a gated repo.\nMake sure to have access to it at "
                    f"https://huggingface.co/{path_or_repo_id}.\n{str(e)}"
                ) from e
            elif isinstance(e, LocalEntryNotFoundError):
                if not _raise_exceptions_for_connection_errors:
                    return None
                # Here we only raise if both flags for missing entry and connection errors are True (because it can be raised
                # even when `local_files_only` is True, in which case raising for connections errors only would not make sense)
                elif _raise_exceptions_for_missing_entries:
>                   raise OSError(
                        f"We couldn't connect to '{HUGGINGFACE_CO_RESOLVE_ENDPOINT}' to load the files, and couldn't find them in the"
                        f" cached files.\nCheckout your internet connection or see how to run the library in offline mode at"
                        " 'https://huggingface..../docs/transformers/installation#offline-mode'."
                    ) from e
E                   OSError: We couldn't connect to 'https://huggingface.co' to load the files, and couldn't find them in the cached files.
E                   Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface..../docs/transformers/installation#offline-mode'.

.venv/lib/python3.9.../transformers/utils/hub.py:491: OSError

During handling of the above exception, another exception occurred:

    @pytest.fixture(scope="module")
    def real_embeddings() -> BaseEmbeddings:
        """Provide an instance of the actual default embedding model."""
        # Use scope="module" to load the model only once per test module run
        # Set environment variable to potentially avoid Hugging Face Hub login prompts in some CI environments
        os.environ["HF_HUB_DISABLE_PROGRESS_BARS"] = "1"
>       return AutoEmbeddings.get_embeddings(DEFAULT_EMBEDDING_MODEL)

tests/handshakes/test_qdrant_handshake.py:61: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

cls = <class 'chonkie.embeddings.auto.AutoEmbeddings'>
model = 'minishlab/potion-retrieval-32M', kwargs = {}
embeddings_instance = None, embeddings_cls = None
SentenceTransformerEmbeddings = <class 'chonkie.embeddings.sentence_transformer.SentenceTransformerEmbeddings'>

    @classmethod
    def get_embeddings(cls, model: Union[str, BaseEmbeddings, Any], **kwargs: Any) -> BaseEmbeddings:
        """Get embeddings instance based on identifier.
    
        Args:
            model: Identifier for the embeddings (name, path, URL, etc.)
            **kwargs: Additional arguments passed to the embeddings constructor
    
        Returns:
            Initialized embeddings instance
    
        Raises:
            ValueError: If no suitable embeddings implementation is found
    
        Examples:
            # Get sentence transformers embeddings
            embeddings = AutoEmbeddings.get_embeddings("sentence-transformers/all-MiniLM-L6-v2")
    
            # Get OpenAI embeddings
            embeddings = AutoEmbeddings.get_embeddings("openai://text-embedding-ada-002", api_key="...")
    
            # Get Anthropic embeddings
            embeddings = AutoEmbeddings.get_embeddings("anthropic://claude-v1", api_key="...")
    
            # Get Cohere embeddings
            embeddings = AutoEmbeddings.get_embeddings("cohere://embed-english-light-v3.0", api_key="...")
    
        """
        # Load embeddings instance if already provided
        if isinstance(model, BaseEmbeddings):
            return model
        elif isinstance(model, str):
            # Initializing the embedding instance
            embeddings_instance = None
    
            # Check if the user passed in a provider alias
            if "://" in model:
                provider, model_name = model.split("://")
                embeddings_cls = EmbeddingsRegistry.get_provider(provider)
                if embeddings_cls:
                    try:
                        return embeddings_cls(model_name, **kwargs)  # type: ignore
                    except Exception as error:
                        raise ValueError(f"Failed to load {model} with {embeddings_cls.__name__}, with error: {error}")
                else:
                    raise ValueError(f"No provider found for {provider}. Please check the provider name and try again.")
            else:
                # Try to find matching implementation via registry
                embeddings_cls = EmbeddingsRegistry.match(model)
                if embeddings_cls:
                        try:
                            # Try instantiating with the model identifier
                            embeddings_instance = embeddings_cls(model, **kwargs)  # type: ignore
                        except Exception as error:
                            warnings.warn(
                                f"Failed to load {model} with {embeddings_cls.__name__}: {error}\n"
                                f"Falling back to loading default provider model."
                            )
                            try:
                                # Try instantiating with the default provider model without the model identifier
                                embeddings_instance = embeddings_cls(**kwargs)
                            except Exception as error:
                                warnings.warn(
                                    f"Failed to load the default model for {embeddings_cls.__name__}: {error}\n"
                                    f"Falling back to SentenceTransformerEmbeddings."
                                )
    
            # If registry lookup and instantiation succeeded, return the instance
            if embeddings_instance:
                return embeddings_instance
    
            # If registry lookup and instantiation failed, return the default SentenceTransformerEmbeddings
            from .sentence_transformer import SentenceTransformerEmbeddings
            try:
                return SentenceTransformerEmbeddings(model, **kwargs)
            except Exception as e:
>               raise ValueError(f"Failed to load embeddings via SentenceTransformerEmbeddings after registry/fallback failure: {e}")
E               ValueError: Failed to load embeddings via SentenceTransformerEmbeddings after registry/fallback failure: We couldn't connect to 'https://huggingface.co' to load the files, and couldn't find them in the cached files.
E               Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface..../docs/transformers/installation#offline-mode'.

.../chonkie/embeddings/auto.py:109: ValueError
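
Note on reading the chain above: the `OSError` is raised with `from e`, so pytest prints "The above exception was the direct cause of the following exception", while the final `ValueError` in `auto.py` is raised inside an `except` block without `from`, so it is linked only implicitly ("During handling of the above exception, another exception occurred"). The sketch below reproduces that pattern with stand-in exceptions and shows one possible way to check whether a failure traces back to a network-style error. `FakeHubError`, `iter_chain`, `caused_by`, and `load_model_like_trace` are hypothetical names for illustration only; they are not part of chonkie, transformers, or huggingface_hub.

```python
from typing import Iterator, Optional, Type


class FakeHubError(Exception):
    """Stand-in for a low-level Hub error (hypothetical, illustration only)."""


def iter_chain(exc: BaseException) -> Iterator[BaseException]:
    """Yield exc, then every exception linked via __cause__ or __context__."""
    seen = set()
    current: Optional[BaseException] = exc
    while current is not None and id(current) not in seen:
        seen.add(id(current))
        yield current
        # Explicit cause (`raise ... from e`) wins over the implicit context.
        current = current.__cause__ or current.__context__


def caused_by(exc: BaseException, exc_type: Type[BaseException]) -> bool:
    """Return True if any exception in the chain is an instance of exc_type."""
    return any(isinstance(e, exc_type) for e in iter_chain(exc))


def load_model_like_trace() -> None:
    """Mimic the shape of the failure shown in the trace above."""
    try:
        try:
            raise FakeHubError("cannot reach the Hub")
        except FakeHubError as err:
            # Explicit chaining: sets __cause__ on the OSError.
            raise OSError("We couldn't connect to load the files") from err
    except Exception as e:
        # No `from e`: only __context__ is set, as in the ValueError above.
        raise ValueError(f"Failed to load embeddings: {e}")
```

In a test fixture, a check like `caused_by(err, OSError)` could be used to call `pytest.skip(...)` when the model cannot be downloaded, so that Hub outages or rate limits are reported as skips rather than errors. Whether that is appropriate for this test suite is a judgment call for the maintainers.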

tests.handshakes.test_qdrant_handshake::test_generate_payload

Stack Traces | 0.001s run time

response = <Response [429]>, endpoint_name = None

    def hf_raise_for_status(response: Response, endpoint_name: Optional[str] = None) -> None:
        """
        Internal version of `response.raise_for_status()` that will refine a
        potential HTTPError. Raised exception will be an instance of `HfHubHTTPError`.
    
        This helper is meant to be the unique method to raise_for_status when making a call
        to the Hugging Face Hub.
    
    
        Example:
        ```py
            import requests
            from huggingface_hub.utils import get_session, hf_raise_for_status, HfHubHTTPError
    
            response = get_session().post(...)
            try:
                hf_raise_for_status(response)
            except HfHubHTTPError as e:
                print(str(e)) # formatted message
                e.request_id, e.server_message # details returned by server
    
                # Complete the error message with additional information once it's raised
                e.append_to_message("\n`create_commit` expects the repository to exist.")
                raise
        ```
    
        Args:
            response (`Response`):
                Response from the server.
            endpoint_name (`str`, *optional*):
                Name of the endpoint that has been called. If provided, the error message
                will be more complete.
    
        <Tip warning={true}>
    
        Raises when the request has failed:
    
            - [`~utils.RepositoryNotFoundError`]
                If the repository to download from cannot be found. This may be because it
                doesn't exist, because `repo_type` is not set correctly, or because the repo
                is `private` and you do not have access.
            - [`~utils.GatedRepoError`]
                If the repository exists but is gated and the user is not on the authorized
                list.
            - [`~utils.RevisionNotFoundError`]
                If the repository exists but the revision couldn't be find.
            - [`~utils.EntryNotFoundError`]
                If the repository exists but the entry (e.g. the requested file) couldn't be
                find.
            - [`~utils.BadRequestError`]
                If request failed with a HTTP 400 BadRequest error.
            - [`~utils.HfHubHTTPError`]
                If request failed for a reason not listed above.
    
        </Tip>
        """
        try:
>           response.raise_for_status()

.venv/lib/python3.12.../huggingface_hub/utils/_http.py:409: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <Response [429]>

    def raise_for_status(self):
        """Raises :class:`HTTPError`, if one occurred."""
    
        http_error_msg = ""
        if isinstance(self.reason, bytes):
            # We attempt to decode utf-8 first because some servers
            # choose to localize their reason strings. If the string
            # isn't utf-8, we fall back to iso-8859-1 for all other
            # encodings. (See PR #3538)
            try:
                reason = self.reason.decode("utf-8")
            except UnicodeDecodeError:
                reason = self.reason.decode("iso-8859-1")
        else:
            reason = self.reason
    
        if 400 <= self.status_code < 500:
            http_error_msg = (
                f"{self.status_code} Client Error: {reason} for url: {self.url}"
            )
    
        elif 500 <= self.status_code < 600:
            http_error_msg = (
                f"{self.status_code} Server Error: {reason} for url: {self.url}"
            )
    
        if http_error_msg:
>           raise HTTPError(http_error_msg, response=self)
E           requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://huggingface..../resolve/main/config.json

.venv/lib/python3.12.../site-packages/requests/models.py:1024: HTTPError

The above exception was the direct cause of the following exception:

    def _get_metadata_or_catch_error(
        *,
        repo_id: str,
        filename: str,
        repo_type: str,
        revision: str,
        endpoint: Optional[str],
        proxies: Optional[Dict],
        etag_timeout: Optional[float],
        headers: Dict[str, str],  # mutated inplace!
        token: Union[bool, str, None],
        local_files_only: bool,
        relative_filename: Optional[str] = None,  # only used to store `.no_exists` in cache
        storage_folder: Optional[str] = None,  # only used to store `.no_exists` in cache
    ) -> Union[
        # Either an exception is caught and returned
        Tuple[None, None, None, None, None, Exception],
        # Or the metadata is returned as
        # `(url_to_download, etag, commit_hash, expected_size, xet_file_data, None)`
        Tuple[str, str, str, int, Optional[XetFileData], None],
    ]:
        """Get metadata for a file on the Hub, safely handling network issues.
    
        Returns either the etag, commit_hash and expected size of the file, or the error
        raised while fetching the metadata.
    
        NOTE: This function mutates `headers` inplace! It removes the `authorization` header
              if the file is a LFS blob and the domain of the url is different from the
              domain of the location (typically an S3 bucket).
        """
        if local_files_only:
            return (
                None,
                None,
                None,
                None,
                None,
                OfflineModeIsEnabled(
                    f"Cannot access file since 'local_files_only=True' as been set. (repo_id: {repo_id}, repo_type: {repo_type}, revision: {revision}, filename: {filename})"
                ),
            )
    
        url = hf_hub_url(repo_id, filename, repo_type=repo_type, revision=revision, endpoint=endpoint)
        url_to_download: str = url
        etag: Optional[str] = None
        commit_hash: Optional[str] = None
        expected_size: Optional[int] = None
        head_error_call: Optional[Exception] = None
        xet_file_data: Optional[XetFileData] = None
    
        # Try to get metadata from the server.
        # Do not raise yet if the file is not found or not accessible.
        if not local_files_only:
            try:
                try:
>                   metadata = get_hf_file_metadata(
                        url=url, proxies=proxies, timeout=etag_timeout, headers=headers, token=token
                    )

.venv/lib/python3.12...................../site-packages/huggingface_hub/file_download.py:1484: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
.venv/lib/python3.12.../huggingface_hub/utils/_validators.py:114: in _inner_fn
    return fn(*args, **kwargs)
.venv/lib/python3.12...................../site-packages/huggingface_hub/file_download.py:1401: in get_hf_file_metadata
    r = _request_wrapper(
.venv/lib/python3.12...................../site-packages/huggingface_hub/file_download.py:285: in _request_wrapper
    response = _request_wrapper(
.venv/lib/python3.12...................../site-packages/huggingface_hub/file_download.py:309: in _request_wrapper
    hf_raise_for_status(response)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

response = <Response [429]>, endpoint_name = None

    def hf_raise_for_status(response: Response, endpoint_name: Optional[str] = None) -> None:
        """
        Internal version of `response.raise_for_status()` that will refine a
        potential HTTPError. Raised exception will be an instance of `HfHubHTTPError`.
    
        This helper is meant to be the unique method to raise_for_status when making a call
        to the Hugging Face Hub.
    
    
        Example:
        ```py
            import requests
            from huggingface_hub.utils import get_session, hf_raise_for_status, HfHubHTTPError
    
            response = get_session().post(...)
            try:
                hf_raise_for_status(response)
            except HfHubHTTPError as e:
                print(str(e)) # formatted message
                e.request_id, e.server_message # details returned by server
    
                # Complete the error message with additional information once it's raised
                e.append_to_message("\n`create_commit` expects the repository to exist.")
                raise
        ```
    
        Args:
            response (`Response`):
                Response from the server.
            endpoint_name (`str`, *optional*):
                Name of the endpoint that has been called. If provided, the error message
                will be more complete.
    
        <Tip warning={true}>
    
        Raises when the request has failed:
    
            - [`~utils.RepositoryNotFoundError`]
                If the repository to download from cannot be found. This may be because it
                doesn't exist, because `repo_type` is not set correctly, or because the repo
                is `private` and you do not have access.
            - [`~utils.GatedRepoError`]
                If the repository exists but is gated and the user is not on the authorized
                list.
            - [`~utils.RevisionNotFoundError`]
                If the repository exists but the revision couldn't be find.
            - [`~utils.EntryNotFoundError`]
                If the repository exists but the entry (e.g. the requested file) couldn't be
                find.
            - [`~utils.BadRequestError`]
                If request failed with a HTTP 400 BadRequest error.
            - [`~utils.HfHubHTTPError`]
                If request failed for a reason not listed above.
    
        </Tip>
        """
        try:
            response.raise_for_status()
        except HTTPError as e:
            error_code = response.headers.get("X-Error-Code")
            error_message = response.headers.get("X-Error-Message")
    
            if error_code == "RevisionNotFound":
                message = f"{response.status_code} Client Error." + "\n\n" + f"Revision Not Found for url: {response.url}."
                raise _format(RevisionNotFoundError, message, response) from e
    
            elif error_code == "EntryNotFound":
                message = f"{response.status_code} Client Error." + "\n\n" + f"Entry Not Found for url: {response.url}."
                raise _format(EntryNotFoundError, message, response) from e
    
            elif error_code == "GatedRepo":
                message = (
                    f"{response.status_code} Client Error." + "\n\n" + f"Cannot access gated repo for url {response.url}."
                )
                raise _format(GatedRepoError, message, response) from e
    
            elif error_message == "Access to this resource is disabled.":
                message = (
                    f"{response.status_code} Client Error."
                    + "\n\n"
                    + f"Cannot access repository for url {response.url}."
                    + "\n"
                    + "Access to this resource is disabled."
                )
                raise _format(DisabledRepoError, message, response) from e
    
            elif error_code == "RepoNotFound" or (
                response.status_code == 401
                and error_message != "Invalid credentials in Authorization header"
                and response.request is not None
                and response.request.url is not None
                and REPO_API_REGEX.search(response.request.url) is not None
            ):
                # 401 is misleading as it is returned for:
                #    - private and gated repos if user is not authenticated
                #    - missing repos
                # => for now, we process them as `RepoNotFound` anyway.
                # See https://gist.github.com/Wauplin/46c27ad266b15998ce56a6603796f0b9
                message = (
                    f"{response.status_code} Client Error."
                    + "\n\n"
                    + f"Repository Not Found for url: {response.url}."
                    + "\nPlease make sure you specified the correct `repo_id` and"
                    " `repo_type`.\nIf you are trying to access a private or gated repo,"
                    " make sure you are authenticated. For more details, see"
                    " https://huggingface..../docs/huggingface_hub/authentication"
                )
                raise _format(RepositoryNotFoundError, message, response) from e
    
            elif response.status_code == 400:
                message = (
                    f"\n\nBad request for {endpoint_name} endpoint:" if endpoint_name is not None else "\n\nBad request:"
                )
                raise _format(BadRequestError, message, response) from e
    
            elif response.status_code == 403:
                message = (
                    f"\n\n{response.status_code} Forbidden: {error_message}."
                    + f"\nCannot access content at: {response.url}."
                    + "\nMake sure your token has the correct permissions."
                )
                raise _format(HfHubHTTPError, message, response) from e
    
            elif response.status_code == 416:
                range_header = response.request.headers.get("Range")
                message = f"{e}. Requested range: {range_header}. Content-Range: {response.headers.get('Content-Range')}."
                raise _format(HfHubHTTPError, message, response) from e
    
            # Convert `HTTPError` into a `HfHubHTTPError` to display request information
            # as well (request id and/or server error message)
>           raise _format(HfHubHTTPError, str(e), response) from e
E           huggingface_hub.errors.HfHubHTTPError: 429 Client Error: Too Many Requests for url: https://huggingface..../resolve/main/config.json

.venv/lib/python3.12.../huggingface_hub/utils/_http.py:482: HfHubHTTPError

The above exception was the direct cause of the following exception:

path_or_repo_id = 'minishlab/potion-retrieval-32M', filenames = ['config.json']
cache_dir = '....../home/runner/.cache/huggingface/hub', force_download = False
resume_download = None, proxies = None, token = None, revision = None
local_files_only = False, subfolder = '', repo_type = None
user_agent = 'transformers/4.51.0; python/3.12.3; session_id/5c47f513e1bb46c6881252929c77821f; torch/2.6.0; file_type/config; from_auto_class/True'
_raise_exceptions_for_gated_repo = True
_raise_exceptions_for_missing_entries = True
_raise_exceptions_for_connection_errors = True, _commit_hash = None
deprecated_kwargs = {}, use_auth_token = None, full_filenames = ['config.json']
existing_files = [], filename = 'config.json'

    def cached_files(
        path_or_repo_id: Union[str, os.PathLike],
        filenames: list[str],
        cache_dir: Optional[Union[str, os.PathLike]] = None,
        force_download: bool = False,
        resume_download: Optional[bool] = None,
        proxies: Optional[dict[str, str]] = None,
        token: Optional[Union[bool, str]] = None,
        revision: Optional[str] = None,
        local_files_only: bool = False,
        subfolder: str = "",
        repo_type: Optional[str] = None,
        user_agent: Optional[Union[str, dict[str, str]]] = None,
        _raise_exceptions_for_gated_repo: bool = True,
        _raise_exceptions_for_missing_entries: bool = True,
        _raise_exceptions_for_connection_errors: bool = True,
        _commit_hash: Optional[str] = None,
        **deprecated_kwargs,
    ) -> Optional[str]:
        """
        Tries to locate several files in a local folder and repo, downloads and cache them if necessary.
    
        Args:
            path_or_repo_id (`str` or `os.PathLike`):
                This can be either:
                - a string, the *model id* of a model repo on huggingface.co.
                - a path to a *directory* potentially containing the file.
            filenames (`List[str]`):
                The name of all the files to locate in `path_or_repo`.
            cache_dir (`str` or `os.PathLike`, *optional*):
                Path to a directory in which a downloaded pretrained model configuration should be cached if the standard
                cache should not be used.
            force_download (`bool`, *optional*, defaults to `False`):
                Whether or not to force to (re-)download the configuration files and override the cached versions if they
                exist.
            resume_download:
                Deprecated and ignored. All downloads are now resumed by default when possible.
                Will be removed in v5 of Transformers.
            proxies (`Dict[str, str]`, *optional*):
                A dictionary of proxy servers to use by protocol or endpoint, e.g., `{'http': 'foo.bar:3128',
                'http://hostname': 'foo.bar:4012'}.` The proxies are used on each request.
            token (`str` or *bool*, *optional*):
                The token to use as HTTP bearer authorization for remote files. If `True`, will use the token generated
                when running `huggingface-cli login` (stored in `~/.huggingface`).
            revision (`str`, *optional*, defaults to `"main"`):
                The specific model version to use. It can be a branch name, a tag name, or a commit id, since we use a
                git-based system for storing models and other artifacts on huggingface.co, so `revision` can be any
                identifier allowed by git.
            local_files_only (`bool`, *optional*, defaults to `False`):
                If `True`, will only try to load the tokenizer configuration from local files.
            subfolder (`str`, *optional*, defaults to `""`):
                In case the relevant files are located inside a subfolder of the model repo on huggingface.co, you can
                specify the folder name here.
            repo_type (`str`, *optional*):
                Specify the repo type (useful when downloading from a space for instance).
    
        Private args:
            _raise_exceptions_for_gated_repo (`bool`):
                if False, do not raise an exception for gated repo error but return None.
            _raise_exceptions_for_missing_entries (`bool`):
                if False, do not raise an exception for missing entries but return None.
            _raise_exceptions_for_connection_errors (`bool`):
                if False, do not raise an exception for connection errors but return None.
            _commit_hash (`str`, *optional*):
                passed when we are chaining several calls to various files (e.g. when loading a tokenizer or
                a pipeline). If files are cached for this commit hash, avoid calls to head and get from the cache.
    
        <Tip>
    
        Passing `token=True` is required when you want to use a private model.
    
        </Tip>
    
        Returns:
            `Optional[str]`: Returns the resolved file (to the cache folder if downloaded from a repo).
    
        Examples:
    
        ```python
        # Download a model weight from the Hub and cache it.
        model_weights_file = cached_file("google-bert/bert-base-uncased", "pytorch_model.bin")
        ```
        """
        use_auth_token = deprecated_kwargs.pop("use_auth_token", None)
        if use_auth_token is not None:
            warnings.warn(
                "The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.",
                FutureWarning,
            )
            if token is not None:
                raise ValueError("`token` and `use_auth_token` are both specified. Please set only the argument `token`.")
            token = use_auth_token
    
        if is_offline_mode() and not local_files_only:
            logger.info("Offline mode: forcing local_files_only=True")
            local_files_only = True
        if subfolder is None:
            subfolder = ""
    
        # Add folder to filenames
        full_filenames = [os.path.join(subfolder, file) for file in filenames]
    
        path_or_repo_id = str(path_or_repo_id)
        existing_files = []
        for filename in full_filenames:
            if os.path.isdir(path_or_repo_id):
                resolved_file = os.path.join(path_or_repo_id, filename)
                if not os.path.isfile(resolved_file):
                    if _raise_exceptions_for_missing_entries and filename != os.path.join(subfolder, "config.json"):
                        revision_ = "main" if revision is None else revision
                        raise OSError(
                            f"{path_or_repo_id} does not appear to have a file named {filename}. Checkout "
                            f"'https://huggingface.co/{path_or_repo_id}/tree/{revision_}' for available files."
                        )
                    else:
                        return None
                existing_files.append(resolved_file)
    
        # All files exist
        if len(existing_files) == len(full_filenames):
            return existing_files
    
        if cache_dir is None:
            cache_dir = TRANSFORMERS_CACHE
        if isinstance(cache_dir, Path):
            cache_dir = str(cache_dir)
    
        existing_files = []
        file_counter = 0
        if _commit_hash is not None and not force_download:
            for filename in full_filenames:
                # If the file is cached under that commit hash, we return it directly.
                resolved_file = try_to_load_from_cache(
                    path_or_repo_id, filename, cache_dir=cache_dir, revision=_commit_hash, repo_type=repo_type
                )
                if resolved_file is not None:
                    if resolved_file is not _CACHED_NO_EXIST:
                        file_counter += 1
                        existing_files.append(resolved_file)
                    elif not _raise_exceptions_for_missing_entries:
                        file_counter += 1
                    else:
                        raise OSError(f"Could not locate {filename} inside {path_or_repo_id}.")
    
        # Either all the files were found, or some were _CACHED_NO_EXIST but we do not raise for missing entries
        if file_counter == len(full_filenames):
            return existing_files if len(existing_files) > 0 else None
    
        user_agent = http_user_agent(user_agent)
        # download the files if needed
        try:
            if len(full_filenames) == 1:
                # This is slightly better for only 1 file
>               hf_hub_download(
                    path_or_repo_id,
                    filenames[0],
                    subfolder=None if len(subfolder) == 0 else subfolder,
                    repo_type=repo_type,
                    revision=revision,
                    cache_dir=cache_dir,
                    user_agent=user_agent,
                    force_download=force_download,
                    proxies=proxies,
                    resume_download=resume_download,
                    token=token,
                    local_files_only=local_files_only,
                )

.venv/lib/python3.12.../transformers/utils/hub.py:424: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
.venv/lib/python3.12.../huggingface_hub/utils/_validators.py:114: in _inner_fn
    return fn(*args, **kwargs)
.venv/lib/python3.12...................../site-packages/huggingface_hub/file_download.py:961: in hf_hub_download
    return _hf_hub_download_to_cache_dir(
.venv/lib/python3.12...................../site-packages/huggingface_hub/file_download.py:1068: in _hf_hub_download_to_cache_dir
    _raise_on_head_call_error(head_call_error, force_download, local_files_only)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

head_call_error = HfHubHTTPError('429 Client Error: Too Many Requests for url: https://huggingface..../resolve/main/config.json')
force_download = False, local_files_only = False

    def _raise_on_head_call_error(head_call_error: Exception, force_download: bool, local_files_only: bool) -> NoReturn:
        """Raise an appropriate error when the HEAD call failed and we cannot locate a local file."""
        # No head call => we cannot force download.
        if force_download:
            if local_files_only:
                raise ValueError("Cannot pass 'force_download=True' and 'local_files_only=True' at the same time.")
            elif isinstance(head_call_error, OfflineModeIsEnabled):
                raise ValueError("Cannot pass 'force_download=True' when offline mode is enabled.") from head_call_error
            else:
                raise ValueError("Force download failed due to the above error.") from head_call_error
    
        # No head call + couldn't find an appropriate file on disk => raise an error.
        if local_files_only:
            raise LocalEntryNotFoundError(
                "Cannot find the requested files in the disk cache and outgoing traffic has been disabled. To enable"
                " hf.co look-ups and downloads online, set 'local_files_only' to False."
            )
        elif isinstance(head_call_error, (RepositoryNotFoundError, GatedRepoError)) or (
            isinstance(head_call_error, HfHubHTTPError) and head_call_error.response.status_code == 401
        ):
            # Repo not found or gated => let's raise the actual error
            # Unauthorized => likely a token issue => let's raise the actual error
            raise head_call_error
        else:
            # Otherwise: most likely a connection issue or Hub downtime => let's warn the user
>           raise LocalEntryNotFoundError(
                "An error happened while trying to locate the file on the Hub and we cannot find the requested files"
                " in the local cache. Please check your connection and try again or make sure your Internet connection"
                " is on."
            ) from head_call_error
E           huggingface_hub.errors.LocalEntryNotFoundError: An error happened while trying to locate the file on the Hub and we cannot find the requested files in the local cache. Please check your connection and try again or make sure your Internet connection is on.

.venv/lib/python3.12...................../site-packages/huggingface_hub/file_download.py:1599: LocalEntryNotFoundError

The above exception was the direct cause of the following exception:

cls = <class 'chonkie.embeddings.auto.AutoEmbeddings'>
model = 'minishlab/potion-retrieval-32M', kwargs = {}
embeddings_instance = None, embeddings_cls = None
SentenceTransformerEmbeddings = <class 'chonkie.embeddings.sentence_transformer.SentenceTransformerEmbeddings'>

    @classmethod
    def get_embeddings(cls, model: Union[str, BaseEmbeddings, Any], **kwargs: Any) -> BaseEmbeddings:
        """Get embeddings instance based on identifier.
    
        Args:
            model: Identifier for the embeddings (name, path, URL, etc.)
            **kwargs: Additional arguments passed to the embeddings constructor
    
        Returns:
            Initialized embeddings instance
    
        Raises:
            ValueError: If no suitable embeddings implementation is found
    
        Examples:
            # Get sentence transformers embeddings
            embeddings = AutoEmbeddings.get_embeddings("sentence-transformers/all-MiniLM-L6-v2")
    
            # Get OpenAI embeddings
            embeddings = AutoEmbeddings.get_embeddings("openai://text-embedding-ada-002", api_key="...")
    
            # Get Anthropic embeddings
            embeddings = AutoEmbeddings.get_embeddings("anthropic://claude-v1", api_key="...")
    
            # Get Cohere embeddings
            embeddings = AutoEmbeddings.get_embeddings("cohere://embed-english-light-v3.0", api_key="...")
    
        """
        # Load embeddings instance if already provided
        if isinstance(model, BaseEmbeddings):
            return model
        elif isinstance(model, str):
            # Initializing the embedding instance
            embeddings_instance = None
    
            # Check if the user passed in a provider alias
            if "://" in model:
                provider, model_name = model.split("://")
                embeddings_cls = EmbeddingsRegistry.get_provider(provider)
                if embeddings_cls:
                    try:
                        return embeddings_cls(model_name, **kwargs)  # type: ignore
                    except Exception as error:
                        raise ValueError(f"Failed to load {model} with {embeddings_cls.__name__}, with error: {error}")
                else:
                    raise ValueError(f"No provider found for {provider}. Please check the provider name and try again.")
            else:
                # Try to find matching implementation via registry
                embeddings_cls = EmbeddingsRegistry.match(model)
                if embeddings_cls:
                        try:
                            # Try instantiating with the model identifier
                            embeddings_instance = embeddings_cls(model, **kwargs)  # type: ignore
                        except Exception as error:
                            warnings.warn(
                                f"Failed to load {model} with {embeddings_cls.__name__}: {error}\n"
                                f"Falling back to loading default provider model."
                            )
                            try:
                                # Try instantiating with the default provider model without the model identifier
                                embeddings_instance = embeddings_cls(**kwargs)
                            except Exception as error:
                                warnings.warn(
                                    f"Failed to load the default model for {embeddings_cls.__name__}: {error}\n"
                                    f"Falling back to SentenceTransformerEmbeddings."
                                )
    
            # If registry lookup and instantiation succeeded, return the instance
            if embeddings_instance:
                return embeddings_instance
    
            # If registry lookup and instantiation failed, return the default SentenceTransformerEmbeddings
            from .sentence_transformer import SentenceTransformerEmbeddings
            try:
>               return SentenceTransformerEmbeddings(model, **kwargs)

.../chonkie/embeddings/auto.py:107: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
.../chonkie/embeddings/sentence_transformer.py:49: in __init__
    self.model = SentenceTransformer(self.model_name_or_path, **kwargs)
.venv/lib/python3.12....../site-packages/sentence_transformers/SentenceTransformer.py:321: in __init__
    modules = self._load_auto_model(
.venv/lib/python3.12....../site-packages/sentence_transformers/SentenceTransformer.py:1600: in _load_auto_model
    transformer_model = Transformer(
.venv/lib/python3.12.../sentence_transformers/models/Transformer.py:80: in __init__
    config, is_peft_model = self._load_config(model_name_or_path, cache_dir, backend, config_args)
.venv/lib/python3.12.../sentence_transformers/models/Transformer.py:145: in _load_config
    return AutoConfig.from_pretrained(model_name_or_path, **config_args, cache_dir=cache_dir), False
.venv/lib/python3.12.../models/auto/configuration_auto.py:1112: in from_pretrained
    config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
.venv/lib/python3.12....../site-packages/transformers/configuration_utils.py:590: in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
.venv/lib/python3.12....../site-packages/transformers/configuration_utils.py:649: in _get_config_dict
    resolved_config_file = cached_file(
.venv/lib/python3.12.../transformers/utils/hub.py:266: in cached_file
    file = cached_files(path_or_repo_id=path_or_repo_id, filenames=[filename], **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

path_or_repo_id = 'minishlab/potion-retrieval-32M', filenames = ['config.json']
cache_dir = '....../home/runner/.cache/huggingface/hub', force_download = False
resume_download = None, proxies = None, token = None, revision = None
local_files_only = False, subfolder = '', repo_type = None
user_agent = 'transformers/4.51.0; python/3.12.3; session_id/5c47f513e1bb46c6881252929c77821f; torch/2.6.0; file_type/config; from_auto_class/True'
_raise_exceptions_for_gated_repo = True
_raise_exceptions_for_missing_entries = True
_raise_exceptions_for_connection_errors = True, _commit_hash = None
deprecated_kwargs = {}, use_auth_token = None, full_filenames = ['config.json']
existing_files = [], filename = 'config.json'

    def cached_files(
        path_or_repo_id: Union[str, os.PathLike],
        filenames: list[str],
        cache_dir: Optional[Union[str, os.PathLike]] = None,
        force_download: bool = False,
        resume_download: Optional[bool] = None,
        proxies: Optional[dict[str, str]] = None,
        token: Optional[Union[bool, str]] = None,
        revision: Optional[str] = None,
        local_files_only: bool = False,
        subfolder: str = "",
        repo_type: Optional[str] = None,
        user_agent: Optional[Union[str, dict[str, str]]] = None,
        _raise_exceptions_for_gated_repo: bool = True,
        _raise_exceptions_for_missing_entries: bool = True,
        _raise_exceptions_for_connection_errors: bool = True,
        _commit_hash: Optional[str] = None,
        **deprecated_kwargs,
    ) -> Optional[str]:
        """
        Tries to locate several files in a local folder and repo, downloads and cache them if necessary.
    
        Args:
            path_or_repo_id (`str` or `os.PathLike`):
                This can be either:
                - a string, the *model id* of a model repo on huggingface.co.
                - a path to a *directory* potentially containing the file.
            filenames (`List[str]`):
                The name of all the files to locate in `path_or_repo`.
            cache_dir (`str` or `os.PathLike`, *optional*):
                Path to a directory in which a downloaded pretrained model configuration should be cached if the standard
                cache should not be used.
            force_download (`bool`, *optional*, defaults to `False`):
                Whether or not to force to (re-)download the configuration files and override the cached versions if they
                exist.
            resume_download:
                Deprecated and ignored. All downloads are now resumed by default when possible.
                Will be removed in v5 of Transformers.
            proxies (`Dict[str, str]`, *optional*):
                A dictionary of proxy servers to use by protocol or endpoint, e.g., `{'http': 'foo.bar:3128',
                'http://hostname': 'foo.bar:4012'}.` The proxies are used on each request.
            token (`str` or *bool*, *optional*):
                The token to use as HTTP bearer authorization for remote files. If `True`, will use the token generated
                when running `huggingface-cli login` (stored in `~/.huggingface`).
            revision (`str`, *optional*, defaults to `"main"`):
                The specific model version to use. It can be a branch name, a tag name, or a commit id, since we use a
                git-based system for storing models and other artifacts on huggingface.co, so `revision` can be any
                identifier allowed by git.
            local_files_only (`bool`, *optional*, defaults to `False`):
                If `True`, will only try to load the tokenizer configuration from local files.
            subfolder (`str`, *optional*, defaults to `""`):
                In case the relevant files are located inside a subfolder of the model repo on huggingface.co, you can
                specify the folder name here.
            repo_type (`str`, *optional*):
                Specify the repo type (useful when downloading from a space for instance).
    
        Private args:
            _raise_exceptions_for_gated_repo (`bool`):
                if False, do not raise an exception for gated repo error but return None.
            _raise_exceptions_for_missing_entries (`bool`):
                if False, do not raise an exception for missing entries but return None.
            _raise_exceptions_for_connection_errors (`bool`):
                if False, do not raise an exception for connection errors but return None.
            _commit_hash (`str`, *optional*):
                passed when we are chaining several calls to various files (e.g. when loading a tokenizer or
                a pipeline). If files are cached for this commit hash, avoid calls to head and get from the cache.
    
        <Tip>
    
        Passing `token=True` is required when you want to use a private model.
    
        </Tip>
    
        Returns:
            `Optional[str]`: Returns the resolved file (to the cache folder if downloaded from a repo).
    
        Examples:
    
        ```python
        # Download a model weight from the Hub and cache it.
        model_weights_file = cached_file("google-bert/bert-base-uncased", "pytorch_model.bin")
        ```
        """
        use_auth_token = deprecated_kwargs.pop("use_auth_token", None)
        if use_auth_token is not None:
            warnings.warn(
                "The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.",
                FutureWarning,
            )
            if token is not None:
                raise ValueError("`token` and `use_auth_token` are both specified. Please set only the argument `token`.")
            token = use_auth_token
    
        if is_offline_mode() and not local_files_only:
            logger.info("Offline mode: forcing local_files_only=True")
            local_files_only = True
        if subfolder is None:
            subfolder = ""
    
        # Add folder to filenames
        full_filenames = [os.path.join(subfolder, file) for file in filenames]
    
        path_or_repo_id = str(path_or_repo_id)
        existing_files = []
        for filename in full_filenames:
            if os.path.isdir(path_or_repo_id):
                resolved_file = os.path.join(path_or_repo_id, filename)
                if not os.path.isfile(resolved_file):
                    if _raise_exceptions_for_missing_entries and filename != os.path.join(subfolder, "config.json"):
                        revision_ = "main" if revision is None else revision
                        raise OSError(
                            f"{path_or_repo_id} does not appear to have a file named {filename}. Checkout "
                            f"'https://huggingface.co/{path_or_repo_id}/tree/{revision_}' for available files."
                        )
                    else:
                        return None
                existing_files.append(resolved_file)
    
        # All files exist
        if len(existing_files) == len(full_filenames):
            return existing_files
    
        if cache_dir is None:
            cache_dir = TRANSFORMERS_CACHE
        if isinstance(cache_dir, Path):
            cache_dir = str(cache_dir)
    
        existing_files = []
        file_counter = 0
        if _commit_hash is not None and not force_download:
            for filename in full_filenames:
                # If the file is cached under that commit hash, we return it directly.
                resolved_file = try_to_load_from_cache(
                    path_or_repo_id, filename, cache_dir=cache_dir, revision=_commit_hash, repo_type=repo_type
                )
                if resolved_file is not None:
                    if resolved_file is not _CACHED_NO_EXIST:
                        file_counter += 1
                        existing_files.append(resolved_file)
                    elif not _raise_exceptions_for_missing_entries:
                        file_counter += 1
                    else:
                        raise OSError(f"Could not locate {filename} inside {path_or_repo_id}.")
    
        # Either all the files were found, or some were _CACHED_NO_EXIST but we do not raise for missing entries
        if file_counter == len(full_filenames):
            return existing_files if len(existing_files) > 0 else None
    
        user_agent = http_user_agent(user_agent)
        # download the files if needed
        try:
            if len(full_filenames) == 1:
                # This is slightly better for only 1 file
                hf_hub_download(
                    path_or_repo_id,
                    filenames[0],
                    subfolder=None if len(subfolder) == 0 else subfolder,
                    repo_type=repo_type,
                    revision=revision,
                    cache_dir=cache_dir,
                    user_agent=user_agent,
                    force_download=force_download,
                    proxies=proxies,
                    resume_download=resume_download,
                    token=token,
                    local_files_only=local_files_only,
                )
            else:
                snapshot_download(
                    path_or_repo_id,
                    allow_patterns=full_filenames,
                    repo_type=repo_type,
                    revision=revision,
                    cache_dir=cache_dir,
                    user_agent=user_agent,
                    force_download=force_download,
                    proxies=proxies,
                    resume_download=resume_download,
                    token=token,
                    local_files_only=local_files_only,
                )
    
        except Exception as e:
            # We cannot recover from them
            if isinstance(e, RepositoryNotFoundError) and not isinstance(e, GatedRepoError):
                raise OSError(
                    f"{path_or_repo_id} is not a local folder and is not a valid model identifier "
                    "listed on 'https://huggingface.co/models'\nIf this is a private repository, make sure to pass a token "
                    "having permission to this repo either by logging in with `huggingface-cli login` or by passing "
                    "`token=<your_token>`"
                ) from e
            elif isinstance(e, RevisionNotFoundError):
                raise OSError(
                    f"{revision} is not a valid git identifier (branch name, tag name or commit id) that exists "
                    "for this model name. Check the model page at "
                    f"'https://huggingface.co/{path_or_repo_id}' for available revisions."
                ) from e
    
            # Now we try to recover if we can find all files correctly in the cache
            resolved_files = [
                _get_cache_file_to_return(path_or_repo_id, filename, cache_dir, revision) for filename in full_filenames
            ]
            if all(file is not None for file in resolved_files):
                return resolved_files
    
            # Raise based on the flags. Note that we will raise for missing entries at the very end, even when
            # not entering this Except block, as it may also happen when `snapshot_download` does not raise
            if isinstance(e, GatedRepoError):
                if not _raise_exceptions_for_gated_repo:
                    return None
                raise OSError(
                    "You are trying to access a gated repo.\nMake sure to have access to it at "
                    f"https://huggingface.co/{path_or_repo_id}.\n{str(e)}"
                ) from e
            elif isinstance(e, LocalEntryNotFoundError):
                if not _raise_exceptions_for_connection_errors:
                    return None
                # Here we only raise if both flags for missing entry and connection errors are True (because it can be raised
                # even when `local_files_only` is True, in which case raising for connections errors only would not make sense)
                elif _raise_exceptions_for_missing_entries:
>                   raise OSError(
                        f"We couldn't connect to '{HUGGINGFACE_CO_RESOLVE_ENDPOINT}' to load the files, and couldn't find them in the"
                        f" cached files.\nCheckout your internet connection or see how to run the library in offline mode at"
                        " 'https://huggingface..../docs/transformers/installation#offline-mode'."
                    ) from e
E                   OSError: We couldn't connect to 'https://huggingface.co' to load the files, and couldn't find them in the cached files.
E                   Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface..../docs/transformers/installation#offline-mode'.

.venv/lib/python3.12.../transformers/utils/hub.py:491: OSError

During handling of the above exception, another exception occurred:

    @pytest.fixture(scope="module")
    def real_embeddings() -> BaseEmbeddings:
        """Provide an instance of the actual default embedding model."""
        # Use scope="module" to load the model only once per test module run
        # Set environment variable to potentially avoid Hugging Face Hub login prompts in some CI environments
        os.environ["HF_HUB_DISABLE_PROGRESS_BARS"] = "1"
>       return AutoEmbeddings.get_embeddings(DEFAULT_EMBEDDING_MODEL)

tests/handshakes/test_qdrant_handshake.py:61: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

cls = <class 'chonkie.embeddings.auto.AutoEmbeddings'>
model = 'minishlab/potion-retrieval-32M', kwargs = {}
embeddings_instance = None, embeddings_cls = None
SentenceTransformerEmbeddings = <class 'chonkie.embeddings.sentence_transformer.SentenceTransformerEmbeddings'>

    @classmethod
    def get_embeddings(cls, model: Union[str, BaseEmbeddings, Any], **kwargs: Any) -> BaseEmbeddings:
        """Get embeddings instance based on identifier.
    
        Args:
            model: Identifier for the embeddings (name, path, URL, etc.)
            **kwargs: Additional arguments passed to the embeddings constructor
    
        Returns:
            Initialized embeddings instance
    
        Raises:
            ValueError: If no suitable embeddings implementation is found
    
        Examples:
            # Get sentence transformers embeddings
            embeddings = AutoEmbeddings.get_embeddings("sentence-transformers/all-MiniLM-L6-v2")
    
            # Get OpenAI embeddings
            embeddings = AutoEmbeddings.get_embeddings("openai://text-embedding-ada-002", api_key="...")
    
            # Get Anthropic embeddings
            embeddings = AutoEmbeddings.get_embeddings("anthropic://claude-v1", api_key="...")
    
            # Get Cohere embeddings
            embeddings = AutoEmbeddings.get_embeddings("cohere://embed-english-light-v3.0", api_key="...")
    
        """
        # Load embeddings instance if already provided
        if isinstance(model, BaseEmbeddings):
            return model
        elif isinstance(model, str):
            # Initializing the embedding instance
            embeddings_instance = None
    
            # Check if the user passed in a provider alias
            if "://" in model:
                provider, model_name = model.split("://")
                embeddings_cls = EmbeddingsRegistry.get_provider(provider)
                if embeddings_cls:
                    try:
                        return embeddings_cls(model_name, **kwargs)  # type: ignore
                    except Exception as error:
                        raise ValueError(f"Failed to load {model} with {embeddings_cls.__name__}, with error: {error}")
                else:
                    raise ValueError(f"No provider found for {provider}. Please check the provider name and try again.")
            else:
                # Try to find matching implementation via registry
                embeddings_cls = EmbeddingsRegistry.match(model)
                if embeddings_cls:
                        try:
                            # Try instantiating with the model identifier
                            embeddings_instance = embeddings_cls(model, **kwargs)  # type: ignore
                        except Exception as error:
                            warnings.warn(
                                f"Failed to load {model} with {embeddings_cls.__name__}: {error}\n"
                                f"Falling back to loading default provider model."
                            )
                            try:
                                # Try instantiating with the default provider model without the model identifier
                                embeddings_instance = embeddings_cls(**kwargs)
                            except Exception as error:
                                warnings.warn(
                                    f"Failed to load the default model for {embeddings_cls.__name__}: {error}\n"
                                    f"Falling back to SentenceTransformerEmbeddings."
                                )
    
            # If registry lookup and instantiation succeeded, return the instance
            if embeddings_instance:
                return embeddings_instance
    
            # If registry lookup and instantiation failed, return the default SentenceTransformerEmbeddings
            from .sentence_transformer import SentenceTransformerEmbeddings
            try:
                return SentenceTransformerEmbeddings(model, **kwargs)
            except Exception as e:
>               raise ValueError(f"Failed to load embeddings via SentenceTransformerEmbeddings after registry/fallback failure: {e}")
E               ValueError: Failed to load embeddings via SentenceTransformerEmbeddings after registry/fallback failure: We couldn't connect to 'https://huggingface.co' to load the files, and couldn't find them in the cached files.
E               Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface..../docs/transformers/installation#offline-mode'.

.../chonkie/embeddings/auto.py:109: ValueError
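
The traces in this report show the module-scoped `real_embeddings` fixture failing because the Hugging Face Hub could not be reached, and one trace shows a `429 Too Many Requests` response. One possible mitigation, sketched here as an assumption and not as something this PR implements, is to retry the model load with exponential backoff. The helper name `load_with_retry` and its parameters are hypothetical; only `AutoEmbeddings.get_embeddings` and `DEFAULT_EMBEDDING_MODEL` come from the trace.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def load_with_retry(
    loader: Callable[[], T],
    retries: int = 3,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (OSError, ValueError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `loader`, retrying with exponential backoff on the given errors.

    Hypothetical helper: `retry_on` defaults to the two exception types seen
    in the trace (OSError from transformers, ValueError from AutoEmbeddings).
    """
    for attempt in range(retries + 1):
        try:
            return loader()
        except retry_on:
            if attempt == retries:
                raise  # out of attempts: surface the original error
            # Wait base_delay, 2*base_delay, 4*base_delay, ... between attempts
            sleep(base_delay * (2 ** attempt))
    raise AssertionError("retries must be >= 0")
```

In the fixture this might be used as `load_with_retry(lambda: AutoEmbeddings.get_embeddings(DEFAULT_EMBEDDING_MODEL))`. Because `ValueError` is also raised for non-transient failures, a real fix would probably want a narrower check than this sketch uses.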

tests.handshakes.test_qdrant_handshake::test_qdrant_handshake_init_existing_collection

Stack Traces | 0.001s run time

response = <Response [429]>, endpoint_name = None

    def hf_raise_for_status(response: Response, endpoint_name: Optional[str] = None) -> None:
        """
        Internal version of `response.raise_for_status()` that will refine a
        potential HTTPError. Raised exception will be an instance of `HfHubHTTPError`.
    
        This helper is meant to be the unique method to raise_for_status when making a call
        to the Hugging Face Hub.
    
    
        Example:
        ```py
            import requests
            from huggingface_hub.utils import get_session, hf_raise_for_status, HfHubHTTPError
    
            response = get_session().post(...)
            try:
                hf_raise_for_status(response)
            except HfHubHTTPError as e:
                print(str(e)) # formatted message
                e.request_id, e.server_message # details returned by server
    
                # Complete the error message with additional information once it's raised
                e.append_to_message("\n`create_commit` expects the repository to exist.")
                raise
        ```
    
        Args:
            response (`Response`):
                Response from the server.
            endpoint_name (`str`, *optional*):
                Name of the endpoint that has been called. If provided, the error message
                will be more complete.
    
        <Tip warning={true}>
    
        Raises when the request has failed:
    
            - [`~utils.RepositoryNotFoundError`]
                If the repository to download from cannot be found. This may be because it
                doesn't exist, because `repo_type` is not set correctly, or because the repo
                is `private` and you do not have access.
            - [`~utils.GatedRepoError`]
                If the repository exists but is gated and the user is not on the authorized
                list.
            - [`~utils.RevisionNotFoundError`]
                If the repository exists but the revision couldn't be find.
            - [`~utils.EntryNotFoundError`]
                If the repository exists but the entry (e.g. the requested file) couldn't be
                find.
            - [`~utils.BadRequestError`]
                If request failed with a HTTP 400 BadRequest error.
            - [`~utils.HfHubHTTPError`]
                If request failed for a reason not listed above.
    
        </Tip>
        """
        try:
>           response.raise_for_status()

.venv/lib/python3.11.../huggingface_hub/utils/_http.py:409: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <Response [429]>

    def raise_for_status(self):
        """Raises :class:`HTTPError`, if one occurred."""
    
        http_error_msg = ""
        if isinstance(self.reason, bytes):
            # We attempt to decode utf-8 first because some servers
            # choose to localize their reason strings. If the string
            # isn't utf-8, we fall back to iso-8859-1 for all other
            # encodings. (See PR #3538)
            try:
                reason = self.reason.decode("utf-8")
            except UnicodeDecodeError:
                reason = self.reason.decode("iso-8859-1")
        else:
            reason = self.reason
    
        if 400 <= self.status_code < 500:
            http_error_msg = (
                f"{self.status_code} Client Error: {reason} for url: {self.url}"
            )
    
        elif 500 <= self.status_code < 600:
            http_error_msg = (
                f"{self.status_code} Server Error: {reason} for url: {self.url}"
            )
    
        if http_error_msg:
>           raise HTTPError(http_error_msg, response=self)
E           requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://huggingface..../resolve/main/config.json

.venv/lib/python3.11.../site-packages/requests/models.py:1024: HTTPError

The above exception was the direct cause of the following exception:

    def _get_metadata_or_catch_error(
        *,
        repo_id: str,
        filename: str,
        repo_type: str,
        revision: str,
        endpoint: Optional[str],
        proxies: Optional[Dict],
        etag_timeout: Optional[float],
        headers: Dict[str, str],  # mutated inplace!
        token: Union[bool, str, None],
        local_files_only: bool,
        relative_filename: Optional[str] = None,  # only used to store `.no_exists` in cache
        storage_folder: Optional[str] = None,  # only used to store `.no_exists` in cache
    ) -> Union[
        # Either an exception is caught and returned
        Tuple[None, None, None, None, None, Exception],
        # Or the metadata is returned as
        # `(url_to_download, etag, commit_hash, expected_size, xet_file_data, None)`
        Tuple[str, str, str, int, Optional[XetFileData], None],
    ]:
        """Get metadata for a file on the Hub, safely handling network issues.
    
        Returns either the etag, commit_hash and expected size of the file, or the error
        raised while fetching the metadata.
    
        NOTE: This function mutates `headers` inplace! It removes the `authorization` header
              if the file is a LFS blob and the domain of the url is different from the
              domain of the location (typically an S3 bucket).
        """
        if local_files_only:
            return (
                None,
                None,
                None,
                None,
                None,
                OfflineModeIsEnabled(
                    f"Cannot access file since 'local_files_only=True' as been set. (repo_id: {repo_id}, repo_type: {repo_type}, revision: {revision}, filename: {filename})"
                ),
            )
    
        url = hf_hub_url(repo_id, filename, repo_type=repo_type, revision=revision, endpoint=endpoint)
        url_to_download: str = url
        etag: Optional[str] = None
        commit_hash: Optional[str] = None
        expected_size: Optional[int] = None
        head_error_call: Optional[Exception] = None
        xet_file_data: Optional[XetFileData] = None
    
        # Try to get metadata from the server.
        # Do not raise yet if the file is not found or not accessible.
        if not local_files_only:
            try:
                try:
>                   metadata = get_hf_file_metadata(
                        url=url, proxies=proxies, timeout=etag_timeout, headers=headers, token=token
                    )

.venv/lib/python3.11...................../site-packages/huggingface_hub/file_download.py:1484: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
.venv/lib/python3.11.../huggingface_hub/utils/_validators.py:114: in _inner_fn
    return fn(*args, **kwargs)
.venv/lib/python3.11...................../site-packages/huggingface_hub/file_download.py:1401: in get_hf_file_metadata
    r = _request_wrapper(
.venv/lib/python3.11...................../site-packages/huggingface_hub/file_download.py:285: in _request_wrapper
    response = _request_wrapper(
.venv/lib/python3.11...................../site-packages/huggingface_hub/file_download.py:309: in _request_wrapper
    hf_raise_for_status(response)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

response = <Response [429]>, endpoint_name = None

    def hf_raise_for_status(response: Response, endpoint_name: Optional[str] = None) -> None:
        """
        Internal version of `response.raise_for_status()` that will refine a
        potential HTTPError. Raised exception will be an instance of `HfHubHTTPError`.
    
        This helper is meant to be the unique method to raise_for_status when making a call
        to the Hugging Face Hub.
    
    
        Example:
        ```py
            import requests
            from huggingface_hub.utils import get_session, hf_raise_for_status, HfHubHTTPError
    
            response = get_session().post(...)
            try:
                hf_raise_for_status(response)
            except HfHubHTTPError as e:
                print(str(e)) # formatted message
                e.request_id, e.server_message # details returned by server
    
                # Complete the error message with additional information once it's raised
                e.append_to_message("\n`create_commit` expects the repository to exist.")
                raise
        ```
    
        Args:
            response (`Response`):
                Response from the server.
            endpoint_name (`str`, *optional*):
                Name of the endpoint that has been called. If provided, the error message
                will be more complete.
    
        <Tip warning={true}>
    
        Raises when the request has failed:
    
            - [`~utils.RepositoryNotFoundError`]
                If the repository to download from cannot be found. This may be because it
                doesn't exist, because `repo_type` is not set correctly, or because the repo
                is `private` and you do not have access.
            - [`~utils.GatedRepoError`]
                If the repository exists but is gated and the user is not on the authorized
                list.
            - [`~utils.RevisionNotFoundError`]
                If the repository exists but the revision couldn't be find.
            - [`~utils.EntryNotFoundError`]
                If the repository exists but the entry (e.g. the requested file) couldn't be
                find.
            - [`~utils.BadRequestError`]
                If request failed with a HTTP 400 BadRequest error.
            - [`~utils.HfHubHTTPError`]
                If request failed for a reason not listed above.
    
        </Tip>
        """
        try:
            response.raise_for_status()
        except HTTPError as e:
            error_code = response.headers.get("X-Error-Code")
            error_message = response.headers.get("X-Error-Message")
    
            if error_code == "RevisionNotFound":
                message = f"{response.status_code} Client Error." + "\n\n" + f"Revision Not Found for url: {response.url}."
                raise _format(RevisionNotFoundError, message, response) from e
    
            elif error_code == "EntryNotFound":
                message = f"{response.status_code} Client Error." + "\n\n" + f"Entry Not Found for url: {response.url}."
                raise _format(EntryNotFoundError, message, response) from e
    
            elif error_code == "GatedRepo":
                message = (
                    f"{response.status_code} Client Error." + "\n\n" + f"Cannot access gated repo for url {response.url}."
                )
                raise _format(GatedRepoError, message, response) from e
    
            elif error_message == "Access to this resource is disabled.":
                message = (
                    f"{response.status_code} Client Error."
                    + "\n\n"
                    + f"Cannot access repository for url {response.url}."
                    + "\n"
                    + "Access to this resource is disabled."
                )
                raise _format(DisabledRepoError, message, response) from e
    
            elif error_code == "RepoNotFound" or (
                response.status_code == 401
                and error_message != "Invalid credentials in Authorization header"
                and response.request is not None
                and response.request.url is not None
                and REPO_API_REGEX.search(response.request.url) is not None
            ):
                # 401 is misleading as it is returned for:
                #    - private and gated repos if user is not authenticated
                #    - missing repos
                # => for now, we process them as `RepoNotFound` anyway.
                # See https://gist.github.com/Wauplin/46c27ad266b15998ce56a6603796f0b9
                message = (
                    f"{response.status_code} Client Error."
                    + "\n\n"
                    + f"Repository Not Found for url: {response.url}."
                    + "\nPlease make sure you specified the correct `repo_id` and"
                    " `repo_type`.\nIf you are trying to access a private or gated repo,"
                    " make sure you are authenticated. For more details, see"
                    " https://huggingface..../docs/huggingface_hub/authentication"
                )
                raise _format(RepositoryNotFoundError, message, response) from e
    
            elif response.status_code == 400:
                message = (
                    f"\n\nBad request for {endpoint_name} endpoint:" if endpoint_name is not None else "\n\nBad request:"
                )
                raise _format(BadRequestError, message, response) from e
    
            elif response.status_code == 403:
                message = (
                    f"\n\n{response.status_code} Forbidden: {error_message}."
                    + f"\nCannot access content at: {response.url}."
                    + "\nMake sure your token has the correct permissions."
                )
                raise _format(HfHubHTTPError, message, response) from e
    
            elif response.status_code == 416:
                range_header = response.request.headers.get("Range")
                message = f"{e}. Requested range: {range_header}. Content-Range: {response.headers.get('Content-Range')}."
                raise _format(HfHubHTTPError, message, response) from e
    
            # Convert `HTTPError` into a `HfHubHTTPError` to display request information
            # as well (request id and/or server error message)
>           raise _format(HfHubHTTPError, str(e), response) from e
E           huggingface_hub.errors.HfHubHTTPError: 429 Client Error: Too Many Requests for url: https://huggingface..../resolve/main/config.json

.venv/lib/python3.11.../huggingface_hub/utils/_http.py:482: HfHubHTTPError

The above exception was the direct cause of the following exception:

path_or_repo_id = 'minishlab/potion-retrieval-32M', filenames = ['config.json']
cache_dir = '....../home/runner/.cache/huggingface/hub', force_download = False
resume_download = None, proxies = None, token = None, revision = None
local_files_only = False, subfolder = '', repo_type = None
user_agent = 'transformers/4.51.0; python/3.11.12; session_id/d10118505c4c48c2af247ede6597ebeb; torch/2.6.0; file_type/config; from_auto_class/True'
_raise_exceptions_for_gated_repo = True
_raise_exceptions_for_missing_entries = True
_raise_exceptions_for_connection_errors = True, _commit_hash = None
deprecated_kwargs = {}, use_auth_token = None, full_filenames = ['config.json']
existing_files = [], filename = 'config.json', file_counter = 0

    def cached_files(
        path_or_repo_id: Union[str, os.PathLike],
        filenames: list[str],
        cache_dir: Optional[Union[str, os.PathLike]] = None,
        force_download: bool = False,
        resume_download: Optional[bool] = None,
        proxies: Optional[dict[str, str]] = None,
        token: Optional[Union[bool, str]] = None,
        revision: Optional[str] = None,
        local_files_only: bool = False,
        subfolder: str = "",
        repo_type: Optional[str] = None,
        user_agent: Optional[Union[str, dict[str, str]]] = None,
        _raise_exceptions_for_gated_repo: bool = True,
        _raise_exceptions_for_missing_entries: bool = True,
        _raise_exceptions_for_connection_errors: bool = True,
        _commit_hash: Optional[str] = None,
        **deprecated_kwargs,
    ) -> Optional[str]:
        """
        Tries to locate several files in a local folder and repo, downloads and cache them if necessary.
    
        Args:
            path_or_repo_id (`str` or `os.PathLike`):
                This can be either:
                - a string, the *model id* of a model repo on huggingface.co.
                - a path to a *directory* potentially containing the file.
            filenames (`List[str]`):
                The name of all the files to locate in `path_or_repo`.
            cache_dir (`str` or `os.PathLike`, *optional*):
                Path to a directory in which a downloaded pretrained model configuration should be cached if the standard
                cache should not be used.
            force_download (`bool`, *optional*, defaults to `False`):
                Whether or not to force to (re-)download the configuration files and override the cached versions if they
                exist.
            resume_download:
                Deprecated and ignored. All downloads are now resumed by default when possible.
                Will be removed in v5 of Transformers.
            proxies (`Dict[str, str]`, *optional*):
                A dictionary of proxy servers to use by protocol or endpoint, e.g., `{'http': 'foo.bar:3128',
                'http://hostname': 'foo.bar:4012'}.` The proxies are used on each request.
            token (`str` or *bool*, *optional*):
                The token to use as HTTP bearer authorization for remote files. If `True`, will use the token generated
                when running `huggingface-cli login` (stored in `~/.huggingface`).
            revision (`str`, *optional*, defaults to `"main"`):
                The specific model version to use. It can be a branch name, a tag name, or a commit id, since we use a
                git-based system for storing models and other artifacts on huggingface.co, so `revision` can be any
                identifier allowed by git.
            local_files_only (`bool`, *optional*, defaults to `False`):
                If `True`, will only try to load the tokenizer configuration from local files.
            subfolder (`str`, *optional*, defaults to `""`):
                In case the relevant files are located inside a subfolder of the model repo on huggingface.co, you can
                specify the folder name here.
            repo_type (`str`, *optional*):
                Specify the repo type (useful when downloading from a space for instance).
    
        Private args:
            _raise_exceptions_for_gated_repo (`bool`):
                if False, do not raise an exception for gated repo error but return None.
            _raise_exceptions_for_missing_entries (`bool`):
                if False, do not raise an exception for missing entries but return None.
            _raise_exceptions_for_connection_errors (`bool`):
                if False, do not raise an exception for connection errors but return None.
            _commit_hash (`str`, *optional*):
                passed when we are chaining several calls to various files (e.g. when loading a tokenizer or
                a pipeline). If files are cached for this commit hash, avoid calls to head and get from the cache.
    
        <Tip>
    
        Passing `token=True` is required when you want to use a private model.
    
        </Tip>
    
        Returns:
            `Optional[str]`: Returns the resolved file (to the cache folder if downloaded from a repo).
    
        Examples:
    
        ```python
        # Download a model weight from the Hub and cache it.
        model_weights_file = cached_file("google-bert/bert-base-uncased", "pytorch_model.bin")
        ```
        """
        use_auth_token = deprecated_kwargs.pop("use_auth_token", None)
        if use_auth_token is not None:
            warnings.warn(
                "The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.",
                FutureWarning,
            )
            if token is not None:
                raise ValueError("`token` and `use_auth_token` are both specified. Please set only the argument `token`.")
            token = use_auth_token
    
        if is_offline_mode() and not local_files_only:
            logger.info("Offline mode: forcing local_files_only=True")
            local_files_only = True
        if subfolder is None:
            subfolder = ""
    
        # Add folder to filenames
        full_filenames = [os.path.join(subfolder, file) for file in filenames]
    
        path_or_repo_id = str(path_or_repo_id)
        existing_files = []
        for filename in full_filenames:
            if os.path.isdir(path_or_repo_id):
                resolved_file = os.path.join(path_or_repo_id, filename)
                if not os.path.isfile(resolved_file):
                    if _raise_exceptions_for_missing_entries and filename != os.path.join(subfolder, "config.json"):
                        revision_ = "main" if revision is None else revision
                        raise OSError(
                            f"{path_or_repo_id} does not appear to have a file named {filename}. Checkout "
                            f"'https://huggingface.co/{path_or_repo_id}/tree/{revision_}' for available files."
                        )
                    else:
                        return None
                existing_files.append(resolved_file)
    
        # All files exist
        if len(existing_files) == len(full_filenames):
            return existing_files
    
        if cache_dir is None:
            cache_dir = TRANSFORMERS_CACHE
        if isinstance(cache_dir, Path):
            cache_dir = str(cache_dir)
    
        existing_files = []
        file_counter = 0
        if _commit_hash is not None and not force_download:
            for filename in full_filenames:
                # If the file is cached under that commit hash, we return it directly.
                resolved_file = try_to_load_from_cache(
                    path_or_repo_id, filename, cache_dir=cache_dir, revision=_commit_hash, repo_type=repo_type
                )
                if resolved_file is not None:
                    if resolved_file is not _CACHED_NO_EXIST:
                        file_counter += 1
                        existing_files.append(resolved_file)
                    elif not _raise_exceptions_for_missing_entries:
                        file_counter += 1
                    else:
                        raise OSError(f"Could not locate {filename} inside {path_or_repo_id}.")
    
        # Either all the files were found, or some were _CACHED_NO_EXIST but we do not raise for missing entries
        if file_counter == len(full_filenames):
            return existing_files if len(existing_files) > 0 else None
    
        user_agent = http_user_agent(user_agent)
        # download the files if needed
        try:
            if len(full_filenames) == 1:
                # This is slightly better for only 1 file
>               hf_hub_download(
                    path_or_repo_id,
                    filenames[0],
                    subfolder=None if len(subfolder) == 0 else subfolder,
                    repo_type=repo_type,
                    revision=revision,
                    cache_dir=cache_dir,
                    user_agent=user_agent,
                    force_download=force_download,
                    proxies=proxies,
                    resume_download=resume_download,
                    token=token,
                    local_files_only=local_files_only,
                )

.venv/lib/python3.11.../transformers/utils/hub.py:424: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
.venv/lib/python3.11.../huggingface_hub/utils/_validators.py:114: in _inner_fn
    return fn(*args, **kwargs)
.venv/lib/python3.11...................../site-packages/huggingface_hub/file_download.py:961: in hf_hub_download
    return _hf_hub_download_to_cache_dir(
.venv/lib/python3.11...................../site-packages/huggingface_hub/file_download.py:1068: in _hf_hub_download_to_cache_dir
    _raise_on_head_call_error(head_call_error, force_download, local_files_only)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

head_call_error = HfHubHTTPError('429 Client Error: Too Many Requests for url: https://huggingface..../resolve/main/config.json')
force_download = False, local_files_only = False

    def _raise_on_head_call_error(head_call_error: Exception, force_download: bool, local_files_only: bool) -> NoReturn:
        """Raise an appropriate error when the HEAD call failed and we cannot locate a local file."""
        # No head call => we cannot force download.
        if force_download:
            if local_files_only:
                raise ValueError("Cannot pass 'force_download=True' and 'local_files_only=True' at the same time.")
            elif isinstance(head_call_error, OfflineModeIsEnabled):
                raise ValueError("Cannot pass 'force_download=True' when offline mode is enabled.") from head_call_error
            else:
                raise ValueError("Force download failed due to the above error.") from head_call_error
    
        # No head call + couldn't find an appropriate file on disk => raise an error.
        if local_files_only:
            raise LocalEntryNotFoundError(
                "Cannot find the requested files in the disk cache and outgoing traffic has been disabled. To enable"
                " hf.co look-ups and downloads online, set 'local_files_only' to False."
            )
        elif isinstance(head_call_error, (RepositoryNotFoundError, GatedRepoError)) or (
            isinstance(head_call_error, HfHubHTTPError) and head_call_error.response.status_code == 401
        ):
            # Repo not found or gated => let's raise the actual error
            # Unauthorized => likely a token issue => let's raise the actual error
            raise head_call_error
        else:
            # Otherwise: most likely a connection issue or Hub downtime => let's warn the user
>           raise LocalEntryNotFoundError(
                "An error happened while trying to locate the file on the Hub and we cannot find the requested files"
                " in the local cache. Please check your connection and try again or make sure your Internet connection"
                " is on."
            ) from head_call_error
E           huggingface_hub.errors.LocalEntryNotFoundError: An error happened while trying to locate the file on the Hub and we cannot find the requested files in the local cache. Please check your connection and try again or make sure your Internet connection is on.

.venv/lib/python3.11...................../site-packages/huggingface_hub/file_download.py:1599: LocalEntryNotFoundError

The above exception was the direct cause of the following exception:

cls = <class 'chonkie.embeddings.auto.AutoEmbeddings'>
model = 'minishlab/potion-retrieval-32M', kwargs = {}
embeddings_instance = None, embeddings_cls = None
SentenceTransformerEmbeddings = <class 'chonkie.embeddings.sentence_transformer.SentenceTransformerEmbeddings'>

    @classmethod
    def get_embeddings(cls, model: Union[str, BaseEmbeddings, Any], **kwargs: Any) -> BaseEmbeddings:
        """Get embeddings instance based on identifier.
    
        Args:
            model: Identifier for the embeddings (name, path, URL, etc.)
            **kwargs: Additional arguments passed to the embeddings constructor
    
        Returns:
            Initialized embeddings instance
    
        Raises:
            ValueError: If no suitable embeddings implementation is found
    
        Examples:
            # Get sentence transformers embeddings
            embeddings = AutoEmbeddings.get_embeddings("sentence-transformers/all-MiniLM-L6-v2")
    
            # Get OpenAI embeddings
            embeddings = AutoEmbeddings.get_embeddings("openai://text-embedding-ada-002", api_key="...")
    
            # Get Anthropic embeddings
            embeddings = AutoEmbeddings.get_embeddings("anthropic://claude-v1", api_key="...")
    
            # Get Cohere embeddings
            embeddings = AutoEmbeddings.get_embeddings("cohere://embed-english-light-v3.0", api_key="...")
    
        """
        # Load embeddings instance if already provided
        if isinstance(model, BaseEmbeddings):
            return model
        elif isinstance(model, str):
            # Initializing the embedding instance
            embeddings_instance = None
    
            # Check if the user passed in a provider alias
            if "://" in model:
                provider, model_name = model.split("://")
                embeddings_cls = EmbeddingsRegistry.get_provider(provider)
                if embeddings_cls:
                    try:
                        return embeddings_cls(model_name, **kwargs)  # type: ignore
                    except Exception as error:
                        raise ValueError(f"Failed to load {model} with {embeddings_cls.__name__}, with error: {error}")
                else:
                    raise ValueError(f"No provider found for {provider}. Please check the provider name and try again.")
            else:
                # Try to find matching implementation via registry
                embeddings_cls = EmbeddingsRegistry.match(model)
                if embeddings_cls:
                        try:
                            # Try instantiating with the model identifier
                            embeddings_instance = embeddings_cls(model, **kwargs)  # type: ignore
                        except Exception as error:
                            warnings.warn(
                                f"Failed to load {model} with {embeddings_cls.__name__}: {error}\n"
                                f"Falling back to loading default provider model."
                            )
                            try:
                                # Try instantiating with the default provider model without the model identifier
                                embeddings_instance = embeddings_cls(**kwargs)
                            except Exception as error:
                                warnings.warn(
                                    f"Failed to load the default model for {embeddings_cls.__name__}: {error}\n"
                                    f"Falling back to SentenceTransformerEmbeddings."
                                )
    
            # If registry lookup and instantiation succeeded, return the instance
            if embeddings_instance:
                return embeddings_instance
    
            # If registry lookup and instantiation failed, return the default SentenceTransformerEmbeddings
            from .sentence_transformer import SentenceTransformerEmbeddings
            try:
>               return SentenceTransformerEmbeddings(model, **kwargs)

.../chonkie/embeddings/auto.py:107: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
.../chonkie/embeddings/sentence_transformer.py:49: in __init__
    self.model = SentenceTransformer(self.model_name_or_path, **kwargs)
.venv/lib/python3.11....../site-packages/sentence_transformers/SentenceTransformer.py:321: in __init__
    modules = self._load_auto_model(
.venv/lib/python3.11....../site-packages/sentence_transformers/SentenceTransformer.py:1600: in _load_auto_model
    transformer_model = Transformer(
.venv/lib/python3.11.../sentence_transformers/models/Transformer.py:80: in __init__
    config, is_peft_model = self._load_config(model_name_or_path, cache_dir, backend, config_args)
.venv/lib/python3.11.../sentence_transformers/models/Transformer.py:145: in _load_config
    return AutoConfig.from_pretrained(model_name_or_path, **config_args, cache_dir=cache_dir), False
.venv/lib/python3.11.../models/auto/configuration_auto.py:1112: in from_pretrained
    config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
.venv/lib/python3.11....../site-packages/transformers/configuration_utils.py:590: in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
.venv/lib/python3.11....../site-packages/transformers/configuration_utils.py:649: in _get_config_dict
    resolved_config_file = cached_file(
.venv/lib/python3.11.../transformers/utils/hub.py:266: in cached_file
    file = cached_files(path_or_repo_id=path_or_repo_id, filenames=[filename], **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

path_or_repo_id = 'minishlab/potion-retrieval-32M', filenames = ['config.json']
cache_dir = '....../home/runner/.cache/huggingface/hub', force_download = False
resume_download = None, proxies = None, token = None, revision = None
local_files_only = False, subfolder = '', repo_type = None
user_agent = 'transformers/4.51.0; python/3.11.12; session_id/d10118505c4c48c2af247ede6597ebeb; torch/2.6.0; file_type/config; from_auto_class/True'
_raise_exceptions_for_gated_repo = True
_raise_exceptions_for_missing_entries = True
_raise_exceptions_for_connection_errors = True, _commit_hash = None
deprecated_kwargs = {}, use_auth_token = None, full_filenames = ['config.json']
existing_files = [], filename = 'config.json', file_counter = 0

    def cached_files(
        path_or_repo_id: Union[str, os.PathLike],
        filenames: list[str],
        cache_dir: Optional[Union[str, os.PathLike]] = None,
        force_download: bool = False,
        resume_download: Optional[bool] = None,
        proxies: Optional[dict[str, str]] = None,
        token: Optional[Union[bool, str]] = None,
        revision: Optional[str] = None,
        local_files_only: bool = False,
        subfolder: str = "",
        repo_type: Optional[str] = None,
        user_agent: Optional[Union[str, dict[str, str]]] = None,
        _raise_exceptions_for_gated_repo: bool = True,
        _raise_exceptions_for_missing_entries: bool = True,
        _raise_exceptions_for_connection_errors: bool = True,
        _commit_hash: Optional[str] = None,
        **deprecated_kwargs,
    ) -> Optional[str]:
        """
        Tries to locate several files in a local folder and repo, downloads and cache them if necessary.
    
        Args:
            path_or_repo_id (`str` or `os.PathLike`):
                This can be either:
                - a string, the *model id* of a model repo on huggingface.co.
                - a path to a *directory* potentially containing the file.
            filenames (`List[str]`):
                The name of all the files to locate in `path_or_repo`.
            cache_dir (`str` or `os.PathLike`, *optional*):
                Path to a directory in which a downloaded pretrained model configuration should be cached if the standard
                cache should not be used.
            force_download (`bool`, *optional*, defaults to `False`):
                Whether or not to force to (re-)download the configuration files and override the cached versions if they
                exist.
            resume_download:
                Deprecated and ignored. All downloads are now resumed by default when possible.
                Will be removed in v5 of Transformers.
            proxies (`Dict[str, str]`, *optional*):
                A dictionary of proxy servers to use by protocol or endpoint, e.g., `{'http': 'foo.bar:3128',
                'http://hostname': 'foo.bar:4012'}.` The proxies are used on each request.
            token (`str` or *bool*, *optional*):
                The token to use as HTTP bearer authorization for remote files. If `True`, will use the token generated
                when running `huggingface-cli login` (stored in `~/.huggingface`).
            revision (`str`, *optional*, defaults to `"main"`):
                The specific model version to use. It can be a branch name, a tag name, or a commit id, since we use a
                git-based system for storing models and other artifacts on huggingface.co, so `revision` can be any
                identifier allowed by git.
            local_files_only (`bool`, *optional*, defaults to `False`):
                If `True`, will only try to load the tokenizer configuration from local files.
            subfolder (`str`, *optional*, defaults to `""`):
                In case the relevant files are located inside a subfolder of the model repo on huggingface.co, you can
                specify the folder name here.
            repo_type (`str`, *optional*):
                Specify the repo type (useful when downloading from a space for instance).
    
        Private args:
            _raise_exceptions_for_gated_repo (`bool`):
                if False, do not raise an exception for gated repo error but return None.
            _raise_exceptions_for_missing_entries (`bool`):
                if False, do not raise an exception for missing entries but return None.
            _raise_exceptions_for_connection_errors (`bool`):
                if False, do not raise an exception for connection errors but return None.
            _commit_hash (`str`, *optional*):
                passed when we are chaining several calls to various files (e.g. when loading a tokenizer or
                a pipeline). If files are cached for this commit hash, avoid calls to head and get from the cache.
    
        <Tip>
    
        Passing `token=True` is required when you want to use a private model.
    
        </Tip>
    
        Returns:
            `Optional[str]`: Returns the resolved file (to the cache folder if downloaded from a repo).
    
        Examples:
    
        ```python
        # Download a model weight from the Hub and cache it.
        model_weights_file = cached_file("google-bert/bert-base-uncased", "pytorch_model.bin")
        ```
        """
        use_auth_token = deprecated_kwargs.pop("use_auth_token", None)
        if use_auth_token is not None:
            warnings.warn(
                "The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.",
                FutureWarning,
            )
            if token is not None:
                raise ValueError("`token` and `use_auth_token` are both specified. Please set only the argument `token`.")
            token = use_auth_token
    
        if is_offline_mode() and not local_files_only:
            logger.info("Offline mode: forcing local_files_only=True")
            local_files_only = True
        if subfolder is None:
            subfolder = ""
    
        # Add folder to filenames
        full_filenames = [os.path.join(subfolder, file) for file in filenames]
    
        path_or_repo_id = str(path_or_repo_id)
        existing_files = []
        for filename in full_filenames:
            if os.path.isdir(path_or_repo_id):
                resolved_file = os.path.join(path_or_repo_id, filename)
                if not os.path.isfile(resolved_file):
                    if _raise_exceptions_for_missing_entries and filename != os.path.join(subfolder, "config.json"):
                        revision_ = "main" if revision is None else revision
                        raise OSError(
                            f"{path_or_repo_id} does not appear to have a file named {filename}. Checkout "
                            f"'https://huggingface.co/{path_or_repo_id}/tree/{revision_}' for available files."
                        )
                    else:
                        return None
                existing_files.append(resolved_file)
    
        # All files exist
        if len(existing_files) == len(full_filenames):
            return existing_files
    
        if cache_dir is None:
            cache_dir = TRANSFORMERS_CACHE
        if isinstance(cache_dir, Path):
            cache_dir = str(cache_dir)
    
        existing_files = []
        file_counter = 0
        if _commit_hash is not None and not force_download:
            for filename in full_filenames:
                # If the file is cached under that commit hash, we return it directly.
                resolved_file = try_to_load_from_cache(
                    path_or_repo_id, filename, cache_dir=cache_dir, revision=_commit_hash, repo_type=repo_type
                )
                if resolved_file is not None:
                    if resolved_file is not _CACHED_NO_EXIST:
                        file_counter += 1
                        existing_files.append(resolved_file)
                    elif not _raise_exceptions_for_missing_entries:
                        file_counter += 1
                    else:
                        raise OSError(f"Could not locate {filename} inside {path_or_repo_id}.")
    
        # Either all the files were found, or some were _CACHED_NO_EXIST but we do not raise for missing entries
        if file_counter == len(full_filenames):
            return existing_files if len(existing_files) > 0 else None
    
        user_agent = http_user_agent(user_agent)
        # download the files if needed
        try:
            if len(full_filenames) == 1:
                # This is slightly better for only 1 file
                hf_hub_download(
                    path_or_repo_id,
                    filenames[0],
                    subfolder=None if len(subfolder) == 0 else subfolder,
                    repo_type=repo_type,
                    revision=revision,
                    cache_dir=cache_dir,
                    user_agent=user_agent,
                    force_download=force_download,
                    proxies=proxies,
                    resume_download=resume_download,
                    token=token,
                    local_files_only=local_files_only,
                )
            else:
                snapshot_download(
                    path_or_repo_id,
                    allow_patterns=full_filenames,
                    repo_type=repo_type,
                    revision=revision,
                    cache_dir=cache_dir,
                    user_agent=user_agent,
                    force_download=force_download,
                    proxies=proxies,
                    resume_download=resume_download,
                    token=token,
                    local_files_only=local_files_only,
                )
    
        except Exception as e:
            # We cannot recover from them
            if isinstance(e, RepositoryNotFoundError) and not isinstance(e, GatedRepoError):
                raise OSError(
                    f"{path_or_repo_id} is not a local folder and is not a valid model identifier "
                    "listed on 'https://huggingface.co/models'\nIf this is a private repository, make sure to pass a token "
                    "having permission to this repo either by logging in with `huggingface-cli login` or by passing "
                    "`token=<your_token>`"
                ) from e
            elif isinstance(e, RevisionNotFoundError):
                raise OSError(
                    f"{revision} is not a valid git identifier (branch name, tag name or commit id) that exists "
                    "for this model name. Check the model page at "
                    f"'https://huggingface.co/{path_or_repo_id}' for available revisions."
                ) from e
    
            # Now we try to recover if we can find all files correctly in the cache
            resolved_files = [
                _get_cache_file_to_return(path_or_repo_id, filename, cache_dir, revision) for filename in full_filenames
            ]
            if all(file is not None for file in resolved_files):
                return resolved_files
    
            # Raise based on the flags. Note that we will raise for missing entries at the very end, even when
            # not entering this Except block, as it may also happen when `snapshot_download` does not raise
            if isinstance(e, GatedRepoError):
                if not _raise_exceptions_for_gated_repo:
                    return None
                raise OSError(
                    "You are trying to access a gated repo.\nMake sure to have access to it at "
                    f"https://huggingface.co/{path_or_repo_id}.\n{str(e)}"
                ) from e
            elif isinstance(e, LocalEntryNotFoundError):
                if not _raise_exceptions_for_connection_errors:
                    return None
                # Here we only raise if both flags for missing entry and connection errors are True (because it can be raised
                # even when `local_files_only` is True, in which case raising for connections errors only would not make sense)
                elif _raise_exceptions_for_missing_entries:
>                   raise OSError(
                        f"We couldn't connect to '{HUGGINGFACE_CO_RESOLVE_ENDPOINT}' to load the files, and couldn't find them in the"
                        f" cached files.\nCheckout your internet connection or see how to run the library in offline mode at"
                        " 'https://huggingface..../docs/transformers/installation#offline-mode'."
                    ) from e
E                   OSError: We couldn't connect to 'https://huggingface.co' to load the files, and couldn't find them in the cached files.
E                   Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface..../docs/transformers/installation#offline-mode'.

.venv/lib/python3.11.../transformers/utils/hub.py:491: OSError

During handling of the above exception, another exception occurred:

    @pytest.fixture(scope="module")
    def real_embeddings() -> BaseEmbeddings:
        """Provide an instance of the actual default embedding model."""
        # Use scope="module" to load the model only once per test module run
        # Set environment variable to potentially avoid Hugging Face Hub login prompts in some CI environments
        os.environ["HF_HUB_DISABLE_PROGRESS_BARS"] = "1"
>       return AutoEmbeddings.get_embeddings(DEFAULT_EMBEDDING_MODEL)

tests/handshakes/test_qdrant_handshake.py:61: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

cls = <class 'chonkie.embeddings.auto.AutoEmbeddings'>
model = 'minishlab/potion-retrieval-32M', kwargs = {}
embeddings_instance = None, embeddings_cls = None
SentenceTransformerEmbeddings = <class 'chonkie.embeddings.sentence_transformer.SentenceTransformerEmbeddings'>

    @classmethod
    def get_embeddings(cls, model: Union[str, BaseEmbeddings, Any], **kwargs: Any) -> BaseEmbeddings:
        """Get embeddings instance based on identifier.
    
        Args:
            model: Identifier for the embeddings (name, path, URL, etc.)
            **kwargs: Additional arguments passed to the embeddings constructor
    
        Returns:
            Initialized embeddings instance
    
        Raises:
            ValueError: If no suitable embeddings implementation is found
    
        Examples:
            # Get sentence transformers embeddings
            embeddings = AutoEmbeddings.get_embeddings("sentence-transformers/all-MiniLM-L6-v2")
    
            # Get OpenAI embeddings
            embeddings = AutoEmbeddings.get_embeddings("openai://text-embedding-ada-002", api_key="...")
    
            # Get Anthropic embeddings
            embeddings = AutoEmbeddings.get_embeddings("anthropic://claude-v1", api_key="...")
    
            # Get Cohere embeddings
            embeddings = AutoEmbeddings.get_embeddings("cohere://embed-english-light-v3.0", api_key="...")
    
        """
        # Load embeddings instance if already provided
        if isinstance(model, BaseEmbeddings):
            return model
        elif isinstance(model, str):
            # Initializing the embedding instance
            embeddings_instance = None
    
            # Check if the user passed in a provider alias
            if "://" in model:
                provider, model_name = model.split("://")
                embeddings_cls = EmbeddingsRegistry.get_provider(provider)
                if embeddings_cls:
                    try:
                        return embeddings_cls(model_name, **kwargs)  # type: ignore
                    except Exception as error:
                        raise ValueError(f"Failed to load {model} with {embeddings_cls.__name__}, with error: {error}")
                else:
                    raise ValueError(f"No provider found for {provider}. Please check the provider name and try again.")
            else:
                # Try to find matching implementation via registry
                embeddings_cls = EmbeddingsRegistry.match(model)
                if embeddings_cls:
                        try:
                            # Try instantiating with the model identifier
                            embeddings_instance = embeddings_cls(model, **kwargs)  # type: ignore
                        except Exception as error:
                            warnings.warn(
                                f"Failed to load {model} with {embeddings_cls.__name__}: {error}\n"
                                f"Falling back to loading default provider model."
                            )
                            try:
                                # Try instantiating with the default provider model without the model identifier
                                embeddings_instance = embeddings_cls(**kwargs)
                            except Exception as error:
                                warnings.warn(
                                    f"Failed to load the default model for {embeddings_cls.__name__}: {error}\n"
                                    f"Falling back to SentenceTransformerEmbeddings."
                                )
    
            # If registry lookup and instantiation succeeded, return the instance
            if embeddings_instance:
                return embeddings_instance
    
            # If registry lookup and instantiation failed, return the default SentenceTransformerEmbeddings
            from .sentence_transformer import SentenceTransformerEmbeddings
            try:
                return SentenceTransformerEmbeddings(model, **kwargs)
            except Exception as e:
>               raise ValueError(f"Failed to load embeddings via SentenceTransformerEmbeddings after registry/fallback failure: {e}")
E               ValueError: Failed to load embeddings via SentenceTransformerEmbeddings after registry/fallback failure: We couldn't connect to 'https://huggingface.co' to load the files, and couldn't find them in the cached files.
E               Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface..../docs/transformers/installation#offline-mode'.

.../chonkie/embeddings/auto.py:109: ValueError


greptile-apps

PR Summary

This PR introduces significant performance optimizations through Cython extensions for text chunking operations, along with support for Google's Gemini embedding models.

Added Cython extensions split.pyx and merge.pyx for optimized text chunking operations, claiming 48-50% performance improvement
Implemented GeminiEmbeddings class with comprehensive retry logic, token counting, and support for the latest Gemini embedding models
Added LRU caching (maxsize=8192) in OverlapRefinery for tokenization operations to improve performance
Introduced provider-based syntax for loading embeddings (e.g., gemini://) in AutoEmbeddings with improved error handling
Comprehensive test coverage added for new features including mocked API responses and real integration tests
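The LRU caching mentioned in the summary can be illustrated with a minimal sketch. The `count_tokens` function below is a hypothetical stand-in, not the actual OverlapRefinery code; only the `maxsize=8192` value comes from the summary.

```python
from functools import lru_cache


@lru_cache(maxsize=8192)
def count_tokens(text: str) -> int:
    """Hypothetical stand-in for a tokenizer call: counts whitespace-separated words."""
    return len(text.split())


# The first call computes the result; the second identical call is served from the cache.
first = count_tokens("the quick brown fox")
second = count_tokens("the quick brown fox")

# cache_info() reports hits, misses, maxsize and the current cache size.
info = count_tokens.cache_info()
print(first, second, info)
```

Caching of this kind only helps when the same strings are tokenized repeatedly, and it requires hashable arguments such as `str`.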

81 file(s) reviewed, 4 comment(s)

greptile-apps · 2025-05-25T22:46:12Z

.gitignore

+notebooks/*
+src/chonkie/chunker/c_extensions/token_chunker.cpython-310-darwin.so
+CLAUDE.md
+.temp/*


style: Duplicate entries for .temp/* - remove one of them to avoid confusion

Suggested change

-/.temp/*
-notebooks/*
-src/chonkie/chunker/c_extensions/token_chunker.cpython-310-darwin.so
-CLAUDE.md
-.temp/*
+/.temp/*
+notebooks/*
+src/chonkie/chunker/c_extensions/token_chunker.cpython-310-darwin.so
+CLAUDE.md

greptile-apps · 2025-05-25T22:46:13Z

.gitignore

+/.temp/*
+notebooks/*
+src/chonkie/chunker/c_extensions/token_chunker.cpython-310-darwin.so


style: This platform-specific pattern should be replaced with a more generic `*.cpython-*.so`
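As a rough check of what a generic pattern would cover, the sketch below assumes the intended pattern is `*.cpython-*.so` (the asterisks appear to have been lost in rendering) and uses Python's `fnmatch` as an approximation of gitignore globbing. The two do not behave identically in every case, for example around `/`, so this is only indicative. The second and third file names are hypothetical.

```python
from fnmatch import fnmatchcase

# Assumed generic pattern for compiled CPython extension modules.
pattern = "*.cpython-*.so"

names = [
    "token_chunker.cpython-310-darwin.so",    # the file currently ignored by name
    "split.cpython-311-x86_64-linux-gnu.so",  # hypothetical build for another platform
    "split.pyx",                              # Cython source, should stay tracked
]

# fnmatchcase is case-sensitive on every platform, unlike fnmatch.
matches = {name: fnmatchcase(name, pattern) for name in names}
for name, matched in matches.items():
    print(f"{name}: {'ignored' if matched else 'tracked'}")
```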

greptile-apps · 2025-05-25T22:46:47Z

pyproject.toml

@@ -1,5 +1,5 @@
 [build-system]
-requires = ["setuptools>=45", "wheel"]
+requires = ["setuptools>=45", "wheel", "cython>=3.0.0"]


style: Consider pinning Cython to a specific version range (e.g. cython>=3.0.0,<4.0.0) to prevent future compatibility issues

greptile-apps · 2025-05-25T22:47:25Z

src/chonkie/chunker/c_extensions/split.pyx

+    if delim is None:
+        if whitespace_mode:
+            # Split on whitespace - for word-level splitting
+            splits = text.split(" ")  # Split on spaces specifically, not all whitespace


logic: splitting on space character only may miss other whitespace characters like tabs and newlines. Consider using str.split() without arguments for all whitespace
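The difference between the two calls can be seen in a small plain-Python sketch. The sample string is made up for illustration and is unrelated to the Cython implementation.

```python
# Sample text containing a tab, a newline and a double space.
text = "alpha beta\tgamma\ndelta  epsilon"

# split(" ") breaks only at single space characters: tabs and newlines stay
# inside tokens, and consecutive spaces produce empty strings.
space_only = text.split(" ")

# split() with no argument breaks on any run of whitespace and drops empty strings.
any_whitespace = text.split()

print(space_only)
print(any_whitespace)

# One trade-off: the space-only split can be joined back into the exact original text.
round_trip = " ".join(space_only) == text
print(round_trip)
```

The in-code comment in the diff ("Split on spaces specifically, not all whitespace") suggests the space-only behaviour may be deliberate. A chunker that must reconstruct the original text exactly cannot do so from `str.split()` output alone, since the original whitespace is discarded.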

chonknick and others added 30 commits May 17, 2025 00:36

Add compiled .so and .c autogen files

9c848aa

Fix pyproject.toml issues + add a token_chunker.pyx file for checking…

3473565

… if we can speed it up!

Update DOCS for OpenAIGenie — as an experiment

52cc725

Remove deprecated test file for Cython token chunker

2d4c549

- Deleted `test_cython_token_chunker.py`, which contained tests for the Cython token chunking functionality that is no longer in use.
- This cleanup helps streamline the codebase by removing obsolete tests.

Refactor embedding registration and loading logic in AutoEmbeddings a…

a7ec5bc

…nd EmbeddingsRegistry. Introduced provider alias support, improved error handling, and streamlined model registration methods for better clarity and maintainability.

Add RAGHub

c29f307

Update src/chonkie/embeddings/registry.py

99a5ff7

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Initialize embedding instance in AutoEmbeddings class and add fallbac…

bcd3a49

…k for default SentenceTransformerEmbeddings in case of registry lookup failure.

Refactor embedding registration in EmbeddingsRegistry to use register…

1ef5f3c

…_model for jina-embeddings-v2 types, enhancing consistency in model registration.

Refactor variable naming in EmbeddingsRegistry to improve clarity, ch…

30dd5ee

…anging 'type' to 'type_alias' for better readability in the embedding registration process.

Fix contributing.md

b6ed0a6

Fix: Improve error message on embedding model loading failure

7df8d08

Fix: Add type ignore comment for embeddings class instantiation to su…

85d32fd

…ppress type checking errors

Fix: Update string representation in Model2VecEmbeddings for improved…

10015e6

… clarity

Enhance chunker module by adding NeuralChunker and SlumberChunker to …

6ccba73

…the imports and __all__ list for improved functionality.

Enhance error handling in NeuralChunker by adding exception handling …

3adeb7b

…for API connectivity issues and improving error messages for better user guidance. Ensure clarity in API key requirements and response handling.

Remove API_KEY from class variables in NeuralChunker, RecursiveChunke…

f53756a

…r, SentenceChunker, and TokenChunker for improved security and consistency across chunker implementations.

Remove API_KEY from SemanticChunker class variable for improved secur…

d9ec7ba

…ity and consistency across chunker implementations.

Update error messages in NeuralChunker to provide clearer guidance on…

43d8356

… API connectivity issues and invalid responses, enhancing user experience and support contact information.

Update version to 1.0.8 in pyproject.toml and __init__.py for release.

b4c0961

Add comprehensive tests for OverlapRefinery to improve test coverage

ae335af

Fix pickling issue in BaseTokenizer by using a named method instead o…

ecafc20

…f lambda

chonknick and others added 24 commits May 25, 2025 15:30

Update tests/cloud/test_cloud_code_chunker.py

d71d75d

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Refactor test functions in test_json_porter.py to include return type…

320b0e4

… annotations
- Added return type annotations to all test functions for improved clarity and type checking.
- Updated the temp_dir fixture to specify its return type as a Generator.

Disable multiprocessing in CodeChunker for improved performance and s…

9b972a4

…tability

Update .gitignore to exclude temporary files

838b7a1

- Added .temp/* to the .gitignore file to prevent temporary files from being tracked in the repository.

Enhance type hints in embedding test files

22c64de

- Updated type hints in `mock_tokenizer` and `mock_process_batch` functions to improve code clarity and type checking.
- Ensured consistent use of type annotations for better maintainability in test cases.

Add GeminiEmbeddings support

337f1a1

- Introduced GeminiEmbeddings to the embeddings module.
- Updated import statements and __all__ exports to include GeminiEmbeddings.
- Registered GeminiEmbeddings in the EmbeddingsRegistry with associated patterns and models for enhanced functionality.

Add Gemini Embeddings tutorial to README

928d83a

- Included a new section in the README for using Google's Gemini embedding models with Chonkie's RecursiveChunker.
- Provided a link to a tutorial demonstrating high-quality text embeddings and similarity analysis with Gemini embeddings.

Remove unused timeout parameter from GeminiEmbeddings initialization

9fb2e4b

Fix: remove unused parameters

7f96d34

Enhance type hints in cloud code chunker tests

5037f11

- Updated type hints for mock API response and test functions to improve code clarity and type checking.
- Ensured consistent use of type annotations across all test functions for better maintainability.

Enhance OverlapRefinery performance by implementing LRU caching for t…

c66deab

…okenization and token count operations. Added methods to manage cache, including cache_info and clear_cache, to optimize repeated processing of similar text. Updated docstrings for clarity on caching behavior.

Merge branch 'development' into test-support-for-cython

93e34a6

gemini-code-assist bot reviewed May 25, 2025

gemini-code-assist bot suggested changes May 25, 2025

greptile-apps bot reviewed May 25, 2025

chonknick merged commit c006bee into development May 25, 2025
0 of 5 checks passed

chonknick deleted the test-support-for-cython branch May 26, 2025 00:01

❌ 36 Tests Failed:
