Feat: Add cloud `CodeChunker` + tests #157

chonknick · 2025-05-23T01:00:37Z

This pull request introduces the CodeChunker class to the Chonkie Cloud API for chunking code, along with corresponding updates to the module's initialization files and a comprehensive set of tests. The most important changes include the implementation of the CodeChunker class, updates to module imports and exports, and the addition of extensive unit tests to ensure functionality.

New Feature: CodeChunker Implementation

Added the CodeChunker class in src/chonkie/cloud/chunker/code.py, which provides functionality for chunking code using the Chonkie API. It supports configurable chunk sizes, tokenizers, languages, and return types. The class validates input parameters, handles API requests, and processes responses.

Updates to Module Initialization

Updated src/chonkie/cloud/__init__.py and src/chonkie/cloud/chunker/__init__.py to include the CodeChunker class in imports and exports, ensuring it is accessible as part of the module's public API.

Unit Tests for CodeChunker

Added a test suite in tests/cloud/test_cloud_code_chunker.py to validate the functionality of the CodeChunker class. Tests cover initialization, parameter validation, chunking behavior for various programming languages, batch processing, adherence to chunk size limits, and continuity of chunk indices. Additionally, tests ensure compatibility with different tokenizers and handle edge cases like empty or whitespace-only inputs.

- Introduced CodeChunker to the cloud chunker module.
- Updated __init__.py files to include CodeChunker in the exports for both the cloud and chunker modules.

gemini-code-assist

Hello @chonknick, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello! I'm gemini-code-assist, here to provide a summary of this pull request. This PR introduces a new CodeChunker class to the Chonkie Cloud API client library. The primary goal is to provide a dedicated chunker specifically designed for processing code, leveraging the Chonkie API's capabilities for code-aware splitting. The changes involve adding the new class implementation, making it accessible through the module's public API, and adding a comprehensive suite of tests to ensure its functionality and robustness.

Highlights

New Feature: CodeChunker: Adds the CodeChunker class (src/chonkie/cloud/chunker/code.py) which interacts with the Chonkie Cloud API to perform code-specific chunking. It supports configuration for tokenizer, chunk size, language (including auto-detection), and return type ('texts' or 'chunks'). It includes validation for input parameters and handles API key management (via constructor or environment variable).
Module Exports Updated: Updates src/chonkie/cloud/__init__.py and src/chonkie/cloud/chunker/__init__.py to import and export the new CodeChunker class, making it part of the public API for the chonkie.cloud and chonkie.cloud.chunker modules.
Comprehensive Test Suite: Adds a new test file (tests/cloud/test_cloud_code_chunker.py) with extensive unit tests for the CodeChunker. Tests cover initialization validation, chunking for simple and complex code (Python, JavaScript), auto language detection, batch processing, handling empty/whitespace input, adherence to chunk size limits, continuity of chunk indices, and compatibility with different tokenizers. Tests are marked to be skipped if the CHONKIE_API_KEY environment variable is not set.
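
The parameter validation and API key handling described in the highlights could look roughly like the sketch below. This is not the PR's code: `resolve_config` is a hypothetical helper, and the rule that `chunk_size` must be positive is an assumption. Only the `CHONKIE_API_KEY` variable name and the 'texts' / 'chunks' return types come from the summary above.

```python
import os
from typing import Optional

# The two return types named in the summary above.
VALID_RETURN_TYPES = ("texts", "chunks")


def resolve_config(
    chunk_size: int,
    return_type: str,
    api_key: Optional[str] = None,
) -> dict:
    """Validate settings and resolve the API key (hypothetical helper)."""
    # Prefer the explicit argument, then fall back to the environment variable.
    key = api_key or os.getenv("CHONKIE_API_KEY")
    if not key:
        raise ValueError("No API key: pass api_key or set CHONKIE_API_KEY.")
    # Assumed rule: chunk sizes must be positive.
    if chunk_size <= 0:
        raise ValueError("chunk_size must be greater than 0.")
    if return_type not in VALID_RETURN_TYPES:
        raise ValueError(f"return_type must be one of {VALID_RETURN_TYPES}.")
    return {"api_key": key, "chunk_size": chunk_size, "return_type": return_type}
```

Keeping these checks free of network calls means they can be unit-tested without an API key being present in the environment.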

Changelog

src/chonkie/cloud/__init__.py
- Added CodeChunker to the import list from .chunker (line 5).
- Added CodeChunker to the __all__ export list (line 24).
src/chonkie/cloud/chunker/__init__.py
- Added import for CodeChunker from .code (line 4).
- Added CodeChunker to the __all__ export list (line 22).
src/chonkie/cloud/chunker/code.py
- Added the CodeChunker class definition.
- Implemented the __init__ method to handle API key, validate chunk_size and return_type, set instance attributes, and check API reachability.
- Implemented the chunk method to construct the API request payload, make a POST request to the /v1/chunk/code endpoint, handle API errors, and parse the JSON response.
- Implemented the __call__ method as an alias for the chunk method.
tests/cloud/test_cloud_code_chunker.py
- Added a new test file for CodeChunker.
- Added python_code and js_code pytest fixtures.
- Added test_cloud_code_chunker_initialization to test constructor validation.
- Added test_cloud_code_chunker_simple for basic chunking.
- Added test_cloud_code_chunker_python_complex for more complex Python code, including reconstruction check.
- Added test_cloud_code_chunker_javascript for JavaScript code, including reconstruction check.
- Added test_cloud_code_chunker_auto_language to test auto-detection.
- Added test_cloud_code_chunker_no_nodes_support to confirm node output is not expected (due to API).
- Added test_cloud_code_chunker_batch to test processing a list of texts.
- Added test_cloud_code_chunker_return_type_texts to test the 'texts' output format.
- Added test_cloud_code_chunker_empty_text and test_cloud_code_chunker_whitespace_text for edge cases.
- Added test_cloud_code_chunker_chunk_size_adherence to check token count limits.
- Added test_cloud_code_chunker_indices_continuity to verify start/end indices.
- Added test_cloud_code_chunker_different_tokenizers to test tokenizer options.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page. Here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	`@gemini-code-assist`	Responds in comments when explicitly tagged, both in issue comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Code flows like a stream,
Chunked neatly, a programmer's dream.
Tests confirm the split,
Ensuring it is fit,
For models, a useful scheme.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

greptile-apps

PR Summary

Added a new CodeChunker class to the Chonkie Cloud API for intelligent code chunking with API integration and extensive test coverage.

Implemented src/chonkie/cloud/chunker/code.py with API key validation, error handling, and support for multiple languages and tokenizers
Added comprehensive test suite in tests/cloud/test_cloud_code_chunker.py covering edge cases, batch processing, and chunk size adherence
Included API health check in constructor to provide immediate feedback if service is unavailable
Added proper type hints and docstrings throughout the implementation
Implemented chunk reconstruction validation to ensure lossless code splitting

4 file(s) reviewed, 4 comment(s)

greptile-apps · 2025-05-23T01:00:57Z

src/chonkie/cloud/chunker/__init__.py

@@ -18,6 +19,7 @@
    "SentenceChunker",
    "LateChunker",
    "SDPMChunker",
+    "CodeChunker",


style: Consider moving 'CodeChunker' between 'CloudChunker' and 'LateChunker' to maintain alphabetical ordering in __all__.

greptile-apps · 2025-05-23T01:01:03Z

src/chonkie/cloud/chunker/code.py

+        # Check if the API is up right now
+        response = requests.get(f"{self.BASE_URL}/")
+        if response.status_code != 200:
+            raise ValueError(
+                "Oh no! You caught Chonkie at a bad time. It seems to be down right now."
+                + " Please try again in a short while."
+                + " If the issue persists, please contact support at support@chonkie.ai or raise an issue on GitHub."
+            )


style: API health check in constructor could cause initialization failures in offline environments or when API is temporarily down. Consider moving this to a separate method that can be called explicitly.

src/chonkie/cloud/chunker/code.py

greptile-apps · 2025-05-23T01:01:05Z

tests/cloud/test_cloud_code_chunker.py

+    result = code_chunker("   \n\t\n  ")
+    # Should return empty list or minimal chunks
+    assert isinstance(result, list)


logic: Missing assertion for expected behavior with whitespace-only input. Should explicitly assert empty list or specific chunk count.

Suggested change

-    result = code_chunker("   \n\t\n  ")
-    # Should return empty list or minimal chunks
-    assert isinstance(result, list)
+    result = code_chunker("   \n\t\n  ")
+    # Whitespace-only input should return either an empty list or a single chunk
+    assert isinstance(result, list)
+    assert len(result) <= 1, "Whitespace-only input should produce at most one chunk"

gemini-code-assist

Code Review

This pull request introduces the CodeChunker for the Chonkie Cloud API, along with comprehensive unit tests. The implementation is generally solid, and the test coverage is good, addressing various languages, batching, and edge cases.

I have a few suggestions focusing on improving type safety for batch operations, the behavior of the API health check in the constructor, exception handling, and test assertion strength. Addressing the high-severity type mismatch issue is particularly important before merging.

Well done on the detailed test suite!

Summary of Findings

Return Type Mismatch for Batch Processing: The chunk method in CodeChunker has a return type annotation List[Dict], but for batch input (List[str]), it actually returns List[List[Dict]]. This mismatch affects type safety and clarity. (File: src/chonkie/cloud/chunker/code.py, Lines: 67, 105)
API Health Check in Constructor: Performing an API health check within the __init__ method of CodeChunker can lead to performance issues during instantiation and make the class unusable if the API is temporarily down. (File: src/chonkie/cloud/chunker/code.py, Lines: 59-65)
Generic Exception Handling: The chunk method catches a broad Exception when parsing the API response, which can hide specific errors. It's recommended to catch more specific exceptions like requests.exceptions.JSONDecodeError. (File: src/chonkie/cloud/chunker/code.py, Line: 106)
Ambiguous Type Hint for tokenizer_or_token_counter: The type hint Union[str, List] for tokenizer_or_token_counter in CodeChunker is unclear regarding how a List would be used or if the API supports it for this parameter. (File: src/chonkie/cloud/chunker/code.py, Line: 19)
Weak Assertion in Whitespace Text Test: The test test_cloud_code_chunker_whitespace_text has a weak assertion (isinstance(result, list)). It should be strengthened to verify the expected output, likely an empty list. (File: tests/cloud/test_cloud_code_chunker.py, Lines: 291-293)
Missing API Error Handling Tests: The test suite for CodeChunker lacks tests that mock API failure scenarios (e.g., 4xx/5xx errors) to ensure proper exception handling by the client. This was not commented on directly due to review settings. (File: tests/cloud/test_cloud_code_chunker.py)
Minor PEP 8/Stylistic Issues: Minor stylistic issues like missing newlines at the end of files and use of string concatenation instead of f-strings were observed. These were not commented on directly due to review settings. (Files: src/chonkie/cloud/chunker/code.py, tests/cloud/test_cloud_code_chunker.py)
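
A test for the missing API error handling could be sketched along these lines, using `unittest.mock` so no network access or API key is needed. `chunk_via_api` is a hypothetical stand-in for the client's request logic, and the error message and host name are assumptions; the real `CodeChunker` may raise a different exception or use a different message.

```python
from unittest.mock import Mock


def chunk_via_api(post, url: str, payload: dict) -> list:
    """Hypothetical stand-in for a client call; `post` mimics requests.post."""
    response = post(url, json=payload)
    if response.status_code != 200:
        raise ValueError(f"API request failed with status {response.status_code}")
    return response.json()


def test_server_error_raises_value_error() -> None:
    # Simulate a 500 response without touching the network.
    fake_post = Mock(return_value=Mock(status_code=500))
    try:
        chunk_via_api(fake_post, "https://example.invalid/v1/chunk/code", {"text": "x"})
    except ValueError as error:
        assert "500" in str(error)
    else:
        raise AssertionError("expected ValueError for a 500 response")
    # The client should have attempted exactly one request.
    fake_post.assert_called_once()
```

The same pattern extends to 4xx responses by changing the mocked `status_code`.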

Merge Readiness

The pull request is a significant contribution, but there is a high-severity issue regarding the return type of the chunk method when handling batch inputs that should be addressed before merging. Additionally, considering the medium-severity feedback would further improve the robustness and usability of the CodeChunker.

Once the high-severity issue is resolved and other feedback points are considered, this PR will be in a much stronger position for merging. As a reviewer, I am not authorized to approve pull requests; please ensure further review and approval from authorized maintainers.

gemini-code-assist · 2025-05-23T01:04:00Z

src/chonkie/cloud/chunker/code.py

+                + " If the issue persists, please contact support at support@chonkie.ai or raise an issue on GitHub."
+            )
+
+    def chunk(self, text: Union[str, List[str]]) -> List[Dict]:


The current return type annotation -> List[Dict] appears to be incorrect when the input text parameter is a List[str] (for batch processing).

Your tests (specifically test_cloud_code_chunker_batch) correctly assert that for a list of input texts, the result is a list of lists of chunks (List[List[Dict]]). However, this method's signature and the cast on line 105 (result: List[Dict] = cast(List[Dict], response.json())) do not reflect this batch behavior.

This discrepancy can lead to type errors and confusion for users of the CodeChunker.

To address this, you could:

Change the return type annotation to Union[List[Dict], List[List[Dict]]].

Adjust the logic around line 105 to correctly cast or type response.json() based on whether the input text was a single string or a list of strings.

For a more type-safe API, consider using typing.overload to define distinct signatures for single string input and list-of-strings input.

At a minimum, the return type annotation should be updated to reflect the possible List[List[Dict]] structure.

Suggested change

-    def chunk(self, text: Union[str, List[str]]) -> List[Dict]:
+    def chunk(self, text: Union[str, List[str]]) -> Union[List[Dict], List[List[Dict]]]:
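
The `typing.overload` option mentioned in this comment could be sketched as follows. `CodeChunkerSketch` is a hypothetical stand-in that shows only the typing pattern; its body returns placeholder data instead of calling the API.

```python
from typing import Dict, List, Union, overload


class CodeChunkerSketch:
    """Hypothetical stand-in showing only the typing pattern, not the real client."""

    @overload
    def chunk(self, text: str) -> List[Dict]: ...

    @overload
    def chunk(self, text: List[str]) -> List[List[Dict]]: ...

    def chunk(self, text: Union[str, List[str]]) -> Union[List[Dict], List[List[Dict]]]:
        # Placeholder logic: one fake chunk per input instead of an API call.
        if isinstance(text, str):
            return [{"text": text}]
        return [[{"text": item}] for item in text]
```

With the overloads in place, a type checker can infer `List[Dict]` for a single string and `List[List[Dict]]` for a batch, so callers do not need to narrow a `Union` themselves.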

gemini-code-assist · 2025-05-23T01:04:00Z

src/chonkie/cloud/chunker/code.py

+
+    def __init__(
+        self,
+        tokenizer_or_token_counter: str = "gpt2",


The type hint Union[str, List] for tokenizer_or_token_counter is a bit ambiguous in the context of this cloud client.

If List is intended to allow, for example, a list of pre-computed token IDs (List[int]) or a list of tokenizer names for some advanced API feature, the payload ("tokenizer_or_token_counter": self.tokenizer_or_token_counter) would send this list. Does the Chonkie API endpoint /v1/chunk/code actually support receiving a list for this parameter? The current tests only demonstrate usage with string values (e.g., "gpt2").

If the API endpoint for this cloud chunker only accepts a string for tokenizer_or_token_counter, then including List in the Union might be misleading for users of this specific class.

Could you clarify the intended use and API support for List here? If it's only ever a string for this cloud chunker, str might be a more precise type hint. If List is supported, adding a test case for it would be beneficial.

gemini-code-assist · 2025-05-23T01:04:00Z

src/chonkie/cloud/chunker/code.py

+        response = requests.get(f"{self.BASE_URL}/")
+        if response.status_code != 200:
+            raise ValueError(
+                "Oh no! You caught Chonkie at a bad time. It seems to be down right now."
+                + " Please try again in a short while."
+                + " If the issue persists, please contact support at support@chonkie.ai or raise an issue on GitHub."
+            )


The API health check (requests.get(f"{self.BASE_URL}/")) performed in the __init__ method could introduce some potential issues:

Performance Impact: It adds network latency each time a CodeChunker object is instantiated.

Usability Concern: If the API's base endpoint (/) is temporarily unresponsive, CodeChunker objects cannot be created. This could be problematic even if the user only intends to configure the object without immediately calling chunk().

Resource Usage: Frequent instantiation could lead to unnecessary network traffic to the health check endpoint.

Have you considered alternatives, such as:

Performing this check lazily, just before the first actual API call in the chunk() method?

Making it an optional, explicit health check method that users can invoke if they need to verify connectivity?

This change could improve instantiation performance and make the class more resilient to transient network or API issues.
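
Both alternatives, a lazy check and an explicit one, can be combined in a small sketch. `LazyHealthCheck` and its method names are hypothetical, and the HTTP call is injected as a `get` callable that returns a status code, so the example runs without the `requests` library or network access.

```python
from typing import Callable


class LazyHealthCheck:
    """Hypothetical sketch: defer the reachability check until first use."""

    def __init__(self, base_url: str, get: Callable[[str], int]) -> None:
        # `get` returns an HTTP status code; real code could wrap requests.get.
        self.base_url = base_url
        self._get = get
        self._healthy = False  # no successful check has happened yet

    def is_available(self) -> bool:
        """Explicit check that callers can invoke on demand."""
        self._healthy = self._get(f"{self.base_url}/") == 200
        return self._healthy

    def ensure_available(self) -> None:
        """Lazy check: query the API until one check succeeds, then skip it."""
        if not self._healthy and not self.is_available():
            raise ValueError("The API appears to be down; please try again later.")
```

A `chunk()` method would call `ensure_available()` before its first request, so constructing the object stays free of network traffic.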

gemini-code-assist · 2025-05-23T01:04:00Z

src/chonkie/cloud/chunker/code.py

+        except Exception as error:
+            raise ValueError(f"Error parsing the response: {error}") from error


Catching a generic Exception when parsing the JSON response is quite broad. This can make debugging more difficult as it might catch and obscure unrelated errors that are not specific to JSON decoding.

Would it be possible to catch a more specific exception here? For instance, requests.exceptions.JSONDecodeError (if response.json() from the requests library is used and can raise this) or json.JSONDecodeError would be more targeted to issues during the parsing of the JSON response.

Suggested change

-        except Exception as error:
-            raise ValueError(f"Error parsing the response: {error}") from error
+        except requests.exceptions.JSONDecodeError as error:
+            raise ValueError(f"Error parsing the response: {error}") from error
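
The narrower exception handling can be illustrated with the standard library alone. In this sketch `parse_response_body` is a hypothetical helper, and `json.loads` on a raw string stands in for `response.json()`; the real client would catch whichever decode error its HTTP library raises.

```python
import json
from typing import Any


def parse_response_body(body: str) -> Any:
    """Hypothetical helper: parse a JSON body, translating only decode errors."""
    try:
        return json.loads(body)
    except json.JSONDecodeError as error:
        # Only malformed JSON lands here; unrelated bugs still surface unchanged.
        raise ValueError(f"Error parsing the response: {error}") from error
```

Because the handler names a single exception type, a programming error such as passing `None` as the body raises its own `TypeError` instead of being reported as a parsing problem.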

tests/cloud/test_cloud_code_chunker.py

codecov · 2025-05-23T01:06:53Z

❌ 276 Tests Failed:

Tests completed	Failed	Passed	Skipped
2676	276	2400	56

View the top 3 failed test(s) by shortest run time

tests.handshakes.test_qdrant_handshake::test_qdrant_handshake_write_multiple_chunks

Stack Traces | 0s run time

response = <Response [429]>, endpoint_name = None

    def hf_raise_for_status(response: Response, endpoint_name: Optional[str] = None) -> None:
        """
        Internal version of `response.raise_for_status()` that will refine a
        potential HTTPError. Raised exception will be an instance of `HfHubHTTPError`.
    
        This helper is meant to be the unique method to raise_for_status when making a call
        to the Hugging Face Hub.
    
    
        Example:
        ```py
            import requests
            from huggingface_hub.utils import get_session, hf_raise_for_status, HfHubHTTPError
    
            response = get_session().post(...)
            try:
                hf_raise_for_status(response)
            except HfHubHTTPError as e:
                print(str(e)) # formatted message
                e.request_id, e.server_message # details returned by server
    
                # Complete the error message with additional information once it's raised
                e.append_to_message("\n`create_commit` expects the repository to exist.")
                raise
        ```
    
        Args:
            response (`Response`):
                Response from the server.
            endpoint_name (`str`, *optional*):
                Name of the endpoint that has been called. If provided, the error message
                will be more complete.
    
        <Tip warning={true}>
    
        Raises when the request has failed:
    
            - [`~utils.RepositoryNotFoundError`]
                If the repository to download from cannot be found. This may be because it
                doesn't exist, because `repo_type` is not set correctly, or because the repo
                is `private` and you do not have access.
            - [`~utils.GatedRepoError`]
                If the repository exists but is gated and the user is not on the authorized
                list.
            - [`~utils.RevisionNotFoundError`]
                If the repository exists but the revision couldn't be find.
            - [`~utils.EntryNotFoundError`]
                If the repository exists but the entry (e.g. the requested file) couldn't be
                find.
            - [`~utils.BadRequestError`]
                If request failed with a HTTP 400 BadRequest error.
            - [`~utils.HfHubHTTPError`]
                If request failed for a reason not listed above.
    
        </Tip>
        """
        try:
>           response.raise_for_status()

.venv/lib/python3.10.../huggingface_hub/utils/_http.py:409: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <Response [429]>

    def raise_for_status(self):
        """Raises :class:`HTTPError`, if one occurred."""
    
        http_error_msg = ""
        if isinstance(self.reason, bytes):
            # We attempt to decode utf-8 first because some servers
            # choose to localize their reason strings. If the string
            # isn't utf-8, we fall back to iso-8859-1 for all other
            # encodings. (See PR #3538)
            try:
                reason = self.reason.decode("utf-8")
            except UnicodeDecodeError:
                reason = self.reason.decode("iso-8859-1")
        else:
            reason = self.reason
    
        if 400 <= self.status_code < 500:
            http_error_msg = (
                f"{self.status_code} Client Error: {reason} for url: {self.url}"
            )
    
        elif 500 <= self.status_code < 600:
            http_error_msg = (
                f"{self.status_code} Server Error: {reason} for url: {self.url}"
            )
    
        if http_error_msg:
>           raise HTTPError(http_error_msg, response=self)
E           requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://huggingface..../resolve/main/config.json

.venv/lib/python3.10.../site-packages/requests/models.py:1024: HTTPError

The above exception was the direct cause of the following exception:

    def _get_metadata_or_catch_error(
        *,
        repo_id: str,
        filename: str,
        repo_type: str,
        revision: str,
        endpoint: Optional[str],
        proxies: Optional[Dict],
        etag_timeout: Optional[float],
        headers: Dict[str, str],  # mutated inplace!
        token: Union[bool, str, None],
        local_files_only: bool,
        relative_filename: Optional[str] = None,  # only used to store `.no_exists` in cache
        storage_folder: Optional[str] = None,  # only used to store `.no_exists` in cache
    ) -> Union[
        # Either an exception is caught and returned
        Tuple[None, None, None, None, None, Exception],
        # Or the metadata is returned as
        # `(url_to_download, etag, commit_hash, expected_size, xet_file_data, None)`
        Tuple[str, str, str, int, Optional[XetFileData], None],
    ]:
        """Get metadata for a file on the Hub, safely handling network issues.
    
        Returns either the etag, commit_hash and expected size of the file, or the error
        raised while fetching the metadata.
    
        NOTE: This function mutates `headers` inplace! It removes the `authorization` header
              if the file is a LFS blob and the domain of the url is different from the
              domain of the location (typically an S3 bucket).
        """
        if local_files_only:
            return (
                None,
                None,
                None,
                None,
                None,
                OfflineModeIsEnabled(
                    f"Cannot access file since 'local_files_only=True' as been set. (repo_id: {repo_id}, repo_type: {repo_type}, revision: {revision}, filename: {filename})"
                ),
            )
    
        url = hf_hub_url(repo_id, filename, repo_type=repo_type, revision=revision, endpoint=endpoint)
        url_to_download: str = url
        etag: Optional[str] = None
        commit_hash: Optional[str] = None
        expected_size: Optional[int] = None
        head_error_call: Optional[Exception] = None
        xet_file_data: Optional[XetFileData] = None
    
        # Try to get metadata from the server.
        # Do not raise yet if the file is not found or not accessible.
        if not local_files_only:
            try:
                try:
>                   metadata = get_hf_file_metadata(
                        url=url, proxies=proxies, timeout=etag_timeout, headers=headers, token=token
                    )

.venv/lib/python3.10...................../site-packages/huggingface_hub/file_download.py:1484: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
.venv/lib/python3.10.../huggingface_hub/utils/_validators.py:114: in _inner_fn
    return fn(*args, **kwargs)
.venv/lib/python3.10...................../site-packages/huggingface_hub/file_download.py:1401: in get_hf_file_metadata
    r = _request_wrapper(
.venv/lib/python3.10...................../site-packages/huggingface_hub/file_download.py:285: in _request_wrapper
    response = _request_wrapper(
.venv/lib/python3.10...................../site-packages/huggingface_hub/file_download.py:309: in _request_wrapper
    hf_raise_for_status(response)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

response = <Response [429]>, endpoint_name = None

    def hf_raise_for_status(response: Response, endpoint_name: Optional[str] = None) -> None:
        """
        Internal version of `response.raise_for_status()` that will refine a
        potential HTTPError. Raised exception will be an instance of `HfHubHTTPError`.
    
        This helper is meant to be the unique method to raise_for_status when making a call
        to the Hugging Face Hub.
    
    
        Example:
        ```py
            import requests
            from huggingface_hub.utils import get_session, hf_raise_for_status, HfHubHTTPError
    
            response = get_session().post(...)
            try:
                hf_raise_for_status(response)
            except HfHubHTTPError as e:
                print(str(e)) # formatted message
                e.request_id, e.server_message # details returned by server
    
                # Complete the error message with additional information once it's raised
                e.append_to_message("\n`create_commit` expects the repository to exist.")
                raise
        ```
    
        Args:
            response (`Response`):
                Response from the server.
            endpoint_name (`str`, *optional*):
                Name of the endpoint that has been called. If provided, the error message
                will be more complete.
    
        <Tip warning={true}>
    
        Raises when the request has failed:
    
            - [`~utils.RepositoryNotFoundError`]
                If the repository to download from cannot be found. This may be because it
                doesn't exist, because `repo_type` is not set correctly, or because the repo
                is `private` and you do not have access.
            - [`~utils.GatedRepoError`]
                If the repository exists but is gated and the user is not on the authorized
                list.
            - [`~utils.RevisionNotFoundError`]
                If the repository exists but the revision couldn't be find.
            - [`~utils.EntryNotFoundError`]
                If the repository exists but the entry (e.g. the requested file) couldn't be
                find.
            - [`~utils.BadRequestError`]
                If request failed with a HTTP 400 BadRequest error.
            - [`~utils.HfHubHTTPError`]
                If request failed for a reason not listed above.
    
        </Tip>
        """
        try:
            response.raise_for_status()
        except HTTPError as e:
            error_code = response.headers.get("X-Error-Code")
            error_message = response.headers.get("X-Error-Message")
    
            if error_code == "RevisionNotFound":
                message = f"{response.status_code} Client Error." + "\n\n" + f"Revision Not Found for url: {response.url}."
                raise _format(RevisionNotFoundError, message, response) from e
    
            elif error_code == "EntryNotFound":
                message = f"{response.status_code} Client Error." + "\n\n" + f"Entry Not Found for url: {response.url}."
                raise _format(EntryNotFoundError, message, response) from e
    
            elif error_code == "GatedRepo":
                message = (
                    f"{response.status_code} Client Error." + "\n\n" + f"Cannot access gated repo for url {response.url}."
                )
                raise _format(GatedRepoError, message, response) from e
    
            elif error_message == "Access to this resource is disabled.":
                message = (
                    f"{response.status_code} Client Error."
                    + "\n\n"
                    + f"Cannot access repository for url {response.url}."
                    + "\n"
                    + "Access to this resource is disabled."
                )
                raise _format(DisabledRepoError, message, response) from e
    
            elif error_code == "RepoNotFound" or (
                response.status_code == 401
                and error_message != "Invalid credentials in Authorization header"
                and response.request is not None
                and response.request.url is not None
                and REPO_API_REGEX.search(response.request.url) is not None
            ):
                # 401 is misleading as it is returned for:
                #    - private and gated repos if user is not authenticated
                #    - missing repos
                # => for now, we process them as `RepoNotFound` anyway.
                # See https://gist.github.com/Wauplin/46c27ad266b15998ce56a6603796f0b9
                message = (
                    f"{response.status_code} Client Error."
                    + "\n\n"
                    + f"Repository Not Found for url: {response.url}."
                    + "\nPlease make sure you specified the correct `repo_id` and"
                    " `repo_type`.\nIf you are trying to access a private or gated repo,"
                    " make sure you are authenticated. For more details, see"
                    " https://huggingface..../docs/huggingface_hub/authentication"
                )
                raise _format(RepositoryNotFoundError, message, response) from e
    
            elif response.status_code == 400:
                message = (
                    f"\n\nBad request for {endpoint_name} endpoint:" if endpoint_name is not None else "\n\nBad request:"
                )
                raise _format(BadRequestError, message, response) from e
    
            elif response.status_code == 403:
                message = (
                    f"\n\n{response.status_code} Forbidden: {error_message}."
                    + f"\nCannot access content at: {response.url}."
                    + "\nMake sure your token has the correct permissions."
                )
                raise _format(HfHubHTTPError, message, response) from e
    
            elif response.status_code == 416:
                range_header = response.request.headers.get("Range")
                message = f"{e}. Requested range: {range_header}. Content-Range: {response.headers.get('Content-Range')}."
                raise _format(HfHubHTTPError, message, response) from e
    
            # Convert `HTTPError` into a `HfHubHTTPError` to display request information
            # as well (request id and/or server error message)
>           raise _format(HfHubHTTPError, str(e), response) from e
E           huggingface_hub.errors.HfHubHTTPError: 429 Client Error: Too Many Requests for url: https://huggingface..../resolve/main/config.json

.venv/lib/python3.10.../huggingface_hub/utils/_http.py:482: HfHubHTTPError

The above exception was the direct cause of the following exception:

path_or_repo_id = 'minishlab/potion-retrieval-32M', filenames = ['config.json']
cache_dir = '....../home/runner/.cache/huggingface/hub', force_download = False
resume_download = None, proxies = None, token = None, revision = None
local_files_only = False, subfolder = '', repo_type = None
user_agent = 'transformers/4.51.0; python/3.10.17; session_id/c713e718ca6b46e5b15cca429bb762e8; torch/2.6.0; file_type/config; from_auto_class/True'
_raise_exceptions_for_gated_repo = True
_raise_exceptions_for_missing_entries = True
_raise_exceptions_for_connection_errors = True, _commit_hash = None
deprecated_kwargs = {}, use_auth_token = None, full_filenames = ['config.json']
existing_files = [], filename = 'config.json', file_counter = 0

    def cached_files(
        path_or_repo_id: Union[str, os.PathLike],
        filenames: list[str],
        cache_dir: Optional[Union[str, os.PathLike]] = None,
        force_download: bool = False,
        resume_download: Optional[bool] = None,
        proxies: Optional[dict[str, str]] = None,
        token: Optional[Union[bool, str]] = None,
        revision: Optional[str] = None,
        local_files_only: bool = False,
        subfolder: str = "",
        repo_type: Optional[str] = None,
        user_agent: Optional[Union[str, dict[str, str]]] = None,
        _raise_exceptions_for_gated_repo: bool = True,
        _raise_exceptions_for_missing_entries: bool = True,
        _raise_exceptions_for_connection_errors: bool = True,
        _commit_hash: Optional[str] = None,
        **deprecated_kwargs,
    ) -> Optional[str]:
        """
        Tries to locate several files in a local folder and repo, downloads and cache them if necessary.
    
        Args:
            path_or_repo_id (`str` or `os.PathLike`):
                This can be either:
                - a string, the *model id* of a model repo on huggingface.co.
                - a path to a *directory* potentially containing the file.
            filenames (`List[str]`):
                The name of all the files to locate in `path_or_repo`.
            cache_dir (`str` or `os.PathLike`, *optional*):
                Path to a directory in which a downloaded pretrained model configuration should be cached if the standard
                cache should not be used.
            force_download (`bool`, *optional*, defaults to `False`):
                Whether or not to force to (re-)download the configuration files and override the cached versions if they
                exist.
            resume_download:
                Deprecated and ignored. All downloads are now resumed by default when possible.
                Will be removed in v5 of Transformers.
            proxies (`Dict[str, str]`, *optional*):
                A dictionary of proxy servers to use by protocol or endpoint, e.g., `{'http': 'foo.bar:3128',
                'http://hostname': 'foo.bar:4012'}.` The proxies are used on each request.
            token (`str` or *bool*, *optional*):
                The token to use as HTTP bearer authorization for remote files. If `True`, will use the token generated
                when running `huggingface-cli login` (stored in `~/.huggingface`).
            revision (`str`, *optional*, defaults to `"main"`):
                The specific model version to use. It can be a branch name, a tag name, or a commit id, since we use a
                git-based system for storing models and other artifacts on huggingface.co, so `revision` can be any
                identifier allowed by git.
            local_files_only (`bool`, *optional*, defaults to `False`):
                If `True`, will only try to load the tokenizer configuration from local files.
            subfolder (`str`, *optional*, defaults to `""`):
                In case the relevant files are located inside a subfolder of the model repo on huggingface.co, you can
                specify the folder name here.
            repo_type (`str`, *optional*):
                Specify the repo type (useful when downloading from a space for instance).
    
        Private args:
            _raise_exceptions_for_gated_repo (`bool`):
                if False, do not raise an exception for gated repo error but return None.
            _raise_exceptions_for_missing_entries (`bool`):
                if False, do not raise an exception for missing entries but return None.
            _raise_exceptions_for_connection_errors (`bool`):
                if False, do not raise an exception for connection errors but return None.
            _commit_hash (`str`, *optional*):
                passed when we are chaining several calls to various files (e.g. when loading a tokenizer or
                a pipeline). If files are cached for this commit hash, avoid calls to head and get from the cache.
    
        <Tip>
    
        Passing `token=True` is required when you want to use a private model.
    
        </Tip>
    
        Returns:
            `Optional[str]`: Returns the resolved file (to the cache folder if downloaded from a repo).
    
        Examples:
    
        ```python
        # Download a model weight from the Hub and cache it.
        model_weights_file = cached_file("google-bert/bert-base-uncased", "pytorch_model.bin")
        ```
        """
        use_auth_token = deprecated_kwargs.pop("use_auth_token", None)
        if use_auth_token is not None:
            warnings.warn(
                "The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.",
                FutureWarning,
            )
            if token is not None:
                raise ValueError("`token` and `use_auth_token` are both specified. Please set only the argument `token`.")
            token = use_auth_token
    
        if is_offline_mode() and not local_files_only:
            logger.info("Offline mode: forcing local_files_only=True")
            local_files_only = True
        if subfolder is None:
            subfolder = ""
    
        # Add folder to filenames
        full_filenames = [os.path.join(subfolder, file) for file in filenames]
    
        path_or_repo_id = str(path_or_repo_id)
        existing_files = []
        for filename in full_filenames:
            if os.path.isdir(path_or_repo_id):
                resolved_file = os.path.join(path_or_repo_id, filename)
                if not os.path.isfile(resolved_file):
                    if _raise_exceptions_for_missing_entries and filename != os.path.join(subfolder, "config.json"):
                        revision_ = "main" if revision is None else revision
                        raise OSError(
                            f"{path_or_repo_id} does not appear to have a file named {filename}. Checkout "
                            f"'https://huggingface.co/{path_or_repo_id}/tree/{revision_}' for available files."
                        )
                    else:
                        return None
                existing_files.append(resolved_file)
    
        # All files exist
        if len(existing_files) == len(full_filenames):
            return existing_files
    
        if cache_dir is None:
            cache_dir = TRANSFORMERS_CACHE
        if isinstance(cache_dir, Path):
            cache_dir = str(cache_dir)
    
        existing_files = []
        file_counter = 0
        if _commit_hash is not None and not force_download:
            for filename in full_filenames:
                # If the file is cached under that commit hash, we return it directly.
                resolved_file = try_to_load_from_cache(
                    path_or_repo_id, filename, cache_dir=cache_dir, revision=_commit_hash, repo_type=repo_type
                )
                if resolved_file is not None:
                    if resolved_file is not _CACHED_NO_EXIST:
                        file_counter += 1
                        existing_files.append(resolved_file)
                    elif not _raise_exceptions_for_missing_entries:
                        file_counter += 1
                    else:
                        raise OSError(f"Could not locate {filename} inside {path_or_repo_id}.")
    
        # Either all the files were found, or some were _CACHED_NO_EXIST but we do not raise for missing entries
        if file_counter == len(full_filenames):
            return existing_files if len(existing_files) > 0 else None
    
        user_agent = http_user_agent(user_agent)
        # download the files if needed
        try:
            if len(full_filenames) == 1:
                # This is slightly better for only 1 file
>               hf_hub_download(
                    path_or_repo_id,
                    filenames[0],
                    subfolder=None if len(subfolder) == 0 else subfolder,
                    repo_type=repo_type,
                    revision=revision,
                    cache_dir=cache_dir,
                    user_agent=user_agent,
                    force_download=force_download,
                    proxies=proxies,
                    resume_download=resume_download,
                    token=token,
                    local_files_only=local_files_only,
                )

.venv/lib/python3.10.../transformers/utils/hub.py:424: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
.venv/lib/python3.10.../huggingface_hub/utils/_validators.py:114: in _inner_fn
    return fn(*args, **kwargs)
.venv/lib/python3.10...................../site-packages/huggingface_hub/file_download.py:961: in hf_hub_download
    return _hf_hub_download_to_cache_dir(
.venv/lib/python3.10...................../site-packages/huggingface_hub/file_download.py:1068: in _hf_hub_download_to_cache_dir
    _raise_on_head_call_error(head_call_error, force_download, local_files_only)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

head_call_error = HfHubHTTPError('429 Client Error: Too Many Requests for url: https://huggingface..../resolve/main/config.json')
force_download = False, local_files_only = False

    def _raise_on_head_call_error(head_call_error: Exception, force_download: bool, local_files_only: bool) -> NoReturn:
        """Raise an appropriate error when the HEAD call failed and we cannot locate a local file."""
        # No head call => we cannot force download.
        if force_download:
            if local_files_only:
                raise ValueError("Cannot pass 'force_download=True' and 'local_files_only=True' at the same time.")
            elif isinstance(head_call_error, OfflineModeIsEnabled):
                raise ValueError("Cannot pass 'force_download=True' when offline mode is enabled.") from head_call_error
            else:
                raise ValueError("Force download failed due to the above error.") from head_call_error
    
        # No head call + couldn't find an appropriate file on disk => raise an error.
        if local_files_only:
            raise LocalEntryNotFoundError(
                "Cannot find the requested files in the disk cache and outgoing traffic has been disabled. To enable"
                " hf.co look-ups and downloads online, set 'local_files_only' to False."
            )
        elif isinstance(head_call_error, (RepositoryNotFoundError, GatedRepoError)) or (
            isinstance(head_call_error, HfHubHTTPError) and head_call_error.response.status_code == 401
        ):
            # Repo not found or gated => let's raise the actual error
            # Unauthorized => likely a token issue => let's raise the actual error
            raise head_call_error
        else:
            # Otherwise: most likely a connection issue or Hub downtime => let's warn the user
>           raise LocalEntryNotFoundError(
                "An error happened while trying to locate the file on the Hub and we cannot find the requested files"
                " in the local cache. Please check your connection and try again or make sure your Internet connection"
                " is on."
            ) from head_call_error
E           huggingface_hub.errors.LocalEntryNotFoundError: An error happened while trying to locate the file on the Hub and we cannot find the requested files in the local cache. Please check your connection and try again or make sure your Internet connection is on.

.venv/lib/python3.10...................../site-packages/huggingface_hub/file_download.py:1599: LocalEntryNotFoundError

The above exception was the direct cause of the following exception:

cls = <class 'chonkie.embeddings.auto.AutoEmbeddings'>
model = 'minishlab/potion-retrieval-32M', kwargs = {}
embeddings_instance = None, embeddings_cls = None
SentenceTransformerEmbeddings = <class 'chonkie.embeddings.sentence_transformer.SentenceTransformerEmbeddings'>

    @classmethod
    def get_embeddings(cls, model: Union[str, BaseEmbeddings, Any], **kwargs: Any) -> BaseEmbeddings:
        """Get embeddings instance based on identifier.
    
        Args:
            model: Identifier for the embeddings (name, path, URL, etc.)
            **kwargs: Additional arguments passed to the embeddings constructor
    
        Returns:
            Initialized embeddings instance
    
        Raises:
            ValueError: If no suitable embeddings implementation is found
    
        Examples:
            # Get sentence transformers embeddings
            embeddings = AutoEmbeddings.get_embeddings("sentence-transformers/all-MiniLM-L6-v2")
    
            # Get OpenAI embeddings
            embeddings = AutoEmbeddings.get_embeddings("openai://text-embedding-ada-002", api_key="...")
    
            # Get Anthropic embeddings
            embeddings = AutoEmbeddings.get_embeddings("anthropic://claude-v1", api_key="...")
    
            # Get Cohere embeddings
            embeddings = AutoEmbeddings.get_embeddings("cohere://embed-english-light-v3.0", api_key="...")
    
        """
        # Load embeddings instance if already provided
        if isinstance(model, BaseEmbeddings):
            return model
        elif isinstance(model, str):
            # Initializing the embedding instance
            embeddings_instance = None
    
            # Check if the user passed in a provider alias
            if "://" in model:
                provider, model_name = model.split("://")
                embeddings_cls = EmbeddingsRegistry.get_provider(provider)
                if embeddings_cls:
                    try:
                        return embeddings_cls(model_name, **kwargs)  # type: ignore
                    except Exception as error:
                        raise ValueError(f"Failed to load {model} with {embeddings_cls.__name__}, with error: {error}")
                else:
                    raise ValueError(f"No provider found for {provider}. Please check the provider name and try again.")
            else:
                # Try to find matching implementation via registry
                embeddings_cls = EmbeddingsRegistry.match(model)
                if embeddings_cls:
                        try:
                            # Try instantiating with the model identifier
                            embeddings_instance = embeddings_cls(model, **kwargs)  # type: ignore
                        except Exception as error:
                            warnings.warn(
                                f"Failed to load {model} with {embeddings_cls.__name__}: {error}\n"
                                f"Falling back to loading default provider model."
                            )
                            try:
                                # Try instantiating with the default provider model without the model identifier
                                embeddings_instance = embeddings_cls(**kwargs)
                            except Exception as error:
                                warnings.warn(
                                    f"Failed to load the default model for {embeddings_cls.__name__}: {error}\n"
                                    f"Falling back to SentenceTransformerEmbeddings."
                                )
    
            # If registry lookup and instantiation succeeded, return the instance
            if embeddings_instance:
                return embeddings_instance
    
            # If registry lookup and instantiation failed, return the default SentenceTransformerEmbeddings
            from .sentence_transformer import SentenceTransformerEmbeddings
            try:
>               return SentenceTransformerEmbeddings(model, **kwargs)

.../chonkie/embeddings/auto.py:107: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
.../chonkie/embeddings/sentence_transformer.py:49: in __init__
    self.model = SentenceTransformer(self.model_name_or_path, **kwargs)
.venv/lib/python3.10....../site-packages/sentence_transformers/SentenceTransformer.py:321: in __init__
    modules = self._load_auto_model(
.venv/lib/python3.10....../site-packages/sentence_transformers/SentenceTransformer.py:1600: in _load_auto_model
    transformer_model = Transformer(
.venv/lib/python3.10.../sentence_transformers/models/Transformer.py:80: in __init__
    config, is_peft_model = self._load_config(model_name_or_path, cache_dir, backend, config_args)
.venv/lib/python3.10.../sentence_transformers/models/Transformer.py:145: in _load_config
    return AutoConfig.from_pretrained(model_name_or_path, **config_args, cache_dir=cache_dir), False
.venv/lib/python3.10.../models/auto/configuration_auto.py:1112: in from_pretrained
    config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
.venv/lib/python3.10....../site-packages/transformers/configuration_utils.py:590: in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
.venv/lib/python3.10....../site-packages/transformers/configuration_utils.py:649: in _get_config_dict
    resolved_config_file = cached_file(
.venv/lib/python3.10.../transformers/utils/hub.py:266: in cached_file
    file = cached_files(path_or_repo_id=path_or_repo_id, filenames=[filename], **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

path_or_repo_id = 'minishlab/potion-retrieval-32M', filenames = ['config.json']
cache_dir = '....../home/runner/.cache/huggingface/hub', force_download = False
resume_download = None, proxies = None, token = None, revision = None
local_files_only = False, subfolder = '', repo_type = None
user_agent = 'transformers/4.51.0; python/3.10.17; session_id/c713e718ca6b46e5b15cca429bb762e8; torch/2.6.0; file_type/config; from_auto_class/True'
_raise_exceptions_for_gated_repo = True
_raise_exceptions_for_missing_entries = True
_raise_exceptions_for_connection_errors = True, _commit_hash = None
deprecated_kwargs = {}, use_auth_token = None, full_filenames = ['config.json']
existing_files = [], filename = 'config.json', file_counter = 0

    def cached_files(
        path_or_repo_id: Union[str, os.PathLike],
        filenames: list[str],
        cache_dir: Optional[Union[str, os.PathLike]] = None,
        force_download: bool = False,
        resume_download: Optional[bool] = None,
        proxies: Optional[dict[str, str]] = None,
        token: Optional[Union[bool, str]] = None,
        revision: Optional[str] = None,
        local_files_only: bool = False,
        subfolder: str = "",
        repo_type: Optional[str] = None,
        user_agent: Optional[Union[str, dict[str, str]]] = None,
        _raise_exceptions_for_gated_repo: bool = True,
        _raise_exceptions_for_missing_entries: bool = True,
        _raise_exceptions_for_connection_errors: bool = True,
        _commit_hash: Optional[str] = None,
        **deprecated_kwargs,
    ) -> Optional[str]:
        """
        Tries to locate several files in a local folder and repo, downloads and cache them if necessary.
    
        Args:
            path_or_repo_id (`str` or `os.PathLike`):
                This can be either:
                - a string, the *model id* of a model repo on huggingface.co.
                - a path to a *directory* potentially containing the file.
            filenames (`List[str]`):
                The name of all the files to locate in `path_or_repo`.
            cache_dir (`str` or `os.PathLike`, *optional*):
                Path to a directory in which a downloaded pretrained model configuration should be cached if the standard
                cache should not be used.
            force_download (`bool`, *optional*, defaults to `False`):
                Whether or not to force to (re-)download the configuration files and override the cached versions if they
                exist.
            resume_download:
                Deprecated and ignored. All downloads are now resumed by default when possible.
                Will be removed in v5 of Transformers.
            proxies (`Dict[str, str]`, *optional*):
                A dictionary of proxy servers to use by protocol or endpoint, e.g., `{'http': 'foo.bar:3128',
                'http://hostname': 'foo.bar:4012'}.` The proxies are used on each request.
            token (`str` or *bool*, *optional*):
                The token to use as HTTP bearer authorization for remote files. If `True`, will use the token generated
                when running `huggingface-cli login` (stored in `~/.huggingface`).
            revision (`str`, *optional*, defaults to `"main"`):
                The specific model version to use. It can be a branch name, a tag name, or a commit id, since we use a
                git-based system for storing models and other artifacts on huggingface.co, so `revision` can be any
                identifier allowed by git.
            local_files_only (`bool`, *optional*, defaults to `False`):
                If `True`, will only try to load the tokenizer configuration from local files.
            subfolder (`str`, *optional*, defaults to `""`):
                In case the relevant files are located inside a subfolder of the model repo on huggingface.co, you can
                specify the folder name here.
            repo_type (`str`, *optional*):
                Specify the repo type (useful when downloading from a space for instance).
    
        Private args:
            _raise_exceptions_for_gated_repo (`bool`):
                if False, do not raise an exception for gated repo error but return None.
            _raise_exceptions_for_missing_entries (`bool`):
                if False, do not raise an exception for missing entries but return None.
            _raise_exceptions_for_connection_errors (`bool`):
                if False, do not raise an exception for connection errors but return None.
            _commit_hash (`str`, *optional*):
                passed when we are chaining several calls to various files (e.g. when loading a tokenizer or
                a pipeline). If files are cached for this commit hash, avoid calls to head and get from the cache.
    
        <Tip>
    
        Passing `token=True` is required when you want to use a private model.
    
        </Tip>
    
        Returns:
            `Optional[str]`: Returns the resolved file (to the cache folder if downloaded from a repo).
    
        Examples:
    
        ```python
        # Download a model weight from the Hub and cache it.
        model_weights_file = cached_file("google-bert/bert-base-uncased", "pytorch_model.bin")
        ```
        """
        use_auth_token = deprecated_kwargs.pop("use_auth_token", None)
        if use_auth_token is not None:
            warnings.warn(
                "The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.",
                FutureWarning,
            )
            if token is not None:
                raise ValueError("`token` and `use_auth_token` are both specified. Please set only the argument `token`.")
            token = use_auth_token
    
        if is_offline_mode() and not local_files_only:
            logger.info("Offline mode: forcing local_files_only=True")
            local_files_only = True
        if subfolder is None:
            subfolder = ""
    
        # Add folder to filenames
        full_filenames = [os.path.join(subfolder, file) for file in filenames]
    
        path_or_repo_id = str(path_or_repo_id)
        existing_files = []
        for filename in full_filenames:
            if os.path.isdir(path_or_repo_id):
                resolved_file = os.path.join(path_or_repo_id, filename)
                if not os.path.isfile(resolved_file):
                    if _raise_exceptions_for_missing_entries and filename != os.path.join(subfolder, "config.json"):
                        revision_ = "main" if revision is None else revision
                        raise OSError(
                            f"{path_or_repo_id} does not appear to have a file named {filename}. Checkout "
                            f"'https://huggingface.co/{path_or_repo_id}/tree/{revision_}' for available files."
                        )
                    else:
                        return None
                existing_files.append(resolved_file)
    
        # All files exist
        if len(existing_files) == len(full_filenames):
            return existing_files
    
        if cache_dir is None:
            cache_dir = TRANSFORMERS_CACHE
        if isinstance(cache_dir, Path):
            cache_dir = str(cache_dir)
    
        existing_files = []
        file_counter = 0
        if _commit_hash is not None and not force_download:
            for filename in full_filenames:
                # If the file is cached under that commit hash, we return it directly.
                resolved_file = try_to_load_from_cache(
                    path_or_repo_id, filename, cache_dir=cache_dir, revision=_commit_hash, repo_type=repo_type
                )
                if resolved_file is not None:
                    if resolved_file is not _CACHED_NO_EXIST:
                        file_counter += 1
                        existing_files.append(resolved_file)
                    elif not _raise_exceptions_for_missing_entries:
                        file_counter += 1
                    else:
                        raise OSError(f"Could not locate {filename} inside {path_or_repo_id}.")
    
        # Either all the files were found, or some were _CACHED_NO_EXIST but we do not raise for missing entries
        if file_counter == len(full_filenames):
            return existing_files if len(existing_files) > 0 else None
    
        user_agent = http_user_agent(user_agent)
        # download the files if needed
        try:
            if len(full_filenames) == 1:
                # This is slightly better for only 1 file
                hf_hub_download(
                    path_or_repo_id,
                    filenames[0],
                    subfolder=None if len(subfolder) == 0 else subfolder,
                    repo_type=repo_type,
                    revision=revision,
                    cache_dir=cache_dir,
                    user_agent=user_agent,
                    force_download=force_download,
                    proxies=proxies,
                    resume_download=resume_download,
                    token=token,
                    local_files_only=local_files_only,
                )
            else:
                snapshot_download(
                    path_or_repo_id,
                    allow_patterns=full_filenames,
                    repo_type=repo_type,
                    revision=revision,
                    cache_dir=cache_dir,
                    user_agent=user_agent,
                    force_download=force_download,
                    proxies=proxies,
                    resume_download=resume_download,
                    token=token,
                    local_files_only=local_files_only,
                )
    
        except Exception as e:
            # We cannot recover from them
            if isinstance(e, RepositoryNotFoundError) and not isinstance(e, GatedRepoError):
                raise OSError(
                    f"{path_or_repo_id} is not a local folder and is not a valid model identifier "
                    "listed on 'https://huggingface.co/models'\nIf this is a private repository, make sure to pass a token "
                    "having permission to this repo either by logging in with `huggingface-cli login` or by passing "
                    "`token=<your_token>`"
                ) from e
            elif isinstance(e, RevisionNotFoundError):
                raise OSError(
                    f"{revision} is not a valid git identifier (branch name, tag name or commit id) that exists "
                    "for this model name. Check the model page at "
                    f"'https://huggingface.co/{path_or_repo_id}' for available revisions."
                ) from e
    
            # Now we try to recover if we can find all files correctly in the cache
            resolved_files = [
                _get_cache_file_to_return(path_or_repo_id, filename, cache_dir, revision) for filename in full_filenames
            ]
            if all(file is not None for file in resolved_files):
                return resolved_files
    
            # Raise based on the flags. Note that we will raise for missing entries at the very end, even when
            # not entering this Except block, as it may also happen when `snapshot_download` does not raise
            if isinstance(e, GatedRepoError):
                if not _raise_exceptions_for_gated_repo:
                    return None
                raise OSError(
                    "You are trying to access a gated repo.\nMake sure to have access to it at "
                    f"https://huggingface.co/{path_or_repo_id}.\n{str(e)}"
                ) from e
            elif isinstance(e, LocalEntryNotFoundError):
                if not _raise_exceptions_for_connection_errors:
                    return None
                # Here we only raise if both flags for missing entry and connection errors are True (because it can be raised
                # even when `local_files_only` is True, in which case raising for connections errors only would not make sense)
                elif _raise_exceptions_for_missing_entries:
>                   raise OSError(
                        f"We couldn't connect to '{HUGGINGFACE_CO_RESOLVE_ENDPOINT}' to load the files, and couldn't find them in the"
                        f" cached files.\nCheckout your internet connection or see how to run the library in offline mode at"
                        " 'https://huggingface..../docs/transformers/installation#offline-mode'."
                    ) from e
E                   OSError: We couldn't connect to 'https://huggingface.co' to load the files, and couldn't find them in the cached files.
E                   Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface..../docs/transformers/installation#offline-mode'.

.venv/lib/python3.10.../transformers/utils/hub.py:491: OSError

During handling of the above exception, another exception occurred:

    @pytest.fixture(scope="module")
    def real_embeddings() -> BaseEmbeddings:
        """Provide an instance of the actual default embedding model."""
        # Use scope="module" to load the model only once per test module run
        # Set environment variable to potentially avoid Hugging Face Hub login prompts in some CI environments
        os.environ["HF_HUB_DISABLE_PROGRESS_BARS"] = "1"
>       return AutoEmbeddings.get_embeddings(DEFAULT_EMBEDDING_MODEL)

tests/handshakes/test_qdrant_handshake.py:27: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

cls = <class 'chonkie.embeddings.auto.AutoEmbeddings'>
model = 'minishlab/potion-retrieval-32M', kwargs = {}
embeddings_instance = None, embeddings_cls = None
SentenceTransformerEmbeddings = <class 'chonkie.embeddings.sentence_transformer.SentenceTransformerEmbeddings'>

    @classmethod
    def get_embeddings(cls, model: Union[str, BaseEmbeddings, Any], **kwargs: Any) -> BaseEmbeddings:
        """Get embeddings instance based on identifier.
    
        Args:
            model: Identifier for the embeddings (name, path, URL, etc.)
            **kwargs: Additional arguments passed to the embeddings constructor
    
        Returns:
            Initialized embeddings instance
    
        Raises:
            ValueError: If no suitable embeddings implementation is found
    
        Examples:
            # Get sentence transformers embeddings
            embeddings = AutoEmbeddings.get_embeddings("sentence-transformers/all-MiniLM-L6-v2")
    
            # Get OpenAI embeddings
            embeddings = AutoEmbeddings.get_embeddings("openai://text-embedding-ada-002", api_key="...")
    
            # Get Anthropic embeddings
            embeddings = AutoEmbeddings.get_embeddings("anthropic://claude-v1", api_key="...")
    
            # Get Cohere embeddings
            embeddings = AutoEmbeddings.get_embeddings("cohere://embed-english-light-v3.0", api_key="...")
    
        """
        # Load embeddings instance if already provided
        if isinstance(model, BaseEmbeddings):
            return model
        elif isinstance(model, str):
            # Initializing the embedding instance
            embeddings_instance = None
    
            # Check if the user passed in a provider alias
            if "://" in model:
                provider, model_name = model.split("://")
                embeddings_cls = EmbeddingsRegistry.get_provider(provider)
                if embeddings_cls:
                    try:
                        return embeddings_cls(model_name, **kwargs)  # type: ignore
                    except Exception as error:
                        raise ValueError(f"Failed to load {model} with {embeddings_cls.__name__}, with error: {error}")
                else:
                    raise ValueError(f"No provider found for {provider}. Please check the provider name and try again.")
            else:
                # Try to find matching implementation via registry
                embeddings_cls = EmbeddingsRegistry.match(model)
                if embeddings_cls:
                        try:
                            # Try instantiating with the model identifier
                            embeddings_instance = embeddings_cls(model, **kwargs)  # type: ignore
                        except Exception as error:
                            warnings.warn(
                                f"Failed to load {model} with {embeddings_cls.__name__}: {error}\n"
                                f"Falling back to loading default provider model."
                            )
                            try:
                                # Try instantiating with the default provider model without the model identifier
                                embeddings_instance = embeddings_cls(**kwargs)
                            except Exception as error:
                                warnings.warn(
                                    f"Failed to load the default model for {embeddings_cls.__name__}: {error}\n"
                                    f"Falling back to SentenceTransformerEmbeddings."
                                )
    
            # If registry lookup and instantiation succeeded, return the instance
            if embeddings_instance:
                return embeddings_instance
    
            # If registry lookup and instantiation failed, return the default SentenceTransformerEmbeddings
            from .sentence_transformer import SentenceTransformerEmbeddings
            try:
                return SentenceTransformerEmbeddings(model, **kwargs)
            except Exception as e:
>               raise ValueError(f"Failed to load embeddings via SentenceTransformerEmbeddings after registry/fallback failure: {e}")
E               ValueError: Failed to load embeddings via SentenceTransformerEmbeddings after registry/fallback failure: We couldn't connect to 'https://huggingface.co' to load the files, and couldn't find them in the cached files.
E               Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface..../docs/transformers/installation#offline-mode'.

.../chonkie/embeddings/auto.py:109: ValueError
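
The failure above looks environmental rather than a defect in the code under test: the `real_embeddings` fixture could not fetch `minishlab/potion-retrieval-32M` from the Hugging Face Hub, and the traces in this report show an HTTP 429 (Too Many Requests) on `config.json`. One possible mitigation, sketched here under assumptions, is to retry the model load with exponential backoff and skip the dependent tests if the Hub stays unreachable. The helper `load_with_retry` and its parameters are hypothetical and not part of chonkie. It retries on `ValueError` and `OSError` because those are the exception types that surface in the trace above.

```python
import time
from typing import Callable, Optional, Tuple, Type, TypeVar

T = TypeVar("T")


def load_with_retry(
    loader: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (ValueError, OSError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `loader` up to `attempts` times, doubling the delay after each failure.

    Hypothetical helper for illustration; the names and defaults are assumptions.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    last_error: Optional[BaseException] = None
    for attempt in range(attempts):
        try:
            return loader()
        except retry_on as error:
            last_error = error
            # Back off before the next attempt, but not after the final one.
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    assert last_error is not None
    raise last_error


# Possible use inside the fixture (assumes skipping is acceptable in CI):
#
#     try:
#         return load_with_retry(
#             lambda: AutoEmbeddings.get_embeddings(DEFAULT_EMBEDDING_MODEL)
#         )
#     except ValueError as error:
#         pytest.skip(f"Could not load embedding model: {error}")
```

Skipping hides a genuinely broken model load, so a stricter alternative would be to pre-populate the Hub cache in CI and keep the fixture failing hard. Which trade-off fits this repository is a judgment call for the maintainers.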

tests.handshakes.test_qdrant_handshake::test_qdrant_handshake_init_existing_collection

Stack Traces | 0.001s run time

response = <Response [429]>, endpoint_name = None

    def hf_raise_for_status(response: Response, endpoint_name: Optional[str] = None) -> None:
        """
        Internal version of `response.raise_for_status()` that will refine a
        potential HTTPError. Raised exception will be an instance of `HfHubHTTPError`.
    
        This helper is meant to be the unique method to raise_for_status when making a call
        to the Hugging Face Hub.
    
    
        Example:
        ```py
            import requests
            from huggingface_hub.utils import get_session, hf_raise_for_status, HfHubHTTPError
    
            response = get_session().post(...)
            try:
                hf_raise_for_status(response)
            except HfHubHTTPError as e:
                print(str(e)) # formatted message
                e.request_id, e.server_message # details returned by server
    
                # Complete the error message with additional information once it's raised
                e.append_to_message("\n`create_commit` expects the repository to exist.")
                raise
        ```
    
        Args:
            response (`Response`):
                Response from the server.
            endpoint_name (`str`, *optional*):
                Name of the endpoint that has been called. If provided, the error message
                will be more complete.
    
        <Tip warning={true}>
    
        Raises when the request has failed:
    
            - [`~utils.RepositoryNotFoundError`]
                If the repository to download from cannot be found. This may be because it
                doesn't exist, because `repo_type` is not set correctly, or because the repo
                is `private` and you do not have access.
            - [`~utils.GatedRepoError`]
                If the repository exists but is gated and the user is not on the authorized
                list.
            - [`~utils.RevisionNotFoundError`]
                If the repository exists but the revision couldn't be find.
            - [`~utils.EntryNotFoundError`]
                If the repository exists but the entry (e.g. the requested file) couldn't be
                find.
            - [`~utils.BadRequestError`]
                If request failed with a HTTP 400 BadRequest error.
            - [`~utils.HfHubHTTPError`]
                If request failed for a reason not listed above.
    
        </Tip>
        """
        try:
>           response.raise_for_status()

.venv/lib/python3.11.../huggingface_hub/utils/_http.py:409: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <Response [429]>

    def raise_for_status(self):
        """Raises :class:`HTTPError`, if one occurred."""
    
        http_error_msg = ""
        if isinstance(self.reason, bytes):
            # We attempt to decode utf-8 first because some servers
            # choose to localize their reason strings. If the string
            # isn't utf-8, we fall back to iso-8859-1 for all other
            # encodings. (See PR #3538)
            try:
                reason = self.reason.decode("utf-8")
            except UnicodeDecodeError:
                reason = self.reason.decode("iso-8859-1")
        else:
            reason = self.reason
    
        if 400 <= self.status_code < 500:
            http_error_msg = (
                f"{self.status_code} Client Error: {reason} for url: {self.url}"
            )
    
        elif 500 <= self.status_code < 600:
            http_error_msg = (
                f"{self.status_code} Server Error: {reason} for url: {self.url}"
            )
    
        if http_error_msg:
>           raise HTTPError(http_error_msg, response=self)
E           requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://huggingface..../resolve/main/config.json

.venv/lib/python3.11.../site-packages/requests/models.py:1024: HTTPError

The above exception was the direct cause of the following exception:

    def _get_metadata_or_catch_error(
        *,
        repo_id: str,
        filename: str,
        repo_type: str,
        revision: str,
        endpoint: Optional[str],
        proxies: Optional[Dict],
        etag_timeout: Optional[float],
        headers: Dict[str, str],  # mutated inplace!
        token: Union[bool, str, None],
        local_files_only: bool,
        relative_filename: Optional[str] = None,  # only used to store `.no_exists` in cache
        storage_folder: Optional[str] = None,  # only used to store `.no_exists` in cache
    ) -> Union[
        # Either an exception is caught and returned
        Tuple[None, None, None, None, None, Exception],
        # Or the metadata is returned as
        # `(url_to_download, etag, commit_hash, expected_size, xet_file_data, None)`
        Tuple[str, str, str, int, Optional[XetFileData], None],
    ]:
        """Get metadata for a file on the Hub, safely handling network issues.
    
        Returns either the etag, commit_hash and expected size of the file, or the error
        raised while fetching the metadata.
    
        NOTE: This function mutates `headers` inplace! It removes the `authorization` header
              if the file is a LFS blob and the domain of the url is different from the
              domain of the location (typically an S3 bucket).
        """
        if local_files_only:
            return (
                None,
                None,
                None,
                None,
                None,
                OfflineModeIsEnabled(
                    f"Cannot access file since 'local_files_only=True' as been set. (repo_id: {repo_id}, repo_type: {repo_type}, revision: {revision}, filename: {filename})"
                ),
            )
    
        url = hf_hub_url(repo_id, filename, repo_type=repo_type, revision=revision, endpoint=endpoint)
        url_to_download: str = url
        etag: Optional[str] = None
        commit_hash: Optional[str] = None
        expected_size: Optional[int] = None
        head_error_call: Optional[Exception] = None
        xet_file_data: Optional[XetFileData] = None
    
        # Try to get metadata from the server.
        # Do not raise yet if the file is not found or not accessible.
        if not local_files_only:
            try:
                try:
>                   metadata = get_hf_file_metadata(
                        url=url, proxies=proxies, timeout=etag_timeout, headers=headers, token=token
                    )

.venv/lib/python3.11...................../site-packages/huggingface_hub/file_download.py:1484: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
.venv/lib/python3.11.../huggingface_hub/utils/_validators.py:114: in _inner_fn
    return fn(*args, **kwargs)
.venv/lib/python3.11...................../site-packages/huggingface_hub/file_download.py:1401: in get_hf_file_metadata
    r = _request_wrapper(
.venv/lib/python3.11...................../site-packages/huggingface_hub/file_download.py:285: in _request_wrapper
    response = _request_wrapper(
.venv/lib/python3.11...................../site-packages/huggingface_hub/file_download.py:309: in _request_wrapper
    hf_raise_for_status(response)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

response = <Response [429]>, endpoint_name = None

    def hf_raise_for_status(response: Response, endpoint_name: Optional[str] = None) -> None:
        """
        Internal version of `response.raise_for_status()` that will refine a
        potential HTTPError. Raised exception will be an instance of `HfHubHTTPError`.
    
        This helper is meant to be the unique method to raise_for_status when making a call
        to the Hugging Face Hub.
    
    
        Example:
        ```py
            import requests
            from huggingface_hub.utils import get_session, hf_raise_for_status, HfHubHTTPError
    
            response = get_session().post(...)
            try:
                hf_raise_for_status(response)
            except HfHubHTTPError as e:
                print(str(e)) # formatted message
                e.request_id, e.server_message # details returned by server
    
                # Complete the error message with additional information once it's raised
                e.append_to_message("\n`create_commit` expects the repository to exist.")
                raise
        ```
    
        Args:
            response (`Response`):
                Response from the server.
            endpoint_name (`str`, *optional*):
                Name of the endpoint that has been called. If provided, the error message
                will be more complete.
    
        <Tip warning={true}>
    
        Raises when the request has failed:
    
            - [`~utils.RepositoryNotFoundError`]
                If the repository to download from cannot be found. This may be because it
                doesn't exist, because `repo_type` is not set correctly, or because the repo
                is `private` and you do not have access.
            - [`~utils.GatedRepoError`]
                If the repository exists but is gated and the user is not on the authorized
                list.
            - [`~utils.RevisionNotFoundError`]
                If the repository exists but the revision couldn't be find.
            - [`~utils.EntryNotFoundError`]
                If the repository exists but the entry (e.g. the requested file) couldn't be
                find.
            - [`~utils.BadRequestError`]
                If request failed with a HTTP 400 BadRequest error.
            - [`~utils.HfHubHTTPError`]
                If request failed for a reason not listed above.
    
        </Tip>
        """
        try:
            response.raise_for_status()
        except HTTPError as e:
            error_code = response.headers.get("X-Error-Code")
            error_message = response.headers.get("X-Error-Message")
    
            if error_code == "RevisionNotFound":
                message = f"{response.status_code} Client Error." + "\n\n" + f"Revision Not Found for url: {response.url}."
                raise _format(RevisionNotFoundError, message, response) from e
    
            elif error_code == "EntryNotFound":
                message = f"{response.status_code} Client Error." + "\n\n" + f"Entry Not Found for url: {response.url}."
                raise _format(EntryNotFoundError, message, response) from e
    
            elif error_code == "GatedRepo":
                message = (
                    f"{response.status_code} Client Error." + "\n\n" + f"Cannot access gated repo for url {response.url}."
                )
                raise _format(GatedRepoError, message, response) from e
    
            elif error_message == "Access to this resource is disabled.":
                message = (
                    f"{response.status_code} Client Error."
                    + "\n\n"
                    + f"Cannot access repository for url {response.url}."
                    + "\n"
                    + "Access to this resource is disabled."
                )
                raise _format(DisabledRepoError, message, response) from e
    
            elif error_code == "RepoNotFound" or (
                response.status_code == 401
                and error_message != "Invalid credentials in Authorization header"
                and response.request is not None
                and response.request.url is not None
                and REPO_API_REGEX.search(response.request.url) is not None
            ):
                # 401 is misleading as it is returned for:
                #    - private and gated repos if user is not authenticated
                #    - missing repos
                # => for now, we process them as `RepoNotFound` anyway.
                # See https://gist.github.com/Wauplin/46c27ad266b15998ce56a6603796f0b9
                message = (
                    f"{response.status_code} Client Error."
                    + "\n\n"
                    + f"Repository Not Found for url: {response.url}."
                    + "\nPlease make sure you specified the correct `repo_id` and"
                    " `repo_type`.\nIf you are trying to access a private or gated repo,"
                    " make sure you are authenticated. For more details, see"
                    " https://huggingface..../docs/huggingface_hub/authentication"
                )
                raise _format(RepositoryNotFoundError, message, response) from e
    
            elif response.status_code == 400:
                message = (
                    f"\n\nBad request for {endpoint_name} endpoint:" if endpoint_name is not None else "\n\nBad request:"
                )
                raise _format(BadRequestError, message, response) from e
    
            elif response.status_code == 403:
                message = (
                    f"\n\n{response.status_code} Forbidden: {error_message}."
                    + f"\nCannot access content at: {response.url}."
                    + "\nMake sure your token has the correct permissions."
                )
                raise _format(HfHubHTTPError, message, response) from e
    
            elif response.status_code == 416:
                range_header = response.request.headers.get("Range")
                message = f"{e}. Requested range: {range_header}. Content-Range: {response.headers.get('Content-Range')}."
                raise _format(HfHubHTTPError, message, response) from e
    
            # Convert `HTTPError` into a `HfHubHTTPError` to display request information
            # as well (request id and/or server error message)
>           raise _format(HfHubHTTPError, str(e), response) from e
E           huggingface_hub.errors.HfHubHTTPError: 429 Client Error: Too Many Requests for url: https://huggingface..../resolve/main/config.json

.venv/lib/python3.11.../huggingface_hub/utils/_http.py:482: HfHubHTTPError

The above exception was the direct cause of the following exception:

path_or_repo_id = 'minishlab/potion-retrieval-32M', filenames = ['config.json']
cache_dir = '....../home/runner/.cache/huggingface/hub', force_download = False
resume_download = None, proxies = None, token = None, revision = None
local_files_only = False, subfolder = '', repo_type = None
user_agent = 'transformers/4.51.0; python/3.11.12; session_id/ee20371638994a52ad613f353cdbc8c0; torch/2.6.0; file_type/config; from_auto_class/True'
_raise_exceptions_for_gated_repo = True
_raise_exceptions_for_missing_entries = True
_raise_exceptions_for_connection_errors = True, _commit_hash = None
deprecated_kwargs = {}, use_auth_token = None, full_filenames = ['config.json']
existing_files = [], filename = 'config.json', file_counter = 0

    def cached_files(
        path_or_repo_id: Union[str, os.PathLike],
        filenames: list[str],
        cache_dir: Optional[Union[str, os.PathLike]] = None,
        force_download: bool = False,
        resume_download: Optional[bool] = None,
        proxies: Optional[dict[str, str]] = None,
        token: Optional[Union[bool, str]] = None,
        revision: Optional[str] = None,
        local_files_only: bool = False,
        subfolder: str = "",
        repo_type: Optional[str] = None,
        user_agent: Optional[Union[str, dict[str, str]]] = None,
        _raise_exceptions_for_gated_repo: bool = True,
        _raise_exceptions_for_missing_entries: bool = True,
        _raise_exceptions_for_connection_errors: bool = True,
        _commit_hash: Optional[str] = None,
        **deprecated_kwargs,
    ) -> Optional[str]:
        """
        Tries to locate several files in a local folder and repo, downloads and cache them if necessary.
    
        Args:
            path_or_repo_id (`str` or `os.PathLike`):
                This can be either:
                - a string, the *model id* of a model repo on huggingface.co.
                - a path to a *directory* potentially containing the file.
            filenames (`List[str]`):
                The name of all the files to locate in `path_or_repo`.
            cache_dir (`str` or `os.PathLike`, *optional*):
                Path to a directory in which a downloaded pretrained model configuration should be cached if the standard
                cache should not be used.
            force_download (`bool`, *optional*, defaults to `False`):
                Whether or not to force to (re-)download the configuration files and override the cached versions if they
                exist.
            resume_download:
                Deprecated and ignored. All downloads are now resumed by default when possible.
                Will be removed in v5 of Transformers.
            proxies (`Dict[str, str]`, *optional*):
                A dictionary of proxy servers to use by protocol or endpoint, e.g., `{'http': 'foo.bar:3128',
                'http://hostname': 'foo.bar:4012'}.` The proxies are used on each request.
            token (`str` or *bool*, *optional*):
                The token to use as HTTP bearer authorization for remote files. If `True`, will use the token generated
                when running `huggingface-cli login` (stored in `~/.huggingface`).
            revision (`str`, *optional*, defaults to `"main"`):
                The specific model version to use. It can be a branch name, a tag name, or a commit id, since we use a
                git-based system for storing models and other artifacts on huggingface.co, so `revision` can be any
                identifier allowed by git.
            local_files_only (`bool`, *optional*, defaults to `False`):
                If `True`, will only try to load the tokenizer configuration from local files.
            subfolder (`str`, *optional*, defaults to `""`):
                In case the relevant files are located inside a subfolder of the model repo on huggingface.co, you can
                specify the folder name here.
            repo_type (`str`, *optional*):
                Specify the repo type (useful when downloading from a space for instance).
    
        Private args:
            _raise_exceptions_for_gated_repo (`bool`):
                if False, do not raise an exception for gated repo error but return None.
            _raise_exceptions_for_missing_entries (`bool`):
                if False, do not raise an exception for missing entries but return None.
            _raise_exceptions_for_connection_errors (`bool`):
                if False, do not raise an exception for connection errors but return None.
            _commit_hash (`str`, *optional*):
                passed when we are chaining several calls to various files (e.g. when loading a tokenizer or
                a pipeline). If files are cached for this commit hash, avoid calls to head and get from the cache.
    
        <Tip>
    
        Passing `token=True` is required when you want to use a private model.
    
        </Tip>
    
        Returns:
            `Optional[str]`: Returns the resolved file (to the cache folder if downloaded from a repo).
    
        Examples:
    
        ```python
        # Download a model weight from the Hub and cache it.
        model_weights_file = cached_file("google-bert/bert-base-uncased", "pytorch_model.bin")
        ```
        """
        use_auth_token = deprecated_kwargs.pop("use_auth_token", None)
        if use_auth_token is not None:
            warnings.warn(
                "The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.",
                FutureWarning,
            )
            if token is not None:
                raise ValueError("`token` and `use_auth_token` are both specified. Please set only the argument `token`.")
            token = use_auth_token
    
        if is_offline_mode() and not local_files_only:
            logger.info("Offline mode: forcing local_files_only=True")
            local_files_only = True
        if subfolder is None:
            subfolder = ""
    
        # Add folder to filenames
        full_filenames = [os.path.join(subfolder, file) for file in filenames]
    
        path_or_repo_id = str(path_or_repo_id)
        existing_files = []
        for filename in full_filenames:
            if os.path.isdir(path_or_repo_id):
                resolved_file = os.path.join(path_or_repo_id, filename)
                if not os.path.isfile(resolved_file):
                    if _raise_exceptions_for_missing_entries and filename != os.path.join(subfolder, "config.json"):
                        revision_ = "main" if revision is None else revision
                        raise OSError(
                            f"{path_or_repo_id} does not appear to have a file named {filename}. Checkout "
                            f"'https://huggingface.co/{path_or_repo_id}/tree/{revision_}' for available files."
                        )
                    else:
                        return None
                existing_files.append(resolved_file)
    
        # All files exist
        if len(existing_files) == len(full_filenames):
            return existing_files
    
        if cache_dir is None:
            cache_dir = TRANSFORMERS_CACHE
        if isinstance(cache_dir, Path):
            cache_dir = str(cache_dir)
    
        existing_files = []
        file_counter = 0
        if _commit_hash is not None and not force_download:
            for filename in full_filenames:
                # If the file is cached under that commit hash, we return it directly.
                resolved_file = try_to_load_from_cache(
                    path_or_repo_id, filename, cache_dir=cache_dir, revision=_commit_hash, repo_type=repo_type
                )
                if resolved_file is not None:
                    if resolved_file is not _CACHED_NO_EXIST:
                        file_counter += 1
                        existing_files.append(resolved_file)
                    elif not _raise_exceptions_for_missing_entries:
                        file_counter += 1
                    else:
                        raise OSError(f"Could not locate {filename} inside {path_or_repo_id}.")
    
        # Either all the files were found, or some were _CACHED_NO_EXIST but we do not raise for missing entries
        if file_counter == len(full_filenames):
            return existing_files if len(existing_files) > 0 else None
    
        user_agent = http_user_agent(user_agent)
        # download the files if needed
        try:
            if len(full_filenames) == 1:
                # This is slightly better for only 1 file
>               hf_hub_download(
                    path_or_repo_id,
                    filenames[0],
                    subfolder=None if len(subfolder) == 0 else subfolder,
                    repo_type=repo_type,
                    revision=revision,
                    cache_dir=cache_dir,
                    user_agent=user_agent,
                    force_download=force_download,
                    proxies=proxies,
                    resume_download=resume_download,
                    token=token,
                    local_files_only=local_files_only,
                )

.venv/lib/python3.11.../transformers/utils/hub.py:424: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
.venv/lib/python3.11.../huggingface_hub/utils/_validators.py:114: in _inner_fn
    return fn(*args, **kwargs)
.venv/lib/python3.11...................../site-packages/huggingface_hub/file_download.py:961: in hf_hub_download
    return _hf_hub_download_to_cache_dir(
.venv/lib/python3.11...................../site-packages/huggingface_hub/file_download.py:1068: in _hf_hub_download_to_cache_dir
    _raise_on_head_call_error(head_call_error, force_download, local_files_only)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

head_call_error = HfHubHTTPError('429 Client Error: Too Many Requests for url: https://huggingface..../resolve/main/config.json')
force_download = False, local_files_only = False

    def _raise_on_head_call_error(head_call_error: Exception, force_download: bool, local_files_only: bool) -> NoReturn:
        """Raise an appropriate error when the HEAD call failed and we cannot locate a local file."""
        # No head call => we cannot force download.
        if force_download:
            if local_files_only:
                raise ValueError("Cannot pass 'force_download=True' and 'local_files_only=True' at the same time.")
            elif isinstance(head_call_error, OfflineModeIsEnabled):
                raise ValueError("Cannot pass 'force_download=True' when offline mode is enabled.") from head_call_error
            else:
                raise ValueError("Force download failed due to the above error.") from head_call_error
    
        # No head call + couldn't find an appropriate file on disk => raise an error.
        if local_files_only:
            raise LocalEntryNotFoundError(
                "Cannot find the requested files in the disk cache and outgoing traffic has been disabled. To enable"
                " hf.co look-ups and downloads online, set 'local_files_only' to False."
            )
        elif isinstance(head_call_error, (RepositoryNotFoundError, GatedRepoError)) or (
            isinstance(head_call_error, HfHubHTTPError) and head_call_error.response.status_code == 401
        ):
            # Repo not found or gated => let's raise the actual error
            # Unauthorized => likely a token issue => let's raise the actual error
            raise head_call_error
        else:
            # Otherwise: most likely a connection issue or Hub downtime => let's warn the user
>           raise LocalEntryNotFoundError(
                "An error happened while trying to locate the file on the Hub and we cannot find the requested files"
                " in the local cache. Please check your connection and try again or make sure your Internet connection"
                " is on."
            ) from head_call_error
E           huggingface_hub.errors.LocalEntryNotFoundError: An error happened while trying to locate the file on the Hub and we cannot find the requested files in the local cache. Please check your connection and try again or make sure your Internet connection is on.

.venv/lib/python3.11...................../site-packages/huggingface_hub/file_download.py:1599: LocalEntryNotFoundError

The above exception was the direct cause of the following exception:

cls = <class 'chonkie.embeddings.auto.AutoEmbeddings'>
model = 'minishlab/potion-retrieval-32M', kwargs = {}
embeddings_instance = None, embeddings_cls = None
SentenceTransformerEmbeddings = <class 'chonkie.embeddings.sentence_transformer.SentenceTransformerEmbeddings'>

    @classmethod
    def get_embeddings(cls, model: Union[str, BaseEmbeddings, Any], **kwargs: Any) -> BaseEmbeddings:
        """Get embeddings instance based on identifier.
    
        Args:
            model: Identifier for the embeddings (name, path, URL, etc.)
            **kwargs: Additional arguments passed to the embeddings constructor
    
        Returns:
            Initialized embeddings instance
    
        Raises:
            ValueError: If no suitable embeddings implementation is found
    
        Examples:
            # Get sentence transformers embeddings
            embeddings = AutoEmbeddings.get_embeddings("sentence-transformers/all-MiniLM-L6-v2")
    
            # Get OpenAI embeddings
            embeddings = AutoEmbeddings.get_embeddings("openai://text-embedding-ada-002", api_key="...")
    
            # Get Anthropic embeddings
            embeddings = AutoEmbeddings.get_embeddings("anthropic://claude-v1", api_key="...")
    
            # Get Cohere embeddings
            embeddings = AutoEmbeddings.get_embeddings("cohere://embed-english-light-v3.0", api_key="...")
    
        """
        # Load embeddings instance if already provided
        if isinstance(model, BaseEmbeddings):
            return model
        elif isinstance(model, str):
            # Initializing the embedding instance
            embeddings_instance = None
    
            # Check if the user passed in a provider alias
            if "://" in model:
                provider, model_name = model.split("://")
                embeddings_cls = EmbeddingsRegistry.get_provider(provider)
                if embeddings_cls:
                    try:
                        return embeddings_cls(model_name, **kwargs)  # type: ignore
                    except Exception as error:
                        raise ValueError(f"Failed to load {model} with {embeddings_cls.__name__}, with error: {error}")
                else:
                    raise ValueError(f"No provider found for {provider}. Please check the provider name and try again.")
            else:
                # Try to find matching implementation via registry
                embeddings_cls = EmbeddingsRegistry.match(model)
                if embeddings_cls:
                        try:
                            # Try instantiating with the model identifier
                            embeddings_instance = embeddings_cls(model, **kwargs)  # type: ignore
                        except Exception as error:
                            warnings.warn(
                                f"Failed to load {model} with {embeddings_cls.__name__}: {error}\n"
                                f"Falling back to loading default provider model."
                            )
                            try:
                                # Try instantiating with the default provider model without the model identifier
                                embeddings_instance = embeddings_cls(**kwargs)
                            except Exception as error:
                                warnings.warn(
                                    f"Failed to load the default model for {embeddings_cls.__name__}: {error}\n"
                                    f"Falling back to SentenceTransformerEmbeddings."
                                )
    
            # If registry lookup and instantiation succeeded, return the instance
            if embeddings_instance:
                return embeddings_instance
    
            # If registry lookup and instantiation failed, return the default SentenceTransformerEmbeddings
            from .sentence_transformer import SentenceTransformerEmbeddings
            try:
>               return SentenceTransformerEmbeddings(model, **kwargs)

.../chonkie/embeddings/auto.py:107: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
.../chonkie/embeddings/sentence_transformer.py:49: in __init__
    self.model = SentenceTransformer(self.model_name_or_path, **kwargs)
.venv/lib/python3.11....../site-packages/sentence_transformers/SentenceTransformer.py:321: in __init__
    modules = self._load_auto_model(
.venv/lib/python3.11....../site-packages/sentence_transformers/SentenceTransformer.py:1600: in _load_auto_model
    transformer_model = Transformer(
.venv/lib/python3.11.../sentence_transformers/models/Transformer.py:80: in __init__
    config, is_peft_model = self._load_config(model_name_or_path, cache_dir, backend, config_args)
.venv/lib/python3.11.../sentence_transformers/models/Transformer.py:145: in _load_config
    return AutoConfig.from_pretrained(model_name_or_path, **config_args, cache_dir=cache_dir), False
.venv/lib/python3.11.../models/auto/configuration_auto.py:1112: in from_pretrained
    config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
.venv/lib/python3.11....../site-packages/transformers/configuration_utils.py:590: in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
.venv/lib/python3.11....../site-packages/transformers/configuration_utils.py:649: in _get_config_dict
    resolved_config_file = cached_file(
.venv/lib/python3.11.../transformers/utils/hub.py:266: in cached_file
    file = cached_files(path_or_repo_id=path_or_repo_id, filenames=[filename], **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

path_or_repo_id = 'minishlab/potion-retrieval-32M', filenames = ['config.json']
cache_dir = '....../home/runner/.cache/huggingface/hub', force_download = False
resume_download = None, proxies = None, token = None, revision = None
local_files_only = False, subfolder = '', repo_type = None
user_agent = 'transformers/4.51.0; python/3.11.12; session_id/ee20371638994a52ad613f353cdbc8c0; torch/2.6.0; file_type/config; from_auto_class/True'
_raise_exceptions_for_gated_repo = True
_raise_exceptions_for_missing_entries = True
_raise_exceptions_for_connection_errors = True, _commit_hash = None
deprecated_kwargs = {}, use_auth_token = None, full_filenames = ['config.json']
existing_files = [], filename = 'config.json', file_counter = 0

    def cached_files(
        path_or_repo_id: Union[str, os.PathLike],
        filenames: list[str],
        cache_dir: Optional[Union[str, os.PathLike]] = None,
        force_download: bool = False,
        resume_download: Optional[bool] = None,
        proxies: Optional[dict[str, str]] = None,
        token: Optional[Union[bool, str]] = None,
        revision: Optional[str] = None,
        local_files_only: bool = False,
        subfolder: str = "",
        repo_type: Optional[str] = None,
        user_agent: Optional[Union[str, dict[str, str]]] = None,
        _raise_exceptions_for_gated_repo: bool = True,
        _raise_exceptions_for_missing_entries: bool = True,
        _raise_exceptions_for_connection_errors: bool = True,
        _commit_hash: Optional[str] = None,
        **deprecated_kwargs,
    ) -> Optional[str]:
        """
        Tries to locate several files in a local folder and repo, downloads and cache them if necessary.
    
        Args:
            path_or_repo_id (`str` or `os.PathLike`):
                This can be either:
                - a string, the *model id* of a model repo on huggingface.co.
                - a path to a *directory* potentially containing the file.
            filenames (`List[str]`):
                The name of all the files to locate in `path_or_repo`.
            cache_dir (`str` or `os.PathLike`, *optional*):
                Path to a directory in which a downloaded pretrained model configuration should be cached if the standard
                cache should not be used.
            force_download (`bool`, *optional*, defaults to `False`):
                Whether or not to force to (re-)download the configuration files and override the cached versions if they
                exist.
            resume_download:
                Deprecated and ignored. All downloads are now resumed by default when possible.
                Will be removed in v5 of Transformers.
            proxies (`Dict[str, str]`, *optional*):
                A dictionary of proxy servers to use by protocol or endpoint, e.g., `{'http': 'foo.bar:3128',
                'http://hostname': 'foo.bar:4012'}.` The proxies are used on each request.
            token (`str` or *bool*, *optional*):
                The token to use as HTTP bearer authorization for remote files. If `True`, will use the token generated
                when running `huggingface-cli login` (stored in `~/.huggingface`).
            revision (`str`, *optional*, defaults to `"main"`):
                The specific model version to use. It can be a branch name, a tag name, or a commit id, since we use a
                git-based system for storing models and other artifacts on huggingface.co, so `revision` can be any
                identifier allowed by git.
            local_files_only (`bool`, *optional*, defaults to `False`):
                If `True`, will only try to load the tokenizer configuration from local files.
            subfolder (`str`, *optional*, defaults to `""`):
                In case the relevant files are located inside a subfolder of the model repo on huggingface.co, you can
                specify the folder name here.
            repo_type (`str`, *optional*):
                Specify the repo type (useful when downloading from a space for instance).
    
        Private args:
            _raise_exceptions_for_gated_repo (`bool`):
                if False, do not raise an exception for gated repo error but return None.
            _raise_exceptions_for_missing_entries (`bool`):
                if False, do not raise an exception for missing entries but return None.
            _raise_exceptions_for_connection_errors (`bool`):
                if False, do not raise an exception for connection errors but return None.
            _commit_hash (`str`, *optional*):
                passed when we are chaining several calls to various files (e.g. when loading a tokenizer or
                a pipeline). If files are cached for this commit hash, avoid calls to head and get from the cache.
    
        <Tip>
    
        Passing `token=True` is required when you want to use a private model.
    
        </Tip>
    
        Returns:
            `Optional[str]`: Returns the resolved file (to the cache folder if downloaded from a repo).
    
        Examples:
    
        ```python
        # Download a model weight from the Hub and cache it.
        model_weights_file = cached_file("google-bert/bert-base-uncased", "pytorch_model.bin")
        ```
        """
        use_auth_token = deprecated_kwargs.pop("use_auth_token", None)
        if use_auth_token is not None:
            warnings.warn(
                "The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.",
                FutureWarning,
            )
            if token is not None:
                raise ValueError("`token` and `use_auth_token` are both specified. Please set only the argument `token`.")
            token = use_auth_token
    
        if is_offline_mode() and not local_files_only:
            logger.info("Offline mode: forcing local_files_only=True")
            local_files_only = True
        if subfolder is None:
            subfolder = ""
    
        # Add folder to filenames
        full_filenames = [os.path.join(subfolder, file) for file in filenames]
    
        path_or_repo_id = str(path_or_repo_id)
        existing_files = []
        for filename in full_filenames:
            if os.path.isdir(path_or_repo_id):
                resolved_file = os.path.join(path_or_repo_id, filename)
                if not os.path.isfile(resolved_file):
                    if _raise_exceptions_for_missing_entries and filename != os.path.join(subfolder, "config.json"):
                        revision_ = "main" if revision is None else revision
                        raise OSError(
                            f"{path_or_repo_id} does not appear to have a file named {filename}. Checkout "
                            f"'https://huggingface.co/{path_or_repo_id}/tree/{revision_}' for available files."
                        )
                    else:
                        return None
                existing_files.append(resolved_file)
    
        # All files exist
        if len(existing_files) == len(full_filenames):
            return existing_files
    
        if cache_dir is None:
            cache_dir = TRANSFORMERS_CACHE
        if isinstance(cache_dir, Path):
            cache_dir = str(cache_dir)
    
        existing_files = []
        file_counter = 0
        if _commit_hash is not None and not force_download:
            for filename in full_filenames:
                # If the file is cached under that commit hash, we return it directly.
                resolved_file = try_to_load_from_cache(
                    path_or_repo_id, filename, cache_dir=cache_dir, revision=_commit_hash, repo_type=repo_type
                )
                if resolved_file is not None:
                    if resolved_file is not _CACHED_NO_EXIST:
                        file_counter += 1
                        existing_files.append(resolved_file)
                    elif not _raise_exceptions_for_missing_entries:
                        file_counter += 1
                    else:
                        raise OSError(f"Could not locate {filename} inside {path_or_repo_id}.")
    
        # Either all the files were found, or some were _CACHED_NO_EXIST but we do not raise for missing entries
        if file_counter == len(full_filenames):
            return existing_files if len(existing_files) > 0 else None
    
        user_agent = http_user_agent(user_agent)
        # download the files if needed
        try:
            if len(full_filenames) == 1:
                # This is slightly better for only 1 file
                hf_hub_download(
                    path_or_repo_id,
                    filenames[0],
                    subfolder=None if len(subfolder) == 0 else subfolder,
                    repo_type=repo_type,
                    revision=revision,
                    cache_dir=cache_dir,
                    user_agent=user_agent,
                    force_download=force_download,
                    proxies=proxies,
                    resume_download=resume_download,
                    token=token,
                    local_files_only=local_files_only,
                )
            else:
                snapshot_download(
                    path_or_repo_id,
                    allow_patterns=full_filenames,
                    repo_type=repo_type,
                    revision=revision,
                    cache_dir=cache_dir,
                    user_agent=user_agent,
                    force_download=force_download,
                    proxies=proxies,
                    resume_download=resume_download,
                    token=token,
                    local_files_only=local_files_only,
                )
    
        except Exception as e:
            # We cannot recover from them
            if isinstance(e, RepositoryNotFoundError) and not isinstance(e, GatedRepoError):
                raise OSError(
                    f"{path_or_repo_id} is not a local folder and is not a valid model identifier "
                    "listed on 'https://huggingface.co/models'\nIf this is a private repository, make sure to pass a token "
                    "having permission to this repo either by logging in with `huggingface-cli login` or by passing "
                    "`token=<your_token>`"
                ) from e
            elif isinstance(e, RevisionNotFoundError):
                raise OSError(
                    f"{revision} is not a valid git identifier (branch name, tag name or commit id) that exists "
                    "for this model name. Check the model page at "
                    f"'https://huggingface.co/{path_or_repo_id}' for available revisions."
                ) from e
    
            # Now we try to recover if we can find all files correctly in the cache
            resolved_files = [
                _get_cache_file_to_return(path_or_repo_id, filename, cache_dir, revision) for filename in full_filenames
            ]
            if all(file is not None for file in resolved_files):
                return resolved_files
    
            # Raise based on the flags. Note that we will raise for missing entries at the very end, even when
            # not entering this Except block, as it may also happen when `snapshot_download` does not raise
            if isinstance(e, GatedRepoError):
                if not _raise_exceptions_for_gated_repo:
                    return None
                raise OSError(
                    "You are trying to access a gated repo.\nMake sure to have access to it at "
                    f"https://huggingface.co/{path_or_repo_id}.\n{str(e)}"
                ) from e
            elif isinstance(e, LocalEntryNotFoundError):
                if not _raise_exceptions_for_connection_errors:
                    return None
                # Here we only raise if both flags for missing entry and connection errors are True (because it can be raised
                # even when `local_files_only` is True, in which case raising for connections errors only would not make sense)
                elif _raise_exceptions_for_missing_entries:
>                   raise OSError(
                        f"We couldn't connect to '{HUGGINGFACE_CO_RESOLVE_ENDPOINT}' to load the files, and couldn't find them in the"
                        f" cached files.\nCheckout your internet connection or see how to run the library in offline mode at"
                        " 'https://huggingface..../docs/transformers/installation#offline-mode'."
                    ) from e
E                   OSError: We couldn't connect to 'https://huggingface.co' to load the files, and couldn't find them in the cached files.
E                   Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface..../docs/transformers/installation#offline-mode'.

.venv/lib/python3.11.../transformers/utils/hub.py:491: OSError

During handling of the above exception, another exception occurred:

    @pytest.fixture(scope="module")
    def real_embeddings() -> BaseEmbeddings:
        """Provide an instance of the actual default embedding model."""
        # Use scope="module" to load the model only once per test module run
        # Set environment variable to potentially avoid Hugging Face Hub login prompts in some CI environments
        os.environ["HF_HUB_DISABLE_PROGRESS_BARS"] = "1"
>       return AutoEmbeddings.get_embeddings(DEFAULT_EMBEDDING_MODEL)

tests/handshakes/test_qdrant_handshake.py:27: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

cls = <class 'chonkie.embeddings.auto.AutoEmbeddings'>
model = 'minishlab/potion-retrieval-32M', kwargs = {}
embeddings_instance = None, embeddings_cls = None
SentenceTransformerEmbeddings = <class 'chonkie.embeddings.sentence_transformer.SentenceTransformerEmbeddings'>

    @classmethod
    def get_embeddings(cls, model: Union[str, BaseEmbeddings, Any], **kwargs: Any) -> BaseEmbeddings:
        """Get embeddings instance based on identifier.
    
        Args:
            model: Identifier for the embeddings (name, path, URL, etc.)
            **kwargs: Additional arguments passed to the embeddings constructor
    
        Returns:
            Initialized embeddings instance
    
        Raises:
            ValueError: If no suitable embeddings implementation is found
    
        Examples:
            # Get sentence transformers embeddings
            embeddings = AutoEmbeddings.get_embeddings("sentence-transformers/all-MiniLM-L6-v2")
    
            # Get OpenAI embeddings
            embeddings = AutoEmbeddings.get_embeddings("openai://text-embedding-ada-002", api_key="...")
    
            # Get Anthropic embeddings
            embeddings = AutoEmbeddings.get_embeddings("anthropic://claude-v1", api_key="...")
    
            # Get Cohere embeddings
            embeddings = AutoEmbeddings.get_embeddings("cohere://embed-english-light-v3.0", api_key="...")
    
        """
        # Load embeddings instance if already provided
        if isinstance(model, BaseEmbeddings):
            return model
        elif isinstance(model, str):
            # Initializing the embedding instance
            embeddings_instance = None
    
            # Check if the user passed in a provider alias
            if "://" in model:
                provider, model_name = model.split("://")
                embeddings_cls = EmbeddingsRegistry.get_provider(provider)
                if embeddings_cls:
                    try:
                        return embeddings_cls(model_name, **kwargs)  # type: ignore
                    except Exception as error:
                        raise ValueError(f"Failed to load {model} with {embeddings_cls.__name__}, with error: {error}")
                else:
                    raise ValueError(f"No provider found for {provider}. Please check the provider name and try again.")
            else:
                # Try to find matching implementation via registry
                embeddings_cls = EmbeddingsRegistry.match(model)
                if embeddings_cls:
                        try:
                            # Try instantiating with the model identifier
                            embeddings_instance = embeddings_cls(model, **kwargs)  # type: ignore
                        except Exception as error:
                            warnings.warn(
                                f"Failed to load {model} with {embeddings_cls.__name__}: {error}\n"
                                f"Falling back to loading default provider model."
                            )
                            try:
                                # Try instantiating with the default provider model without the model identifier
                                embeddings_instance = embeddings_cls(**kwargs)
                            except Exception as error:
                                warnings.warn(
                                    f"Failed to load the default model for {embeddings_cls.__name__}: {error}\n"
                                    f"Falling back to SentenceTransformerEmbeddings."
                                )
    
            # If registry lookup and instantiation succeeded, return the instance
            if embeddings_instance:
                return embeddings_instance
    
            # If registry lookup and instantiation failed, return the default SentenceTransformerEmbeddings
            from .sentence_transformer import SentenceTransformerEmbeddings
            try:
                return SentenceTransformerEmbeddings(model, **kwargs)
            except Exception as e:
>               raise ValueError(f"Failed to load embeddings via SentenceTransformerEmbeddings after registry/fallback failure: {e}")
E               ValueError: Failed to load embeddings via SentenceTransformerEmbeddings after registry/fallback failure: We couldn't connect to 'https://huggingface.co' to load the files, and couldn't find them in the cached files.
E               Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface..../docs/transformers/installation#offline-mode'.

.../chonkie/embeddings/auto.py:109: ValueError
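
The trace above shows the module-scoped `real_embeddings` fixture failing because the Hub returned HTTP 429 for `config.json` and nothing was cached locally, so the failure is environmental rather than a defect in the code under test. One possible mitigation, sketched below, is to retry the load with exponential backoff before giving up. The helper `load_with_retry` and its parameters are hypothetical and not part of chonkie or `huggingface_hub`; the commented usage assumes the fixture would rather skip than error when the Hub stays unavailable.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def load_with_retry(
    loader: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (ValueError, OSError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `loader` up to `attempts` times, doubling the delay after each failure.

    Hypothetical helper: `retry_on` defaults to the two exception types seen in
    the trace (ValueError from get_embeddings, OSError from transformers).
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(attempts):
        try:
            return loader()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error unchanged
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")


# Assumed usage inside the fixture from the trace (not executed here):
#
# try:
#     return load_with_retry(
#         lambda: AutoEmbeddings.get_embeddings(DEFAULT_EMBEDDING_MODEL)
#     )
# except ValueError as exc:
#     pytest.skip(f"Hugging Face Hub unavailable: {exc}")
```

Passing `sleep` as a parameter keeps the helper testable without real waiting. Whether skipping is acceptable here, or whether the model should instead be cached ahead of the CI run, is a project decision this sketch does not settle.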

tests.handshakes.test_qdrant_handshake::test_qdrant_handshake_write_multiple_chunks

Stack Traces | 0.001s run time

response = <Response [429]>, endpoint_name = None

    def hf_raise_for_status(response: Response, endpoint_name: Optional[str] = None) -> None:
        """
        Internal version of `response.raise_for_status()` that will refine a
        potential HTTPError. Raised exception will be an instance of `HfHubHTTPError`.
    
        This helper is meant to be the unique method to raise_for_status when making a call
        to the Hugging Face Hub.
    
    
        Example:
        ```py
            import requests
            from huggingface_hub.utils import get_session, hf_raise_for_status, HfHubHTTPError
    
            response = get_session().post(...)
            try:
                hf_raise_for_status(response)
            except HfHubHTTPError as e:
                print(str(e)) # formatted message
                e.request_id, e.server_message # details returned by server
    
                # Complete the error message with additional information once it's raised
                e.append_to_message("\n`create_commit` expects the repository to exist.")
                raise
        ```
    
        Args:
            response (`Response`):
                Response from the server.
            endpoint_name (`str`, *optional*):
                Name of the endpoint that has been called. If provided, the error message
                will be more complete.
    
        <Tip warning={true}>
    
        Raises when the request has failed:
    
            - [`~utils.RepositoryNotFoundError`]
                If the repository to download from cannot be found. This may be because it
                doesn't exist, because `repo_type` is not set correctly, or because the repo
                is `private` and you do not have access.
            - [`~utils.GatedRepoError`]
                If the repository exists but is gated and the user is not on the authorized
                list.
            - [`~utils.RevisionNotFoundError`]
                If the repository exists but the revision couldn't be find.
            - [`~utils.EntryNotFoundError`]
                If the repository exists but the entry (e.g. the requested file) couldn't be
                find.
            - [`~utils.BadRequestError`]
                If request failed with a HTTP 400 BadRequest error.
            - [`~utils.HfHubHTTPError`]
                If request failed for a reason not listed above.
    
        </Tip>
        """
        try:
>           response.raise_for_status()

.venv/lib/python3.11.../huggingface_hub/utils/_http.py:409: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <Response [429]>

    def raise_for_status(self):
        """Raises :class:`HTTPError`, if one occurred."""
    
        http_error_msg = ""
        if isinstance(self.reason, bytes):
            # We attempt to decode utf-8 first because some servers
            # choose to localize their reason strings. If the string
            # isn't utf-8, we fall back to iso-8859-1 for all other
            # encodings. (See PR #3538)
            try:
                reason = self.reason.decode("utf-8")
            except UnicodeDecodeError:
                reason = self.reason.decode("iso-8859-1")
        else:
            reason = self.reason
    
        if 400 <= self.status_code < 500:
            http_error_msg = (
                f"{self.status_code} Client Error: {reason} for url: {self.url}"
            )
    
        elif 500 <= self.status_code < 600:
            http_error_msg = (
                f"{self.status_code} Server Error: {reason} for url: {self.url}"
            )
    
        if http_error_msg:
>           raise HTTPError(http_error_msg, response=self)
E           requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://huggingface..../resolve/main/config.json

.venv/lib/python3.11.../site-packages/requests/models.py:1024: HTTPError

The above exception was the direct cause of the following exception:

    def _get_metadata_or_catch_error(
        *,
        repo_id: str,
        filename: str,
        repo_type: str,
        revision: str,
        endpoint: Optional[str],
        proxies: Optional[Dict],
        etag_timeout: Optional[float],
        headers: Dict[str, str],  # mutated inplace!
        token: Union[bool, str, None],
        local_files_only: bool,
        relative_filename: Optional[str] = None,  # only used to store `.no_exists` in cache
        storage_folder: Optional[str] = None,  # only used to store `.no_exists` in cache
    ) -> Union[
        # Either an exception is caught and returned
        Tuple[None, None, None, None, None, Exception],
        # Or the metadata is returned as
        # `(url_to_download, etag, commit_hash, expected_size, xet_file_data, None)`
        Tuple[str, str, str, int, Optional[XetFileData], None],
    ]:
        """Get metadata for a file on the Hub, safely handling network issues.
    
        Returns either the etag, commit_hash and expected size of the file, or the error
        raised while fetching the metadata.
    
        NOTE: This function mutates `headers` inplace! It removes the `authorization` header
              if the file is a LFS blob and the domain of the url is different from the
              domain of the location (typically an S3 bucket).
        """
        if local_files_only:
            return (
                None,
                None,
                None,
                None,
                None,
                OfflineModeIsEnabled(
                    f"Cannot access file since 'local_files_only=True' as been set. (repo_id: {repo_id}, repo_type: {repo_type}, revision: {revision}, filename: {filename})"
                ),
            )
    
        url = hf_hub_url(repo_id, filename, repo_type=repo_type, revision=revision, endpoint=endpoint)
        url_to_download: str = url
        etag: Optional[str] = None
        commit_hash: Optional[str] = None
        expected_size: Optional[int] = None
        head_error_call: Optional[Exception] = None
        xet_file_data: Optional[XetFileData] = None
    
        # Try to get metadata from the server.
        # Do not raise yet if the file is not found or not accessible.
        if not local_files_only:
            try:
                try:
>                   metadata = get_hf_file_metadata(
                        url=url, proxies=proxies, timeout=etag_timeout, headers=headers, token=token
                    )

.venv/lib/python3.11...................../site-packages/huggingface_hub/file_download.py:1484: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
.venv/lib/python3.11.../huggingface_hub/utils/_validators.py:114: in _inner_fn
    return fn(*args, **kwargs)
.venv/lib/python3.11...................../site-packages/huggingface_hub/file_download.py:1401: in get_hf_file_metadata
    r = _request_wrapper(
.venv/lib/python3.11...................../site-packages/huggingface_hub/file_download.py:285: in _request_wrapper
    response = _request_wrapper(
.venv/lib/python3.11...................../site-packages/huggingface_hub/file_download.py:309: in _request_wrapper
    hf_raise_for_status(response)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

response = <Response [429]>, endpoint_name = None

    def hf_raise_for_status(response: Response, endpoint_name: Optional[str] = None) -> None:
        """
        Internal version of `response.raise_for_status()` that will refine a
        potential HTTPError. Raised exception will be an instance of `HfHubHTTPError`.
    
        This helper is meant to be the unique method to raise_for_status when making a call
        to the Hugging Face Hub.
    
    
        Example:
        ```py
            import requests
            from huggingface_hub.utils import get_session, hf_raise_for_status, HfHubHTTPError
    
            response = get_session().post(...)
            try:
                hf_raise_for_status(response)
            except HfHubHTTPError as e:
                print(str(e)) # formatted message
                e.request_id, e.server_message # details returned by server
    
                # Complete the error message with additional information once it's raised
                e.append_to_message("\n`create_commit` expects the repository to exist.")
                raise
        ```
    
        Args:
            response (`Response`):
                Response from the server.
            endpoint_name (`str`, *optional*):
                Name of the endpoint that has been called. If provided, the error message
                will be more complete.
    
        <Tip warning={true}>
    
        Raises when the request has failed:
    
            - [`~utils.RepositoryNotFoundError`]
                If the repository to download from cannot be found. This may be because it
                doesn't exist, because `repo_type` is not set correctly, or because the repo
                is `private` and you do not have access.
            - [`~utils.GatedRepoError`]
                If the repository exists but is gated and the user is not on the authorized
                list.
            - [`~utils.RevisionNotFoundError`]
                If the repository exists but the revision couldn't be find.
            - [`~utils.EntryNotFoundError`]
                If the repository exists but the entry (e.g. the requested file) couldn't be
                find.
            - [`~utils.BadRequestError`]
                If request failed with a HTTP 400 BadRequest error.
            - [`~utils.HfHubHTTPError`]
                If request failed for a reason not listed above.
    
        </Tip>
        """
        try:
            response.raise_for_status()
        except HTTPError as e:
            error_code = response.headers.get("X-Error-Code")
            error_message = response.headers.get("X-Error-Message")
    
            if error_code == "RevisionNotFound":
                message = f"{response.status_code} Client Error." + "\n\n" + f"Revision Not Found for url: {response.url}."
                raise _format(RevisionNotFoundError, message, response) from e
    
            elif error_code == "EntryNotFound":
                message = f"{response.status_code} Client Error." + "\n\n" + f"Entry Not Found for url: {response.url}."
                raise _format(EntryNotFoundError, message, response) from e
    
            elif error_code == "GatedRepo":
                message = (
                    f"{response.status_code} Client Error." + "\n\n" + f"Cannot access gated repo for url {response.url}."
                )
                raise _format(GatedRepoError, message, response) from e
    
            elif error_message == "Access to this resource is disabled.":
                message = (
                    f"{response.status_code} Client Error."
                    + "\n\n"
                    + f"Cannot access repository for url {response.url}."
                    + "\n"
                    + "Access to this resource is disabled."
                )
                raise _format(DisabledRepoError, message, response) from e
    
            elif error_code == "RepoNotFound" or (
                response.status_code == 401
                and error_message != "Invalid credentials in Authorization header"
                and response.request is not None
                and response.request.url is not None
                and REPO_API_REGEX.search(response.request.url) is not None
            ):
                # 401 is misleading as it is returned for:
                #    - private and gated repos if user is not authenticated
                #    - missing repos
                # => for now, we process them as `RepoNotFound` anyway.
                # See https://gist.github.com/Wauplin/46c27ad266b15998ce56a6603796f0b9
                message = (
                    f"{response.status_code} Client Error."
                    + "\n\n"
                    + f"Repository Not Found for url: {response.url}."
                    + "\nPlease make sure you specified the correct `repo_id` and"
                    " `repo_type`.\nIf you are trying to access a private or gated repo,"
                    " make sure you are authenticated. For more details, see"
                    " https://huggingface..../docs/huggingface_hub/authentication"
                )
                raise _format(RepositoryNotFoundError, message, response) from e
    
            elif response.status_code == 400:
                message = (
                    f"\n\nBad request for {endpoint_name} endpoint:" if endpoint_name is not None else "\n\nBad request:"
                )
                raise _format(BadRequestError, message, response) from e
    
            elif response.status_code == 403:
                message = (
                    f"\n\n{response.status_code} Forbidden: {error_message}."
                    + f"\nCannot access content at: {response.url}."
                    + "\nMake sure your token has the correct permissions."
                )
                raise _format(HfHubHTTPError, message, response) from e
    
            elif response.status_code == 416:
                range_header = response.request.headers.get("Range")
                message = f"{e}. Requested range: {range_header}. Content-Range: {response.headers.get('Content-Range')}."
                raise _format(HfHubHTTPError, message, response) from e
    
            # Convert `HTTPError` into a `HfHubHTTPError` to display request information
            # as well (request id and/or server error message)
>           raise _format(HfHubHTTPError, str(e), response) from e
E           huggingface_hub.errors.HfHubHTTPError: 429 Client Error: Too Many Requests for url: https://huggingface..../resolve/main/config.json

.venv/lib/python3.11.../huggingface_hub/utils/_http.py:482: HfHubHTTPError

The above exception was the direct cause of the following exception:

path_or_repo_id = 'minishlab/potion-retrieval-32M', filenames = ['config.json']
cache_dir = '....../home/runner/.cache/huggingface/hub', force_download = False
resume_download = None, proxies = None, token = None, revision = None
local_files_only = False, subfolder = '', repo_type = None
user_agent = 'transformers/4.51.0; python/3.11.12; session_id/ee20371638994a52ad613f353cdbc8c0; torch/2.6.0; file_type/config; from_auto_class/True'
_raise_exceptions_for_gated_repo = True
_raise_exceptions_for_missing_entries = True
_raise_exceptions_for_connection_errors = True, _commit_hash = None
deprecated_kwargs = {}, use_auth_token = None, full_filenames = ['config.json']
existing_files = [], filename = 'config.json', file_counter = 0

    def cached_files(
        path_or_repo_id: Union[str, os.PathLike],
        filenames: list[str],
        cache_dir: Optional[Union[str, os.PathLike]] = None,
        force_download: bool = False,
        resume_download: Optional[bool] = None,
        proxies: Optional[dict[str, str]] = None,
        token: Optional[Union[bool, str]] = None,
        revision: Optional[str] = None,
        local_files_only: bool = False,
        subfolder: str = "",
        repo_type: Optional[str] = None,
        user_agent: Optional[Union[str, dict[str, str]]] = None,
        _raise_exceptions_for_gated_repo: bool = True,
        _raise_exceptions_for_missing_entries: bool = True,
        _raise_exceptions_for_connection_errors: bool = True,
        _commit_hash: Optional[str] = None,
        **deprecated_kwargs,
    ) -> Optional[str]:
        """
        Tries to locate several files in a local folder and repo, downloads and cache them if necessary.
    
        Args:
            path_or_repo_id (`str` or `os.PathLike`):
                This can be either:
                - a string, the *model id* of a model repo on huggingface.co.
                - a path to a *directory* potentially containing the file.
            filenames (`List[str]`):
                The name of all the files to locate in `path_or_repo`.
            cache_dir (`str` or `os.PathLike`, *optional*):
                Path to a directory in which a downloaded pretrained model configuration should be cached if the standard
                cache should not be used.
            force_download (`bool`, *optional*, defaults to `False`):
                Whether or not to force to (re-)download the configuration files and override the cached versions if they
                exist.
            resume_download:
                Deprecated and ignored. All downloads are now resumed by default when possible.
                Will be removed in v5 of Transformers.
            proxies (`Dict[str, str]`, *optional*):
                A dictionary of proxy servers to use by protocol or endpoint, e.g., `{'http': 'foo.bar:3128',
                'http://hostname': 'foo.bar:4012'}.` The proxies are used on each request.
            token (`str` or *bool*, *optional*):
                The token to use as HTTP bearer authorization for remote files. If `True`, will use the token generated
                when running `huggingface-cli login` (stored in `~/.huggingface`).
            revision (`str`, *optional*, defaults to `"main"`):
                The specific model version to use. It can be a branch name, a tag name, or a commit id, since we use a
                git-based system for storing models and other artifacts on huggingface.co, so `revision` can be any
                identifier allowed by git.
            local_files_only (`bool`, *optional*, defaults to `False`):
                If `True`, will only try to load the tokenizer configuration from local files.
            subfolder (`str`, *optional*, defaults to `""`):
                In case the relevant files are located inside a subfolder of the model repo on huggingface.co, you can
                specify the folder name here.
            repo_type (`str`, *optional*):
                Specify the repo type (useful when downloading from a space for instance).
    
        Private args:
            _raise_exceptions_for_gated_repo (`bool`):
                if False, do not raise an exception for gated repo error but return None.
            _raise_exceptions_for_missing_entries (`bool`):
                if False, do not raise an exception for missing entries but return None.
            _raise_exceptions_for_connection_errors (`bool`):
                if False, do not raise an exception for connection errors but return None.
            _commit_hash (`str`, *optional*):
                passed when we are chaining several calls to various files (e.g. when loading a tokenizer or
                a pipeline). If files are cached for this commit hash, avoid calls to head and get from the cache.
    
        <Tip>
    
        Passing `token=True` is required when you want to use a private model.
    
        </Tip>
    
        Returns:
            `Optional[str]`: Returns the resolved file (to the cache folder if downloaded from a repo).
    
        Examples:
    
        ```python
        # Download a model weight from the Hub and cache it.
        model_weights_file = cached_file("google-bert/bert-base-uncased", "pytorch_model.bin")
        ```
        """
        use_auth_token = deprecated_kwargs.pop("use_auth_token", None)
        if use_auth_token is not None:
            warnings.warn(
                "The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.",
                FutureWarning,
            )
            if token is not None:
                raise ValueError("`token` and `use_auth_token` are both specified. Please set only the argument `token`.")
            token = use_auth_token
    
        if is_offline_mode() and not local_files_only:
            logger.info("Offline mode: forcing local_files_only=True")
            local_files_only = True
        if subfolder is None:
            subfolder = ""
    
        # Add folder to filenames
        full_filenames = [os.path.join(subfolder, file) for file in filenames]
    
        path_or_repo_id = str(path_or_repo_id)
        existing_files = []
        for filename in full_filenames:
            if os.path.isdir(path_or_repo_id):
                resolved_file = os.path.join(path_or_repo_id, filename)
                if not os.path.isfile(resolved_file):
                    if _raise_exceptions_for_missing_entries and filename != os.path.join(subfolder, "config.json"):
                        revision_ = "main" if revision is None else revision
                        raise OSError(
                            f"{path_or_repo_id} does not appear to have a file named {filename}. Checkout "
                            f"'https://huggingface.co/{path_or_repo_id}/tree/{revision_}' for available files."
                        )
                    else:
                        return None
                existing_files.append(resolved_file)
    
        # All files exist
        if len(existing_files) == len(full_filenames):
            return existing_files
    
        if cache_dir is None:
            cache_dir = TRANSFORMERS_CACHE
        if isinstance(cache_dir, Path):
            cache_dir = str(cache_dir)
    
        existing_files = []
        file_counter = 0
        if _commit_hash is not None and not force_download:
            for filename in full_filenames:
                # If the file is cached under that commit hash, we return it directly.
                resolved_file = try_to_load_from_cache(
                    path_or_repo_id, filename, cache_dir=cache_dir, revision=_commit_hash, repo_type=repo_type
                )
                if resolved_file is not None:
                    if resolved_file is not _CACHED_NO_EXIST:
                        file_counter += 1
                        existing_files.append(resolved_file)
                    elif not _raise_exceptions_for_missing_entries:
                        file_counter += 1
                    else:
                        raise OSError(f"Could not locate {filename} inside {path_or_repo_id}.")
    
        # Either all the files were found, or some were _CACHED_NO_EXIST but we do not raise for missing entries
        if file_counter == len(full_filenames):
            return existing_files if len(existing_files) > 0 else None
    
        user_agent = http_user_agent(user_agent)
        # download the files if needed
        try:
            if len(full_filenames) == 1:
                # This is slightly better for only 1 file
>               hf_hub_download(
                    path_or_repo_id,
                    filenames[0],
                    subfolder=None if len(subfolder) == 0 else subfolder,
                    repo_type=repo_type,
                    revision=revision,
                    cache_dir=cache_dir,
                    user_agent=user_agent,
                    force_download=force_download,
                    proxies=proxies,
                    resume_download=resume_download,
                    token=token,
                    local_files_only=local_files_only,
                )

.venv/lib/python3.11.../transformers/utils/hub.py:424: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
.venv/lib/python3.11.../huggingface_hub/utils/_validators.py:114: in _inner_fn
    return fn(*args, **kwargs)
.venv/lib/python3.11...................../site-packages/huggingface_hub/file_download.py:961: in hf_hub_download
    return _hf_hub_download_to_cache_dir(
.venv/lib/python3.11...................../site-packages/huggingface_hub/file_download.py:1068: in _hf_hub_download_to_cache_dir
    _raise_on_head_call_error(head_call_error, force_download, local_files_only)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

head_call_error = HfHubHTTPError('429 Client Error: Too Many Requests for url: https://huggingface..../resolve/main/config.json')
force_download = False, local_files_only = False

    def _raise_on_head_call_error(head_call_error: Exception, force_download: bool, local_files_only: bool) -> NoReturn:
        """Raise an appropriate error when the HEAD call failed and we cannot locate a local file."""
        # No head call => we cannot force download.
        if force_download:
            if local_files_only:
                raise ValueError("Cannot pass 'force_download=True' and 'local_files_only=True' at the same time.")
            elif isinstance(head_call_error, OfflineModeIsEnabled):
                raise ValueError("Cannot pass 'force_download=True' when offline mode is enabled.") from head_call_error
            else:
                raise ValueError("Force download failed due to the above error.") from head_call_error
    
        # No head call + couldn't find an appropriate file on disk => raise an error.
        if local_files_only:
            raise LocalEntryNotFoundError(
                "Cannot find the requested files in the disk cache and outgoing traffic has been disabled. To enable"
                " hf.co look-ups and downloads online, set 'local_files_only' to False."
            )
        elif isinstance(head_call_error, (RepositoryNotFoundError, GatedRepoError)) or (
            isinstance(head_call_error, HfHubHTTPError) and head_call_error.response.status_code == 401
        ):
            # Repo not found or gated => let's raise the actual error
            # Unauthorized => likely a token issue => let's raise the actual error
            raise head_call_error
        else:
            # Otherwise: most likely a connection issue or Hub downtime => let's warn the user
>           raise LocalEntryNotFoundError(
                "An error happened while trying to locate the file on the Hub and we cannot find the requested files"
                " in the local cache. Please check your connection and try again or make sure your Internet connection"
                " is on."
            ) from head_call_error
E           huggingface_hub.errors.LocalEntryNotFoundError: An error happened while trying to locate the file on the Hub and we cannot find the requested files in the local cache. Please check your connection and try again or make sure your Internet connection is on.

.venv/lib/python3.11...................../site-packages/huggingface_hub/file_download.py:1599: LocalEntryNotFoundError

The above exception was the direct cause of the following exception:

cls = <class 'chonkie.embeddings.auto.AutoEmbeddings'>
model = 'minishlab/potion-retrieval-32M', kwargs = {}
embeddings_instance = None, embeddings_cls = None
SentenceTransformerEmbeddings = <class 'chonkie.embeddings.sentence_transformer.SentenceTransformerEmbeddings'>

    @classmethod
    def get_embeddings(cls, model: Union[str, BaseEmbeddings, Any], **kwargs: Any) -> BaseEmbeddings:
        """Get embeddings instance based on identifier.
    
        Args:
            model: Identifier for the embeddings (name, path, URL, etc.)
            **kwargs: Additional arguments passed to the embeddings constructor
    
        Returns:
            Initialized embeddings instance
    
        Raises:
            ValueError: If no suitable embeddings implementation is found
    
        Examples:
            # Get sentence transformers embeddings
            embeddings = AutoEmbeddings.get_embeddings("sentence-transformers/all-MiniLM-L6-v2")
    
            # Get OpenAI embeddings
            embeddings = AutoEmbeddings.get_embeddings("openai://text-embedding-ada-002", api_key="...")
    
            # Get Anthropic embeddings
            embeddings = AutoEmbeddings.get_embeddings("anthropic://claude-v1", api_key="...")
    
            # Get Cohere embeddings
            embeddings = AutoEmbeddings.get_embeddings("cohere://embed-english-light-v3.0", api_key="...")
    
        """
        # Load embeddings instance if already provided
        if isinstance(model, BaseEmbeddings):
            return model
        elif isinstance(model, str):
            # Initializing the embedding instance
            embeddings_instance = None
    
            # Check if the user passed in a provider alias
            if "://" in model:
                provider, model_name = model.split("://")
                embeddings_cls = EmbeddingsRegistry.get_provider(provider)
                if embeddings_cls:
                    try:
                        return embeddings_cls(model_name, **kwargs)  # type: ignore
                    except Exception as error:
                        raise ValueError(f"Failed to load {model} with {embeddings_cls.__name__}, with error: {error}")
                else:
                    raise ValueError(f"No provider found for {provider}. Please check the provider name and try again.")
            else:
                # Try to find matching implementation via registry
                embeddings_cls = EmbeddingsRegistry.match(model)
                if embeddings_cls:
                        try:
                            # Try instantiating with the model identifier
                            embeddings_instance = embeddings_cls(model, **kwargs)  # type: ignore
                        except Exception as error:
                            warnings.warn(
                                f"Failed to load {model} with {embeddings_cls.__name__}: {error}\n"
                                f"Falling back to loading default provider model."
                            )
                            try:
                                # Try instantiating with the default provider model without the model identifier
                                embeddings_instance = embeddings_cls(**kwargs)
                            except Exception as error:
                                warnings.warn(
                                    f"Failed to load the default model for {embeddings_cls.__name__}: {error}\n"
                                    f"Falling back to SentenceTransformerEmbeddings."
                                )
    
            # If registry lookup and instantiation succeeded, return the instance
            if embeddings_instance:
                return embeddings_instance
    
            # If registry lookup and instantiation failed, return the default SentenceTransformerEmbeddings
            from .sentence_transformer import SentenceTransformerEmbeddings
            try:
>               return SentenceTransformerEmbeddings(model, **kwargs)

.../chonkie/embeddings/auto.py:107: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
.../chonkie/embeddings/sentence_transformer.py:49: in __init__
    self.model = SentenceTransformer(self.model_name_or_path, **kwargs)
.venv/lib/python3.11....../site-packages/sentence_transformers/SentenceTransformer.py:321: in __init__
    modules = self._load_auto_model(
.venv/lib/python3.11....../site-packages/sentence_transformers/SentenceTransformer.py:1600: in _load_auto_model
    transformer_model = Transformer(
.venv/lib/python3.11.../sentence_transformers/models/Transformer.py:80: in __init__
    config, is_peft_model = self._load_config(model_name_or_path, cache_dir, backend, config_args)
.venv/lib/python3.11.../sentence_transformers/models/Transformer.py:145: in _load_config
    return AutoConfig.from_pretrained(model_name_or_path, **config_args, cache_dir=cache_dir), False
.venv/lib/python3.11.../models/auto/configuration_auto.py:1112: in from_pretrained
    config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
.venv/lib/python3.11....../site-packages/transformers/configuration_utils.py:590: in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
.venv/lib/python3.11....../site-packages/transformers/configuration_utils.py:649: in _get_config_dict
    resolved_config_file = cached_file(
.venv/lib/python3.11.../transformers/utils/hub.py:266: in cached_file
    file = cached_files(path_or_repo_id=path_or_repo_id, filenames=[filename], **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

path_or_repo_id = 'minishlab/potion-retrieval-32M', filenames = ['config.json']
cache_dir = '....../home/runner/.cache/huggingface/hub', force_download = False
resume_download = None, proxies = None, token = None, revision = None
local_files_only = False, subfolder = '', repo_type = None
user_agent = 'transformers/4.51.0; python/3.11.12; session_id/ee20371638994a52ad613f353cdbc8c0; torch/2.6.0; file_type/config; from_auto_class/True'
_raise_exceptions_for_gated_repo = True
_raise_exceptions_for_missing_entries = True
_raise_exceptions_for_connection_errors = True, _commit_hash = None
deprecated_kwargs = {}, use_auth_token = None, full_filenames = ['config.json']
existing_files = [], filename = 'config.json', file_counter = 0

    def cached_files(
        path_or_repo_id: Union[str, os.PathLike],
        filenames: list[str],
        cache_dir: Optional[Union[str, os.PathLike]] = None,
        force_download: bool = False,
        resume_download: Optional[bool] = None,
        proxies: Optional[dict[str, str]] = None,
        token: Optional[Union[bool, str]] = None,
        revision: Optional[str] = None,
        local_files_only: bool = False,
        subfolder: str = "",
        repo_type: Optional[str] = None,
        user_agent: Optional[Union[str, dict[str, str]]] = None,
        _raise_exceptions_for_gated_repo: bool = True,
        _raise_exceptions_for_missing_entries: bool = True,
        _raise_exceptions_for_connection_errors: bool = True,
        _commit_hash: Optional[str] = None,
        **deprecated_kwargs,
    ) -> Optional[str]:
        """
        Tries to locate several files in a local folder and repo, downloads and cache them if necessary.
    
        Args:
            path_or_repo_id (`str` or `os.PathLike`):
                This can be either:
                - a string, the *model id* of a model repo on huggingface.co.
                - a path to a *directory* potentially containing the file.
            filenames (`List[str]`):
                The name of all the files to locate in `path_or_repo`.
            cache_dir (`str` or `os.PathLike`, *optional*):
                Path to a directory in which a downloaded pretrained model configuration should be cached if the standard
                cache should not be used.
            force_download (`bool`, *optional*, defaults to `False`):
                Whether or not to force to (re-)download the configuration files and override the cached versions if they
                exist.
            resume_download:
                Deprecated and ignored. All downloads are now resumed by default when possible.
                Will be removed in v5 of Transformers.
            proxies (`Dict[str, str]`, *optional*):
                A dictionary of proxy servers to use by protocol or endpoint, e.g., `{'http': 'foo.bar:3128',
                'http://hostname': 'foo.bar:4012'}.` The proxies are used on each request.
            token (`str` or *bool*, *optional*):
                The token to use as HTTP bearer authorization for remote files. If `True`, will use the token generated
                when running `huggingface-cli login` (stored in `~/.huggingface`).
            revision (`str`, *optional*, defaults to `"main"`):
                The specific model version to use. It can be a branch name, a tag name, or a commit id, since we use a
                git-based system for storing models and other artifacts on huggingface.co, so `revision` can be any
                identifier allowed by git.
            local_files_only (`bool`, *optional*, defaults to `False`):
                If `True`, will only try to load the tokenizer configuration from local files.
            subfolder (`str`, *optional*, defaults to `""`):
                In case the relevant files are located inside a subfolder of the model repo on huggingface.co, you can
                specify the folder name here.
            repo_type (`str`, *optional*):
                Specify the repo type (useful when downloading from a space for instance).
    
        Private args:
            _raise_exceptions_for_gated_repo (`bool`):
                if False, do not raise an exception for gated repo error but return None.
            _raise_exceptions_for_missing_entries (`bool`):
                if False, do not raise an exception for missing entries but return None.
            _raise_exceptions_for_connection_errors (`bool`):
                if False, do not raise an exception for connection errors but return None.
            _commit_hash (`str`, *optional*):
                passed when we are chaining several calls to various files (e.g. when loading a tokenizer or
                a pipeline). If files are cached for this commit hash, avoid calls to head and get from the cache.
    
        <Tip>
    
        Passing `token=True` is required when you want to use a private model.
    
        </Tip>
    
        Returns:
            `Optional[str]`: Returns the resolved file (to the cache folder if downloaded from a repo).
    
        Examples:
    
        ```python
        # Download a model weight from the Hub and cache it.
        model_weights_file = cached_file("google-bert/bert-base-uncased", "pytorch_model.bin")
        ```
        """
        use_auth_token = deprecated_kwargs.pop("use_auth_token", None)
        if use_auth_token is not None:
            warnings.warn(
                "The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.",
                FutureWarning,
            )
            if token is not None:
                raise ValueError("`token` and `use_auth_token` are both specified. Please set only the argument `token`.")
            token = use_auth_token
    
        if is_offline_mode() and not local_files_only:
            logger.info("Offline mode: forcing local_files_only=True")
            local_files_only = True
        if subfolder is None:
            subfolder = ""
    
        # Add folder to filenames
        full_filenames = [os.path.join(subfolder, file) for file in filenames]
    
        path_or_repo_id = str(path_or_repo_id)
        existing_files = []
        for filename in full_filenames:
            if os.path.isdir(path_or_repo_id):
                resolved_file = os.path.join(path_or_repo_id, filename)
                if not os.path.isfile(resolved_file):
                    if _raise_exceptions_for_missing_entries and filename != os.path.join(subfolder, "config.json"):
                        revision_ = "main" if revision is None else revision
                        raise OSError(
                            f"{path_or_repo_id} does not appear to have a file named {filename}. Checkout "
                            f"'https://huggingface.co/{path_or_repo_id}/tree/{revision_}' for available files."
                        )
                    else:
                        return None
                existing_files.append(resolved_file)
    
        # All files exist
        if len(existing_files) == len(full_filenames):
            return existing_files
    
        if cache_dir is None:
            cache_dir = TRANSFORMERS_CACHE
        if isinstance(cache_dir, Path):
            cache_dir = str(cache_dir)
    
        existing_files = []
        file_counter = 0
        if _commit_hash is not None and not force_download:
            for filename in full_filenames:
                # If the file is cached under that commit hash, we return it directly.
                resolved_file = try_to_load_from_cache(
                    path_or_repo_id, filename, cache_dir=cache_dir, revision=_commit_hash, repo_type=repo_type
                )
                if resolved_file is not None:
                    if resolved_file is not _CACHED_NO_EXIST:
                        file_counter += 1
                        existing_files.append(resolved_file)
                    elif not _raise_exceptions_for_missing_entries:
                        file_counter += 1
                    else:
                        raise OSError(f"Could not locate {filename} inside {path_or_repo_id}.")
    
        # Either all the files were found, or some were _CACHED_NO_EXIST but we do not raise for missing entries
        if file_counter == len(full_filenames):
            return existing_files if len(existing_files) > 0 else None
    
        user_agent = http_user_agent(user_agent)
        # download the files if needed
        try:
            if len(full_filenames) == 1:
                # This is slightly better for only 1 file
                hf_hub_download(
                    path_or_repo_id,
                    filenames[0],
                    subfolder=None if len(subfolder) == 0 else subfolder,
                    repo_type=repo_type,
                    revision=revision,
                    cache_dir=cache_dir,
                    user_agent=user_agent,
                    force_download=force_download,
                    proxies=proxies,
                    resume_download=resume_download,
                    token=token,
                    local_files_only=local_files_only,
                )
            else:
                snapshot_download(
                    path_or_repo_id,
                    allow_patterns=full_filenames,
                    repo_type=repo_type,
                    revision=revision,
                    cache_dir=cache_dir,
                    user_agent=user_agent,
                    force_download=force_download,
                    proxies=proxies,
                    resume_download=resume_download,
                    token=token,
                    local_files_only=local_files_only,
                )
    
        except Exception as e:
            # We cannot recover from them
            if isinstance(e, RepositoryNotFoundError) and not isinstance(e, GatedRepoError):
                raise OSError(
                    f"{path_or_repo_id} is not a local folder and is not a valid model identifier "
                    "listed on 'https://huggingface.co/models'\nIf this is a private repository, make sure to pass a token "
                    "having permission to this repo either by logging in with `huggingface-cli login` or by passing "
                    "`token=<your_token>`"
                ) from e
            elif isinstance(e, RevisionNotFoundError):
                raise OSError(
                    f"{revision} is not a valid git identifier (branch name, tag name or commit id) that exists "
                    "for this model name. Check the model page at "
                    f"'https://huggingface.co/{path_or_repo_id}' for available revisions."
                ) from e
    
            # Now we try to recover if we can find all files correctly in the cache
            resolved_files = [
                _get_cache_file_to_return(path_or_repo_id, filename, cache_dir, revision) for filename in full_filenames
            ]
            if all(file is not None for file in resolved_files):
                return resolved_files
    
            # Raise based on the flags. Note that we will raise for missing entries at the very end, even when
            # not entering this Except block, as it may also happen when `snapshot_download` does not raise
            if isinstance(e, GatedRepoError):
                if not _raise_exceptions_for_gated_repo:
                    return None
                raise OSError(
                    "You are trying to access a gated repo.\nMake sure to have access to it at "
                    f"https://huggingface.co/{path_or_repo_id}.\n{str(e)}"
                ) from e
            elif isinstance(e, LocalEntryNotFoundError):
                if not _raise_exceptions_for_connection_errors:
                    return None
                # Here we only raise if both flags for missing entry and connection errors are True (because it can be raised
                # even when `local_files_only` is True, in which case raising for connections errors only would not make sense)
                elif _raise_exceptions_for_missing_entries:
>                   raise OSError(
                        f"We couldn't connect to '{HUGGINGFACE_CO_RESOLVE_ENDPOINT}' to load the files, and couldn't find them in the"
                        f" cached files.\nCheckout your internet connection or see how to run the library in offline mode at"
                        " 'https://huggingface..../docs/transformers/installation#offline-mode'."
                    ) from e
E                   OSError: We couldn't connect to 'https://huggingface.co' to load the files, and couldn't find them in the cached files.
E                   Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface..../docs/transformers/installation#offline-mode'.

.venv/lib/python3.11.../transformers/utils/hub.py:491: OSError

During handling of the above exception, another exception occurred:

    @pytest.fixture(scope="module")
    def real_embeddings() -> BaseEmbeddings:
        """Provide an instance of the actual default embedding model."""
        # Use scope="module" to load the model only once per test module run
        # Set environment variable to potentially avoid Hugging Face Hub login prompts in some CI environments
        os.environ["HF_HUB_DISABLE_PROGRESS_BARS"] = "1"
>       return AutoEmbeddings.get_embeddings(DEFAULT_EMBEDDING_MODEL)

tests/handshakes/test_qdrant_handshake.py:27: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

cls = <class 'chonkie.embeddings.auto.AutoEmbeddings'>
model = 'minishlab/potion-retrieval-32M', kwargs = {}
embeddings_instance = None, embeddings_cls = None
SentenceTransformerEmbeddings = <class 'chonkie.embeddings.sentence_transformer.SentenceTransformerEmbeddings'>

    @classmethod
    def get_embeddings(cls, model: Union[str, BaseEmbeddings, Any], **kwargs: Any) -> BaseEmbeddings:
        """Get embeddings instance based on identifier.
    
        Args:
            model: Identifier for the embeddings (name, path, URL, etc.)
            **kwargs: Additional arguments passed to the embeddings constructor
    
        Returns:
            Initialized embeddings instance
    
        Raises:
            ValueError: If no suitable embeddings implementation is found
    
        Examples:
            # Get sentence transformers embeddings
            embeddings = AutoEmbeddings.get_embeddings("sentence-transformers/all-MiniLM-L6-v2")
    
            # Get OpenAI embeddings
            embeddings = AutoEmbeddings.get_embeddings("openai://text-embedding-ada-002", api_key="...")
    
            # Get Anthropic embeddings
            embeddings = AutoEmbeddings.get_embeddings("anthropic://claude-v1", api_key="...")
    
            # Get Cohere embeddings
            embeddings = AutoEmbeddings.get_embeddings("cohere://embed-english-light-v3.0", api_key="...")
    
        """
        # Load embeddings instance if already provided
        if isinstance(model, BaseEmbeddings):
            return model
        elif isinstance(model, str):
            # Initializing the embedding instance
            embeddings_instance = None
    
            # Check if the user passed in a provider alias
            if "://" in model:
                provider, model_name = model.split("://")
                embeddings_cls = EmbeddingsRegistry.get_provider(provider)
                if embeddings_cls:
                    try:
                        return embeddings_cls(model_name, **kwargs)  # type: ignore
                    except Exception as error:
                        raise ValueError(f"Failed to load {model} with {embeddings_cls.__name__}, with error: {error}")
                else:
                    raise ValueError(f"No provider found for {provider}. Please check the provider name and try again.")
            else:
                # Try to find matching implementation via registry
                embeddings_cls = EmbeddingsRegistry.match(model)
                if embeddings_cls:
                        try:
                            # Try instantiating with the model identifier
                            embeddings_instance = embeddings_cls(model, **kwargs)  # type: ignore
                        except Exception as error:
                            warnings.warn(
                                f"Failed to load {model} with {embeddings_cls.__name__}: {error}\n"
                                f"Falling back to loading default provider model."
                            )
                            try:
                                # Try instantiating with the default provider model without the model identifier
                                embeddings_instance = embeddings_cls(**kwargs)
                            except Exception as error:
                                warnings.warn(
                                    f"Failed to load the default model for {embeddings_cls.__name__}: {error}\n"
                                    f"Falling back to SentenceTransformerEmbeddings."
                                )
    
            # If registry lookup and instantiation succeeded, return the instance
            if embeddings_instance:
                return embeddings_instance
    
            # If registry lookup and instantiation failed, return the default SentenceTransformerEmbeddings
            from .sentence_transformer import SentenceTransformerEmbeddings
            try:
                return SentenceTransformerEmbeddings(model, **kwargs)
            except Exception as e:
>               raise ValueError(f"Failed to load embeddings via SentenceTransformerEmbeddings after registry/fallback failure: {e}")
E               ValueError: Failed to load embeddings via SentenceTransformerEmbeddings after registry/fallback failure: We couldn't connect to 'https://huggingface.co' to load the files, and couldn't find them in the cached files.
E               Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface..../docs/transformers/installation#offline-mode'.

.../chonkie/embeddings/auto.py:109: ValueError
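
The traceback above shows the module-scoped `real_embeddings` fixture failing because the default model could not be downloaded or found in the cache, which then fails every test that depends on it. One possible mitigation, sketched below, is to turn a load failure into a skip. `load_or_skip` is a hypothetical helper for illustration only; it is not part of chonkie or pytest, and the exception types it catches are an assumption based on the `ValueError` and `OSError` seen in this log.

```python
import pytest


def load_or_skip(loader, model_name):
    """Call loader(model_name); skip the running test if loading fails.

    Hypothetical helper: catches the ValueError/OSError seen in the log above
    and converts them into a pytest skip instead of an error.
    """
    try:
        return loader(model_name)
    except (ValueError, OSError) as error:
        pytest.skip(f"Could not load {model_name}: {error}")
```

Under this assumption, the fixture body would become `return load_or_skip(AutoEmbeddings.get_embeddings, DEFAULT_EMBEDDING_MODEL)`, so an offline CI runner reports skipped tests rather than errors.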

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

- Introduced a new test suite for the JSONPorter class, covering initialization, export to JSON and JSONL formats, and handling of empty chunk lists.
- Validated chunk serialization, context inclusion, and indentation in exported files.
- Implemented tests for large chunk lists and Unicode content handling.
- Ensured proper error handling for file permission issues and support for Path objects.

… annotations
- Added return type annotations to all test functions for improved clarity and type checking.
- Updated the temp_dir fixture to specify its return type as a Generator.

- Restructured test cases into classes for better organization and clarity.
- Added tests for Model2Vec and SentenceTransformer embeddings, including actual embedding generation.
- Implemented provider prefix tests for OpenAI, Cohere, VoyageAI, and Jina embeddings.
- Enhanced error handling tests for invalid provider prefixes and model identifiers.
- Included tests for handling existing embeddings instances and custom embeddings objects.

- Restructured test cases into classes for improved organization and clarity.
- Added tests for initialization with explicit and environment API keys, including error handling for missing keys.
- Implemented tests for custom model initialization and tokenizer handling.
- Enhanced tests for embedding methods, including single and batch embeddings with mocked API responses.
- Validated similarity calculations and error handling for various edge cases.

- Restructured test cases into classes for improved organization and clarity.
- Added tests for initialization with default and custom models, including error handling for invalid models and missing API keys.
- Enhanced tests for embedding methods, including synchronous and asynchronous embedding with mocked API responses.
- Implemented tests for token counting, dimension properties, and similarity checks between embeddings.
- Validated handling of edge cases and error scenarios, including empty inputs and API errors.

chonknick added 2 commits May 22, 2025 17:58

Enhance CloudChunker module by adding CodeChunker

73d0588

- Introduced CodeChunker to the cloud chunker module.
- Updated __init__.py files to include CodeChunker in the exports for both the cloud and chunker modules.

Merge branch 'main' into development

8fd99e9

gemini-code-assist bot reviewed May 23, 2025

greptile-apps bot reviewed May 23, 2025

Fix: tokenizer_or_token_counter should be just str type

1168dcf

gemini-code-assist bot suggested changes May 23, 2025

chonknick and others added 6 commits May 23, 2025 06:36

Update tests/cloud/test_cloud_code_chunker.py

50c72a3

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Refactor test functions in test_json_porter.py to include return type…

e6fd513

… annotations
- Added return type annotations to all test functions for improved clarity and type checking.
- Updated the temp_dir fixture to specify its return type as a Generator.

chonknick merged commit f1623e5 into main May 23, 2025
0 of 4 checks passed

	def chunk(self, text: Union[str, List[str]]) -> List[Dict]:
	def chunk(self, text: Union[str, List[str]]) -> Union[List[Dict], List[List[Dict]]]:
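
The widened return annotation above reflects that a single string yields one list of chunk dictionaries, while a list of strings yields one such list per input. A minimal sketch of that shape follows; the function body is a hypothetical stand-in that splits on lines locally and makes no API call, and the `text` and `index` keys are assumptions, so it illustrates only the typing, not the actual `CodeChunker` logic.

```python
from typing import Dict, List, Union


def chunk(text: Union[str, List[str]]) -> Union[List[Dict], List[List[Dict]]]:
    """Hypothetical stand-in: one chunk per non-empty line, no network access."""

    def _chunk_one(doc: str) -> List[Dict]:
        # Drop blank lines first so that indices stay contiguous.
        lines = [line for line in doc.splitlines() if line.strip()]
        return [{"text": line, "index": i} for i, line in enumerate(lines)]

    if isinstance(text, str):
        # Single input: a flat list of chunk dicts.
        return _chunk_one(text)
    # Batch input: one list of chunk dicts per document.
    return [_chunk_one(doc) for doc in text]
```

Callers that accept both shapes can branch on the type of the first element, or simply always pass a list for batch use.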

		except Exception as error:
			raise ValueError(f"Error parsing the response: {error}") from error
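
The `from error` clause in the suggestion above chains the original exception, so it stays available as `__cause__` on the new `ValueError` and appears in the traceback. A small self-contained sketch of the pattern is shown here; `parse_chunks` and its JSON payload format are hypothetical and stand in for whatever response parsing the chunker performs.

```python
import json
from typing import Dict, List


def parse_chunks(payload: str) -> List[Dict]:
    """Parse a JSON array of chunk objects (hypothetical response format)."""
    try:
        data = json.loads(payload)
        return [dict(item) for item in data]
    except Exception as error:
        # Chain the original error so the root cause is not lost.
        raise ValueError(f"Error parsing the response: {error}") from error
```

Without `from error`, Python would still attach the original exception as `__context__`, but the explicit chain marks it as the direct cause.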

❌ 276 Tests Failed:
