Fix pickling issue in BaseTokenizer by chonknick · Pull Request #153 · chonkie-inc/chonkie · GitHub

Fix pickling issue in BaseTokenizer #153


Merged

merged 1 commit into main on May 22, 2025

Conversation

chonknick
Contributor

Fixes #56

This PR fixes the pickling issue in BaseTokenizer by using a named method instead of a lambda function as the defaultdict factory. A lambda defined inside __init__ is a local object that the standard pickle module cannot serialize, which caused the error Can't pickle local object 'BaseTokenizer.__init__.<locals>.<lambda>' when running batching via RecursiveChunker.
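For context, here is a minimal sketch of the pattern the fix describes, assuming the attribute and method names (vocab, token2id, defaulttoken2id) mentioned in the review summaries below; the actual code in src/chonkie/tokenizer.py may differ:

```python
import pickle
from collections import defaultdict


class BaseTokenizer:
    """Sketch of the pickling-friendly pattern; names are assumed, not copied from the repo."""

    def __init__(self):
        self.vocab = []
        # Using defaultdict(lambda: len(self.vocab)) here would store a local
        # object that the standard pickle module refuses to serialize.
        self.token2id = defaultdict(self.defaulttoken2id)

    def defaulttoken2id(self):
        """Default factory for token2id; a bound method pickles cleanly."""
        return len(self.vocab)


tokenizer = BaseTokenizer()
tokenizer.token2id["hello"]                       # triggers the default factory
restored = pickle.loads(pickle.dumps(tokenizer))  # round-trips without the error above
```

The key point is that a bound method is looked up by class and attribute name when unpickling, whereas a lambda created inside __init__ has no importable qualified name.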

Contributor

@gemini-code-assist (bot) left a comment

Hello @chonknick, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

As gemini-code-assist, I've reviewed this pull request from chonknick. It addresses issue #56 by adding documentation to the BaseTokenizer class in src/chonkie/tokenizer.py. The PR clarifies an existing implementation detail: using a named method (defaulttoken2id) instead of a lambda function as the default_factory for the token2id defaultdict. This structure was already present, but the added comments and updated docstring explain why this is necessary: lambda functions defined in a local scope cannot be pickled, which previously caused errors when using features like RecursiveChunker.

Highlights

  • Documentation Update: Adds comments and docstring updates to the BaseTokenizer class to explain existing code structure.
  • Pickling Explanation: Clarifies why using a named method for the defaultdict factory is crucial to avoid Python pickling errors.
  • BaseTokenizer Class: Changes are focused on the __init__ and defaulttoken2id methods within this class.

Changelog

  • src/chonkie/tokenizer.py
    • Added a comment in the __init__ method explaining that using a lambda for the defaultdict factory would cause pickling issues (around line 36).
    • Updated the docstring for the defaulttoken2id method to explicitly state its purpose as a default_factory and its role in ensuring the object can be pickled (around lines 42-45).
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.


Lambda's quick and neat,
But pickle finds its capture sweet,
Use a named def,
Avoid pickling death,
Make your objects complete.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@greptile-apps (bot) left a comment

PR Summary

This PR fixes a pickling issue in BaseTokenizer by replacing a lambda function with a named method for defaultdict's factory function, enabling proper serialization during batch processing.

  • Replaced the lambda lambda: len(self.vocab) with the named method defaulttoken2id() in src/chonkie/tokenizer.py (see the sketch after this list)
  • Added docstring to explain the pickling compatibility rationale
  • Maintains identical functionality while enabling proper serialization
  • Critical fix for RecursiveChunker batch processing functionality
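As a hedged illustration of the failure mode these bullets describe (the class below is a stand-in, not the repository's code), pickling an instance that still uses the lambda-based factory fails with the same kind of error reported in issue #56:

```python
import pickle
from collections import defaultdict


class LambdaTokenizer:
    """Stand-in for the pre-fix pattern: a lambda as the defaultdict factory."""

    def __init__(self):
        self.vocab = []
        self.token2id = defaultdict(lambda: len(self.vocab))  # local object, not picklable


try:
    pickle.dumps(LambdaTokenizer())
except (AttributeError, pickle.PicklingError) as exc:
    # e.g. Can't pickle local object 'LambdaTokenizer.__init__.<locals>.<lambda>'
    print(exc)
```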

💡 (5/5) You can turn off certain types of comments like style here!

1 file(s) reviewed, no comment(s)

Contributor

@gemini-code-assist (bot) left a comment

Code Review

This pull request effectively addresses the pickling issue in BaseTokenizer by replacing the lambda function with a named method for the defaultdict factory. The change is clear, concise, and well-explained both in the PR description and through the added code comments and docstring. This is a solid fix that improves the robustness of the tokenizer. Great job!

Summary of Findings

  • Docstring Clarity for defaulttoken2id: The docstring for defaulttoken2id is helpful in explaining why a named method is used. For even greater clarity, the first line could be slightly more descriptive of its specific function within the defaultdict context. For instance, instead of "Return the default token ID.", something like "Generate a new token ID for an unknown token." or "Provides the default factory logic for token2id, returning the next available ID." could be considered. This is a minor stylistic suggestion and was not added as a direct review comment due to the configured severity threshold for comments.
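For illustration only, the suggested wording might look like this within BaseTokenizer (a sketch, not the final docstring in the repository):

```python
class BaseTokenizer:
    # ... other attributes and methods elided ...

    def defaulttoken2id(self):
        """Provide the default factory for token2id, returning the next available ID.

        A named method is used instead of a lambda so that BaseTokenizer stays picklable.
        """
        return len(self.vocab)
```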

Merge Readiness

The changes in this pull request are well-implemented and directly resolve the described pickling issue. The code is clear and the accompanying explanations are excellent. I believe this PR is in good shape and ready for merging. As always, ensure any CI checks pass and consider if further review by others is standard practice for your team before merging. I am not authorized to approve pull requests.

@chonknick merged commit c43b2f2 into main on May 22, 2025
5 checks passed
@chonknick deleted the fix-pickle-issue-56 branch on May 23, 2025 at 20:36
Development

Successfully merging this pull request may close these issues:

  • Bug: Can't pickle local object 'BaseTokenizer.__init__.<locals>.<lambda>'