Releases · chonkie-inc/chonkie

@chonknick

What's Changed

Fix: Add mypy stubs, py.typed and fix any remaining typing issues from mypy by @chonknick in #186
Bump up the version to v1.0.10 by @chonknick in #187

Full Changelog: v1.0.9...v1.0.10

@chonknick

What's Changed

Fix pickling issue in BaseTokenizer by @chonknick in #153
Improve test coverage for OverlapRefinery by @chonknick in #154
Feat: Enhance the tests for OverlapRefinery, Tokenizers and SlumberChunker by @chonknick in #155
Feat: Enhance the tests for SPDMChunker, SemanticChunker, NeuralChunker and Visualizer by @chonknick in #156
Feat: Add cloud CodeChunker + tests by @chonknick in #157
Fix: CodeChunker fails multiprocessing due to pickling issues; Run it sequentially by @chonknick in #158
Add GeminiEmbeddings Support by @chonknick in #159
Fix: OverlapRefinery with mode="recursive" fails with level gets too low by @chonknick in #161
Fix: OverlapRefinery is caching too much; revert context_size calculation caching by @chonknick in #162
Feat: Create Cython functions for split and merge basic ops for chunking! by @chonknick in #163
Feat: Add CPython optimized methods for split and merge ops for performance boosts! by @chonknick in #164
Migrate Docs from chonkie-docs repo to chonkie for easier management + maintainance by @chonknick in #149
Migrate DOCS.md by @chonknick in #165
Update DOCS by @chonknick in #166
Update DOCS.md + Add example for JSONPorter by @chonknick in #167
Feat: Add support for PsycopgHandshake by @chonknick in #171
Feat: Add PsycopgHandshake for initial support of pgvector by @chonknick in #172
Update semantic.py to fix check if any of the split token counts are greater than the max chunk size by @geosmart in #175
Fix: Refactor PyscopgHandshake to PgvectorHandshake via vecs by @chonknick in #174
Update DOCS.md + Fix SemanticChunker bug for all comparison by @chonknick in #176
Update version to 1.0.9 in pyproject.toml for the next release. by @chonknick in #177
Feat: Add initial support for experimental.CodeChunker — improved code chunking for few languages by @chonknick in #178
Fix: cython module build failing during CD — Add cython build info to pyproject.toml by @chonknick in #180
Fix: Use cibuildwheel to build and publish wheels to PyPI by @chonknick in #182
Fix: Update the CD script to use uv instead in the hopes of finally publishing the v1.0.9 by @chonknick in #183

New Contributors

@geosmart made their first contribution in #175

Full Changelog: v1.0.8...v1.0.9

@chonknick

✨ Highlights

Use base_url with OpenAIEmbeddings to use OpenAI API compatible embedding services!
You can now give AutoEmbeddings a URI string with a provider alias to choose between providers for a model simply and easily. Just do AutoEmbeddings.get_embedding("model2vec://minishlab/potion-base-8M") or, equivalently for the sentence_transformers version, AutoEmbeddings.get_embedding("st://minishlab/potion-base-8M"). This works with all the supported embeddings in Chonkie. As Chonkie grows to support more providers, this notation helps you choose between them easily.
Added full support for the chonkie.cloud chunkers with updated support for NeuralChunker and SlumberChunker.
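
The provider://organization/model notation from the highlights above can be pictured with a small parser. This is a minimal sketch using only the standard library, not Chonkie's actual matching logic; the helper name split_embedding_uri and the returned tuple shape are assumptions made for illustration.

```python
from typing import Tuple


def split_embedding_uri(uri: str) -> Tuple[str, str]:
    """Split 'provider://organization/model' into (provider, model_id).

    Hypothetical helper for illustration; Chonkie's real logic may differ.
    """
    provider, sep, model_id = uri.partition("://")
    if not sep or not provider or not model_id:
        raise ValueError(f"Expected 'provider://organization/model', got {uri!r}")
    return provider, model_id


# The two aliases from the release notes name the same model via different providers.
print(split_embedding_uri("model2vec://minishlab/potion-base-8M"))
print(split_embedding_uri("st://minishlab/potion-base-8M"))
```

Keeping the provider alias separate from the model identifier is what lets the same model name be routed to different backends.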

What's Changed

Tutorial: Add example to show SlumberChunker with OpenRouter models by @chonknick in #103
Tutorial: fix the tutorial to remove the Invalid Notebook error by @chonknick in #104
Tutorial: Update the readability of the tutorial with better markdown by @chonknick in #105
Feat: update workflow by @not-lain in #94
Feat: Enhance the CI/CD Pipeline to have linting run in parallel by @chonknick in #106
Fix: Linting errors + Add cloud.SDPMChunker + cloud.LateChunker by @chonknick in #107
Feat: Add the base_url to the OpenAIEmbeddings and **kwargs support by @chonknick in #108
Add RAGHub by @chonknick in #111
Fix: Better AutoEmbeddings model matching + URI support for providers by @chonknick in #146
Feat: Add URI support via provider://organization/model style strings in AutoEmbeddings by @chonknick in #148
Enhance chunker module by adding NeuralChunker and SlumberChunker to the imports and all list for improved functionality. by @chonknick in #150
Minor: fix error messages in cloud.NeuralChunker by @chonknick in #152
Fix: Add NeuralChunker and SlumberChunker to the chonkie.cloud by @chonknick in #151

Full Changelog: v1.0.7...v1.0.8

@Pratik960

✨ Highlights

Added initial support for Handshakes and Porters with 1 new Porter (JSONPorter) as well as 3 new Handshakes (ChromaHandshake, QdrantHandshake and TurbopufferHandshake)! Extremely simple to use: just import, init and call on the List[Chunk] that you get out of your chunkers/refineries, then query them for search!
Added support for OpenAIGenie which allows you to use the OpenAI models with the SlumberChunker and many more! Any API that supports the OpenAI API format will be usable with the base_url changed. Try out various LLMs with the SlumberChunker to see which works best for you!
Added support for VoyageAIEmbeddings (thanks to @Pratik960): Use VoyageAI models with EmbeddingsRefinery and SemanticChunkers for your ingestion~
Added new themes to the Visualizer: Dark mode and a retro tiktokenizer theme (thanks to @Udayk02)
Fixes and performance improvements in the NeuralChunker
New one-page markdown DOCS.md for easy answers about Chonkie with LLMs.

What's Changed

Feat: Add tiktokenizer theme to the Visualizer by @chonknick in #81
Feat: Added support for VoyageAI embeddings by @Pratik960 in #67
Feat: Add initial support for VoyageAIEmbeddings by @chonknick in #85
Fix: Enhance NeuralChunker by adding support for multiple models, custom… by @chonknick in #86
Add the local DOCS.md for consistency and ease by @chonknick in #89
Update DOCS.md by @chonknick in #90
Add BaseHandshake and BasePorter by @chonknick in #91
Update cookbook by @chonknick in #92
Feat: Add ChromaHandshake + QdrantHandshake + Document + Chomp + More by @chonknick in #93
Feat: Add support for OpenAIGenie + TurbopufferHandshake by @chonknick in #95
Update README.md by @chonknick in #96
Add warning for experimental status of TurbopufferHandshake in turbop… by @chonknick in #97
Feat: Viz: Added a new dark theme, separated the light and the dark themes by @Udayk02 in #82
Feat: Add dark mode to chonkie's Visualizer by @chonknick in #98
Update DOCS.md by @chonknick in #99

New Contributors

@Udayk02 made their first contribution in #82

Full Changelog: v1.0.6...v1.0.7

@not-lain

✨ Highlights

Welcome Chonkie's very own agentic chunker, SlumberChunker! Requires the genie optional install to work. Genies are Chonkie's Generative Inference Engines, which let any generative model or API plug in to Chonkie easily. Currently, the genie optional install installs dependencies for GeminiGenie, and requires a GEMINI_API_KEY to work properly.

pip install "chonkie[genie]"

# Import
from chonkie import SlumberChunker

# Initialize
chunker = SlumberChunker(verbose=True)  # set verbose to True, since it takes a while~

# CHONK!
chunks = chunker(text)

A fully neural approach to chunking, NeuralChunker! Requires the neural optional install to work. This uses a BERT-like model that's fine-tuned for chunking, making it really fast and high-quality. Second only to SlumberChunker in terms of chunk quality.


pip install "chonkie[neural]"

# Import
from chonkie import NeuralChunker

# initialize
chunker = NeuralChunker()

# CHONK!
chunks = chunker(text) 

Added auto language detection for CodeChunker! Now you can just pass in code without having to specify the language beforehand; it will detect the language by itself. The detection latency is minimal (sub-millisecond) but not zero, so specify the language explicitly if performance matters to you.

# Import
from chonkie import CodeChunker

# Initialize the "auto" CodeChunker
chunker = CodeChunker() # No need to specify, "auto" by default

# CHONK!
chunks = chunker(code) 

Added Genies! Since this version features our first generative feature, the SlumberChunker, we added the Genie to work with it as well as future generative features. Genies are Chonkie's way to handle multiple APIs and model interfaces working together for chunking~ The first Genie to be added is the GeminiGenie, which works with Gemini models. Requires the genie optional install.

pip install "chonkie[genie]"

# Import
from chonkie import GeminiGenie

# Init
genie = GeminiGenie(api_key=YOUR_API_KEY)

# generate
genie.generate("Hi!") 

# generate JSON
genie.generate_json("Hi", JSON_SCHEMA)

What's Changed

Feat : update test workflow to run on prs even before merging by @not-lain in #69
Feat: Run CI/CD on PRs as well by @chonknick in #70
Feat: Add support for auto-detecting language in CodeChunker via Magika by @chonknick in #71
Feat: Add initial support for SlumberChunker by @chonknick in #73
Feat: Add auto language support for CodeChunker + Add initial support for SlumberChunker by @chonknick in #74
Update version to 1.0.6a0 in pyproject.toml and init.py for upcom… by @chonknick in #75
Fix: Resolve pickling error in BaseTokenizer by replacing lambda by @Harsha-Karimikonda in #76
Fix: SlumberChunker's splitting algorithm from RecursiveChunker to custom + Pickling issue in BaseTokenizer by @chonknick in #78
Feat: Add initial support for the NeuralChunker by @chonknick in #77
Feat: Add support for NeuralChunker + Bump version to v1.0.6 by @chonknick in #79
Update README.md by @chonknick in #80

New Contributors

@not-lain made their first contribution in #69

Full Changelog: v1.0.5...v1.0.6

@not-lain

What's Changed

Feat : update test workflow to run on prs even before merging by @not-lain in #69
Feat: Run CI/CD on PRs as well by @chonknick in #70
Feat: Add support for auto-detecting language in CodeChunker via Magika by @chonknick in #71
Feat: Add initial support for SlumberChunker by @chonknick in #73
Feat: Add auto language support for CodeChunker + Add initial support for SlumberChunker by @chonknick in #74
Update version to 1.0.6a0 in pyproject.toml and init.py for upcom… by @chonknick in #75

New Contributors

@not-lain made their first contribution in #69

Full Changelog: v1.0.5...v1.0.6a0

@shreyash-chonkie

✨ Highlights

This is a quick patch release to include CodeChunker in the __init__.py for chonkie so it can be properly accessed via from chonkie import CodeChunker.

What's Changed

Fix: Add Code to init files by @shreyash-chonkie in #63
Fix: Export CodeChunker properly to the main chonkie.__init__ + Bump up version to v1.0.5 by @chonknick in #64

Full Changelog: v1.0.4...v1.0.5

@chonknick

✨ Highlights

Welcome our newest chunker to the family: CodeChunker! CodeChunker is specialized to handle code files and can gain structural understanding of the code before chunking each portion separately~ Supports 100+ programming languages! Let's check out the usage~

Firstly, install the code chunker dependencies via:

pip install "chonkie[code]"

and then simply run it like any other chunker~

# Import
from chonkie import CodeChunker

# Init
chunker = CodeChunker(language="python")

# Get some python code
code = ...

# CHONK!
chunks = chunker(code)

Added support for JinaAI embeddings with JinaEmbeddings — allowing for use with SemanticChunker and SDPMChunker!

Install it via the following command:

pip install "chonkie[jina]"

and use it like this~

# Import 
from chonkie import JinaEmbeddings 
from chonkie import SemanticChunker

# Optionally import Visualizer for visualizations
from chonkie import Visualizer

# Init
viz = Visualizer()
embeddings = JinaEmbeddings()
chunker = SemanticChunker(embeddings) 


# Get some text
text = ...

# CHONK!
chunks = chunker(text) 

# Optional: Visualize!
viz(chunks)

Added support for the OverlapRefinery which allows you to add overlap context to your chunks~ It's available in the default install and can be used with any chunker. Just chunk with a chunker and pass your chunks through the refinery!

# import
from chonkie import RecursiveChunker, OverlapRefinery

# Init
chunker = RecursiveChunker()
refinery = OverlapRefinery("gpt2") # By default initializes with "character" tokenizer, can pass in "gpt2" to match the chunker

# Get some text 
text = ...

# CHONK and Refine!
chunks = chunker(text) 
chunks = refinery(chunks)

Added support for the EmbeddingsRefinery, which allows you to run the chunks through an embedding model and have the embeddings available for downstream loading into a vector database. Similar to the OverlapRefinery, just pass the chunks from a Chunker into an EmbeddingsRefinery object loaded with the appropriate embedding model, and each Chunk will then carry an .embedding value which can be used downstream.
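
The flow described above (chunks go in, the same chunks come out with an .embedding attached) can be sketched without any model download. This is a toy stand-in, not Chonkie's implementation: ToyChunk, toy_embed and ToyEmbeddingsRefinery are hypothetical names, and the "embedding" is just a normalized vowel-count vector so the example runs on the standard library alone.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ToyChunk:
    """Hypothetical stand-in for a chunk object with an optional embedding."""
    text: str
    embedding: Optional[List[float]] = None


def toy_embed(text: str) -> List[float]:
    """Hypothetical embedding: normalized counts of the vowels a, e, i, o, u."""
    counts = [float(text.lower().count(v)) for v in "aeiou"]
    total = sum(counts) or 1.0  # avoid division by zero for vowel-free text
    return [c / total for c in counts]


class ToyEmbeddingsRefinery:
    """Attaches an embedding to every chunk it is called on (illustration only)."""

    def __init__(self, embed_fn: Callable[[str], List[float]]) -> None:
        self.embed_fn = embed_fn

    def __call__(self, chunks: List[ToyChunk]) -> List[ToyChunk]:
        for chunk in chunks:
            chunk.embedding = self.embed_fn(chunk.text)
        return chunks


chunks = [ToyChunk("Chonkie chunks text."), ToyChunk("Refineries add context.")]
chunks = ToyEmbeddingsRefinery(toy_embed)(chunks)
for chunk in chunks:
    print(chunk.text, len(chunk.embedding))
```

The point of the pattern is that the refinery mutates and returns the same list of chunks, so it slots in after any chunker without changing downstream code.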

What's Changed

Add initial support for chunking code via CodeChunker by @chonknick in #53
Add Initial support for code chunking via CodeChunker by @chonknick in #54
Feat: Add Jina AI Embeddings support by @Harsha-Karimikonda in #35
Feat: Add support for JinaEmbeddings + OverlapRefinery + EmbeddingsRefinery by @chonknick in #57
Fix: Paths for the Chonkie Cloud chunkers; No module named 'chonkie.cloud.chunkers' error by @chonknick in #58
Fix: Attempt fixing chonkie.cloud path bug + update README.md to have integrations by @chonknick in #60

New Contributors

@Harsha-Karimikonda made their first contribution in #35

Full Changelog: v1.0.3...v1.0.4

@chonknick

✨ Highlights

The new Chonkie Visualizer is here! You can now view your chunks, understand chunk quality and debug your chunker with visual feedback~ Use the print method to print rich text on your terminal or use the save method to save a highlighted HTML file on your device! It's very simple to use, just pass in your chunks~

from chonkie import Visualizer

viz = Visualizer()

# Print the chunks on the terminal with .print or directly call the Visualizer object too
viz.print(chunks) 

# Save the HTML file
viz.save("chonkie.html", chunks)

Chonkie now adds support for Recipes which allow you to use multilingual chunking out of the box, as well as document specific chunking methods. Initial support starts with: en, hi, zh, jp and ko, while document type markdown is supported too. Use it via the from_recipe class method with any chunker that takes delimiters or RecursiveRules.

from chonkie import RecursiveChunker

# Initialize the recursive chunker to chunk Markdown
chunker = RecursiveChunker.from_recipe("markdown", lang="en")

# Initialize the recursive chunker to chunk Hindi texts
chunker = RecursiveChunker.from_recipe(lang="hi")

Additional fixes for performance enhancements in RecursiveChunker, SentenceChunker, and WordTokenizer

What's Changed

Fix: Refactor SentenceChunker to remove estimate + feedback by @chonknick in #23
Fix: RecursiveRules.from_dict() showing Key error: 'levels' does not exist because of .pop() by @chonknick in #24
Fix: Remove .find for indexing in the RecursiveChunker for better efficiency by @Pratik960 in #26
Feat: Add Initial support for Recipes and multilingual CHONKs! by @chonknick in #27
Fix: RecursiveChunker whitespace splitting is not reconstructable (missing spaces) + (#26, #27) by @chonknick in #28
Update version to 1.0.3a1 in pyproject.toml and init.py for Chonkie by @chonknick in #29
Fix: WordTokenizer.count_tokens should use .tokenize instead of .encode by @chonknick in #36
Add Chonkie Vizard — easy chunk visualization with Visualizer by @chonknick in #39
Feat: Add Chonkie Vizard to main! Visualizer class for easy chunk visualizations by @chonknick in #40
Fix: Add chonkie.utils module to package list in pyproject.toml by @chonknick in #41
Fix: chonkie.cloud does not contain chunkers error + bump up version to v1.0.3 by @chonknick in #43

New Contributors

@Pratik960 made their first contribution in #26

Full Changelog: v1.0.2...v1.0.3

@Pratik960

What's Changed

Fix: Remove .find for indexing in the RecursiveChunker for better efficiency by @Pratik960 in #26
Feat: Add Initial support for Recipes and multilingual CHONKs! by @chonknick in #27
Fix: RecursiveChunker whitespace splitting is not reconstructable (missing spaces) + (#26, #27) by @chonknick in #28
Update version to 1.0.3a1 in pyproject.toml and init.py for Chonkie by @chonknick in #29

New Contributors

@Pratik960 made their first contribution in #26

Full Changelog: v1.0.3a0...v1.0.3a1
