Releases: chonkie-inc/chonkie
v1.0.10
What's Changed
- Fix: Add mypy stubs, `py.typed` and fix any remaining typing issues from mypy by @chonknick in #186
- Bump up the version to `v1.0.10` by @chonknick in #187
Full Changelog: v1.0.9...v1.0.10
v1.0.9
What's Changed
- Fix pickling issue in BaseTokenizer by @chonknick in #153
- Improve test coverage for OverlapRefinery by @chonknick in #154
- Feat: Enhance the tests for `OverlapRefinery`, `Tokenizers` and `SlumberChunker` by @chonknick in #155
- Feat: Enhance the tests for `SDPMChunker`, `SemanticChunker`, `NeuralChunker` and `Visualizer` by @chonknick in #156
- Feat: Add cloud `CodeChunker` + tests by @chonknick in #157
- Fix: `CodeChunker` fails multiprocessing due to pickling issues; Run it sequentially by @chonknick in #158
- Add GeminiEmbeddings Support by @chonknick in #159
- Fix: OverlapRefinery with `mode="recursive"` fails when level gets too low by @chonknick in #161
- Fix: `OverlapRefinery` is caching too much; revert `context_size` calculation caching by @chonknick in #162
- Feat: Create Cython functions for `split` and `merge` basic ops for chunking! by @chonknick in #163
- Feat: Add CPython optimized methods for `split` and `merge` ops for performance boosts! by @chonknick in #164
- Migrate Docs from `chonkie-docs` repo to `chonkie` for easier management + maintenance by @chonknick in #149
- Migrate DOCS.md by @chonknick in #165
- Update DOCS by @chonknick in #166
- Update DOCS.md + Add example for `JSONPorter` by @chonknick in #167
- Feat: Add support for `PsycopgHandshake` by @chonknick in #171
- Feat: Add `PsycopgHandshake` for initial support of `pgvector` by @chonknick in #172
- Update semantic.py to fix check if any of the split token counts are greater than the max chunk size by @geosmart in #175
- Fix: Refactor `PsycopgHandshake` to `PgvectorHandshake` via `vecs` by @chonknick in #174
- Update DOCS.md + Fix `SemanticChunker` bug for `all` comparison by @chonknick in #176
- Update version to 1.0.9 in pyproject.toml for the next release. by @chonknick in #177
- Feat: Add initial support for `experimental.CodeChunker`: improved code chunking for a few languages by @chonknick in #178
- Fix: `cython` module build failing during CD; add `cython` build info to `pyproject.toml` by @chonknick in #180
- Fix: Use `cibuildwheel` to build and publish wheels to PyPI by @chonknick in #182
- Fix: Update the CD script to use `uv` instead in the hopes of finally publishing the `v1.0.9` by @chonknick in #183
New Contributors
Full Changelog: v1.0.8...v1.0.9
v1.0.8
✨ Highlights
- Use `base_url` with `OpenAIEmbeddings` to use OpenAI API compatible embedding services!
- You can now give `AutoEmbeddings` a URI string with a provider alias to choose between different providers for a model simply and easily. Just do `AutoEmbeddings.get_embedding("model2vec://minishlab/potion-base-8M")` or, equivalently for the sentence_transformers version, `AutoEmbeddings.get_embedding("st://minishlab/potion-base-8M")`. This works with all the supported embeddings in Chonkie. As Chonkie grows, it will support more providers, and this notation makes it easy to choose between them.
- Added full support for the `chonkie.cloud` chunkers with updated support for `NeuralChunker` and `SlumberChunker`.
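To make the `provider://organization/model` notation concrete, here is a minimal sketch of how such a string could be split into its parts. `EmbeddingURI` and `parse_embedding_uri` are hypothetical names used only for illustration; this is not Chonkie's actual parsing code, and the fallback for bare model names is an assumption.

```python
from typing import NamedTuple


class EmbeddingURI(NamedTuple):
    """Hypothetical container for the parts of a provider URI."""

    provider: str
    model: str


def parse_embedding_uri(identifier: str) -> EmbeddingURI:
    """Split 'provider://organization/model' into provider and model name.

    Illustrative only (not Chonkie's parser). A bare model name without
    '://' is returned with an empty provider so a caller could fall back
    to its own model matching.
    """
    provider, sep, model = identifier.partition("://")
    if not sep:
        return EmbeddingURI(provider="", model=identifier)
    if not provider or not model:
        raise ValueError(f"Malformed embedding URI: {identifier!r}")
    return EmbeddingURI(provider=provider, model=model)


parsed = parse_embedding_uri("model2vec://minishlab/potion-base-8M")
print(parsed.provider)  # model2vec
print(parsed.model)     # minishlab/potion-base-8M
```

The same model name with the `st://` prefix would yield the provider `st`, which is how one string can select between providers for a single model.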
What's Changed
- Tutorial: Add example to show `SlumberChunker` with `OpenRouter` models by @chonknick in #103
- Tutorial: fix the tutorial to remove the `Invalid Notebook` error by @chonknick in #104
- Tutorial: Update the readability of the tutorial with better markdown by @chonknick in #105
- Feat: update workflow by @not-lain in #94
- Feat: Enhance the CI/CD Pipeline to have linting run in parallel by @chonknick in #106
- Fix: Linting errors + Add `cloud.SDPMChunker` + `cloud.LateChunker` by @chonknick in #107
- Feat: Add the `base_url` to the `OpenAIEmbeddings` and `**kwargs` support by @chonknick in #108
- Add `RAGHub` by @chonknick in #111
- Fix: Better `AutoEmbeddings` model matching + URI support for providers by @chonknick in #146
- Feat: Add `URI` support via `provider://organization/model` style strings in `AutoEmbeddings` by @chonknick in #148
- Enhance chunker module by adding NeuralChunker and SlumberChunker to the imports and `__all__` list for improved functionality. by @chonknick in #150
- Minor: fix error messages in `cloud.NeuralChunker` by @chonknick in #152
- Fix: Add `NeuralChunker` and `SlumberChunker` to the chonkie.cloud by @chonknick in #151
Full Changelog: v1.0.7...v1.0.8
v1.0.7
✨ Highlights
- Added initial support for `Handshakes` and `Porters` with 1 new `Porter` (`JSONPorter`) as well as 3 new `Handshakes` (`ChromaHandshake`, `QdrantHandshake` and `TurbopufferHandshake`)! Extremely simple to use: just import, init and call on the `List[Chunks]` that you get out of your chunkers/refineries, then query them for search!
- Added support for `OpenAIGenie`, which allows you to use the `OpenAI` models with the `SlumberChunker` and many more! Any API that supports the `OpenAI API` format will be usable with the `base_url` changed. Try out various LLMs with the `SlumberChunker` to see which works best for you!
- Added support for `VoyageAIEmbeddings` (thanks to @Pratik960): Use `VoyageAI` models with `EmbeddingsRefinery` and `SemanticChunkers` for your ingestion~
- Added new themes to the `Visualizer`: Dark mode and a retro `tiktokenizer` theme (thanks to @Udayk02)
- Fixes and performance improvements in the `NeuralChunker`
- New one-page markdown `DOCS.md` for easy answers about Chonkie with LLMs.
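The "import, init and call" pattern described for `Porters` can be illustrated with a small stand-alone sketch. `ToyChunk` and `ToyJSONPorter` below are hypothetical stand-ins written with the standard library only; they are not Chonkie's `Chunk` or `JSONPorter` classes, and the field names are assumptions.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class ToyChunk:
    """Hypothetical stand-in for a chunk produced by a chunker."""

    text: str
    start_index: int
    end_index: int
    token_count: int


class ToyJSONPorter:
    """Illustrative porter: init once, then call it on a list of chunks."""

    def __init__(self, indent: int = 2) -> None:
        self.indent = indent

    def __call__(self, chunks: List[ToyChunk]) -> str:
        # Serialize every chunk's fields into a single JSON array.
        return json.dumps([asdict(c) for c in chunks], indent=self.indent)


chunks = [ToyChunk("Hello ", 0, 6, 1), ToyChunk("world", 6, 11, 1)]
exported = ToyJSONPorter()(chunks)

# Round trip: the exported JSON carries enough to rebuild the chunks.
restored = [ToyChunk(**item) for item in json.loads(exported)]
```

A handshake follows the same shape, except that the call would write the chunks to a vector database instead of returning a string.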
What's Changed
- Feat: Add `tiktokenizer` theme to the `Visualizer` by @chonknick in #81
- Feat: Added support for VoyageAI embeddings by @Pratik960 in #67
- Feat: Add initial support for `VoyageAIEmbeddings` by @chonknick in #85
- Fix: Enhance `NeuralChunker` by adding support for multiple models, custom… by @chonknick in #86
- Add the local DOCS.md for consistency and ease by @chonknick in #89
- Update DOCS.md by @chonknick in #90
- Add `BaseHandshake` and `BasePorter` by @chonknick in #91
- Update cookbook by @chonknick in #92
- Feat: Add `ChromaHandshake` + `QdrantHandshake` + `Document` + `Chomp` + More by @chonknick in #93
- Feat: Add support for `OpenAIGenie` + `TurbopufferHandshake` by @chonknick in #95
- Update README.md by @chonknick in #96
- Add warning for experimental status of TurbopufferHandshake in turbop… by @chonknick in #97
- Feat: Viz: Added a new dark theme, separated the light and the dark themes by @Udayk02 in #82
- Feat: Add `dark` mode to chonkie's `Visualizer` by @chonknick in #98
- Update DOCS.md by @chonknick in #99
New Contributors
Full Changelog: v1.0.6...v1.0.7
v1.0.6
✨ Highlights
- Welcome Chonkie's very own agentic chunker, `SlumberChunker`! Requires the `genie` optional install to work. `Genie`s are Chonkie's Generative Inference Engines, which let any generative model or API plug in easily with Chonkie. Currently, the `genie` optional install installs dependencies for `GeminiGenie` and requires a `GEMINI_API_KEY` to work properly.
pip install "chonkie[genie]"
# Import
from chonkie import SlumberChunker
# Initialize
chunker = SlumberChunker(verbose=True) # set verbose to True, since it takes a while~
# CHONK!
chunker(text)
- A fully neural approach to chunking, `NeuralChunker`! Requires the `neural` optional install to work. This uses a BERT-like model that's fine-tuned for chunking, making it really fast and high-quality. Second only to `SlumberChunker` in terms of chunk quality.
pip install "chonkie[neural]"
# import
from chonkie import NeuralChunker
# initialize
chunker = NeuralChunker()
# CHONK!
chunks = chunker(text)
- Added `auto` language detection for `CodeChunker`! Now you can just pass in code without having to specify the language beforehand; it will detect the language by itself. While the latency for detection is minimal (sub-millisecond), it does affect performance, so please specify the language if performance matters to you.
# Import
from chonkie import CodeChunker
# Initialize the "auto" CodeChunker
chunker = CodeChunker() # No need to specify, "auto" by default
# CHONK!
chunks = chunker(code)
- Added `Genie`s! Since this version features our first generative feature, the `SlumberChunker`, we added the `Genie` to work with it as well as future generative features. `Genie`s are Chonkie's way to handle multiple APIs and model interfaces working together for chunking~ The first `Genie` to be added is the `GeminiGenie`, which works with `Gemini` models. Requires the `genie` optional install.
pip install "chonkie[genie]"
# Import
from chonkie import GeminiGenie
# Init
genie = GeminiGenie(api_key=YOUR_API_KEY)
# generate
genie.generate("Hi!")
# generate JSON
genie.generate_json("Hi", JSON_SCHEMA)
What's Changed
- Feat : update test workflow to run on prs even before merging by @not-lain in #69
- Feat: Run CI/CD on PRs as well by @chonknick in #70
- Feat: Add support for auto-detecting language in `CodeChunker` via `Magika` by @chonknick in #71
- Feat: Add initial support for `SlumberChunker` by @chonknick in #73
- Feat: Add `auto` language support for `CodeChunker` + Add initial support for `SlumberChunker` by @chonknick in #74
- Update version to 1.0.6a0 in pyproject.toml and `__init__.py` for upcom… by @chonknick in #75
- Fix: Resolve pickling error in BaseTokenizer by replacing lambda by @Harsha-Karimikonda in #76
- Fix: `SlumberChunker`'s splitting algorithm from `RecursiveChunker` to custom + Pickling issue in `BaseTokenizer` by @chonknick in #78
- Feat: Add initial support for the `NeuralChunker` by @chonknick in #77
- Feat: Add support for `NeuralChunker` + Bump version to `v1.0.6` by @chonknick in #79
- Update README.md by @chonknick in #80
New Contributors
Full Changelog: v1.0.5...v1.0.6
v1.0.6a0
What's Changed
- Feat : update test workflow to run on prs even before merging by @not-lain in #69
- Feat: Run CI/CD on PRs as well by @chonknick in #70
- Feat: Add support for auto-detecting language in `CodeChunker` via `Magika` by @chonknick in #71
- Feat: Add initial support for `SlumberChunker` by @chonknick in #73
- Feat: Add `auto` language support for `CodeChunker` + Add initial support for `SlumberChunker` by @chonknick in #74
- Update version to 1.0.6a0 in pyproject.toml and `__init__.py` for upcom… by @chonknick in #75
New Contributors
Full Changelog: v1.0.5...v1.0.6a0
v1.0.5
✨ Highlights
- This is a quick patch release to include `CodeChunker` in the `__init__.py` for chonkie so it can be properly accessed via `from chonkie import CodeChunker`.
What's Changed
- Fix: Add Code to init files by @shreyash-chonkie in #63
- Fix: Export `CodeChunker` properly to the main `chonkie.__init__` + Bump up version to `v1.0.5` by @chonknick in #64
Full Changelog: v1.0.4...v1.0.5
v1.0.4
✨ Highlights
- Welcome our newest chunker to the family: `CodeChunker`! `CodeChunker` is specialized to handle code files and can gain structural understanding of the code before chunking each portion separately~ Supports 100+ programming languages! Let's check out the usage~
Firstly, install the code chunker dependencies via:
pip install "chonkie[code]"
and then simply run it like any other chunker~
# Import
from chonkie import CodeChunker
# Init
chunker = CodeChunker(language="python")
# Get some python code
code = ...
# CHONK!
chunks = chunker(code)
- Added support for `JinaAI` embeddings with `JinaEmbeddings`, allowing for use with `SemanticChunker` and `SDPMChunker`!
Install it via the following command:
pip install "chonkie[jina]"
and use it like this~
# Import
from chonkie import JinaEmbeddings
from chonkie import SemanticChunker
# Optionally import Visualizer for visualizations
from chonkie import Visualizer
# Init
viz = Visualizer()
embeddings = JinaEmbeddings()
chunker = SemanticChunker(embeddings)
# Get some text
text = ...
# CHONK!
chunks = chunker(text)
# Optional: Visualize!
viz(chunks)
- Added support for the `OverlapRefinery`, which allows you to add overlap context to your chunks~ It's available in the default install and can be used with any chunker. Just chunk with a chunker and pass your chunks through the refinery!
# import
from chonkie import RecursiveChunker, OverlapRefinery
# Init
chunker = RecursiveChunker()
refinery = OverlapRefinery("gpt2") # By default initializes with "character" tokenizer, can pass in "gpt2" to match the chunker
# Get some text
text = ...
# CHONK and Refine!
chunks = chunker(text)
chunks = refinery(chunks)
- Added support for the `EmbeddingsRefinery`, which allows you to run the chunks through an embedding model and have the embeddings available for downstream loading into a vector database. Similar to the `OverlapRefinery`, just pass the `chunks` from a `Chunker` into an `EmbeddingsRefinery` object loaded with the appropriate embedding model, and each `Chunk` will then carry an `.embedding` value which can be used downstream.
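The refinery pattern described above (chunks in, the same chunks out with an `.embedding` attached) can be sketched without any model dependency. Everything here is hypothetical: `ToyChunk`, `toy_embed` and `ToyEmbeddingsRefinery` are illustrative stand-ins, not Chonkie's classes, and the "embedding" is just two hand-computed numbers.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class ToyChunk:
    """Hypothetical chunk with an optional embedding slot."""

    text: str
    embedding: Optional[List[float]] = None


def toy_embed(texts: Sequence[str]) -> List[List[float]]:
    """Placeholder embedding: [character count, word count] per text."""
    return [[float(len(t)), float(len(t.split()))] for t in texts]


class ToyEmbeddingsRefinery:
    """Illustrative refinery that attaches one vector to each chunk."""

    def __init__(
        self, embed_batch: Callable[[Sequence[str]], List[List[float]]]
    ) -> None:
        self.embed_batch = embed_batch

    def __call__(self, chunks: List[ToyChunk]) -> List[ToyChunk]:
        # Embed all chunk texts in one batch, then attach the vectors in order.
        vectors = self.embed_batch([c.text for c in chunks])
        for chunk, vector in zip(chunks, vectors):
            chunk.embedding = vector
        return chunks


refinery = ToyEmbeddingsRefinery(toy_embed)
chunks = refinery([ToyChunk("hello world"), ToyChunk("chonk")])
print(chunks[0].embedding)  # [11.0, 2.0]
```

Swapping `toy_embed` for a real embedding model's batch function is the only change the pattern would need; the chunks stay the same objects throughout.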
What's Changed
- Add initial support for chunking code via `CodeChunker` by @chonknick in #53
- Add Initial support for code chunking via `CodeChunker` by @chonknick in #54
by @chonknick in #54 - Feat: Add Jina AI Embeddings support by @Harsha-Karimikonda in #35
- Feat: Add support for JinaEmbeddings + OverlapRefinery + EmbeddingsRefinery by @chonknick in #57
- Fix: Paths for the Chonkie Cloud chunkers; `No module named 'chonkie.cloud.chunkers'` error by @chonknick in #58
- Fix: Attempt fixing `chonkie.cloud` path bug + update README.md to have integrations by @chonknick in #60
New Contributors
- @Harsha-Karimikonda made their first contribution in #35
Full Changelog: v1.0.3...v1.0.4
v1.0.3
✨ Highlights
- The new Chonkie `Visualizer` is here! You can now view your chunks, understand chunk quality and debug your chunker with visual feedback~ Use the `print` method to print rich text on your terminal or use the `save` method to save a highlighted `html` file on your device! It's very simple to use, just pass in your chunks~
from chonkie import Visualizer
viz = Visualizer()
# Print the chunks on the terminal with .print or directly call the Visualizer object too
viz.print(chunks)
# Save the HTML file
viz.save("chonkie.html", chunks)
- Chonkie now adds support for `Recipes`, which allow you to use multilingual chunking out of the box, as well as document-specific chunking methods. Initial support starts with: `en`, `hi`, `zh`, `jp` and `ko`, while document type `markdown` is supported too. Use it via the `from_recipe` class method with any chunker that takes delimiters or `RecursiveRules`.
from chonkie import RecursiveChunker
# Initialize the recursive chunker to chunk Markdown
chunker = RecursiveChunker.from_recipe("markdown", lang="en")
# Initialize the recursive chunker to chunk Hindi texts
chunker = RecursiveChunker.from_recipe(lang="hi")
- Additional fixes for performance enhancements in `RecursiveChunker`, `SentenceChunker`, and `WordTokenizer`
What's Changed
- Fix: Refactor `SentenceChunker` to remove estimate + feedback by @chonknick in #23
- Fix: `RecursiveRules.from_dict()` showing `Key error: 'levels' does not exist` because of `.pop()` by @chonknick in #24
- Fix: Remove `.find` for indexing in the `RecursiveChunker` for better efficiency by @Pratik960 in #26
- Feat: Add Initial support for `Recipes` and multilingual CHONKs! by @chonknick in #27
- Fix: `RecursiveChunker` whitespace splitting is not reconstructable (missing spaces) + (#26, #27) by @chonknick in #28
- Update version to 1.0.3a1 in pyproject.toml and `__init__.py` for Chonkie by @chonknick in #29
- Fix: `WordTokenizer.count_tokens` should use `.tokenize` instead of `.encode` by @chonknick in #36
- Add Chonkie Vizard: easy chunk visualization with `Visualizer` by @chonknick in #39
- Feat: Add Chonkie Vizard to main! `Visualizer` class for easy chunk visualizations by @chonknick in #40
- Fix: Add `chonkie.utils` module to package list in `pyproject.toml` by @chonknick in #41
- Fix: `chonkie.cloud` does not contain `chunkers` error + bump up version to `v1.0.3` by @chonknick in #43
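The whitespace fix in #28 concerns reconstructability: joining the pieces of a split should give back the original text exactly. A minimal sketch of that property follows, assuming a hypothetical helper `split_keep_whitespace`; it illustrates the idea only and is not the code used in `RecursiveChunker`.

```python
import re
from typing import List


def split_keep_whitespace(text: str) -> List[str]:
    """Split on whitespace while keeping every whitespace character.

    Each piece is a word plus the whitespace that follows it; leading
    whitespace becomes its own piece. Hypothetical helper for illustration.
    """
    return re.findall(r"\S+\s*|\s+", text)


text = "Hello  world\nfoo "
pieces = split_keep_whitespace(text)

# Reconstructable: nothing is lost, so joining restores the input.
assert "".join(pieces) == text

# By contrast, a plain str.split() drops the original spacing.
assert " ".join(text.split()) != text
```

Keeping the delimiters attached to the pieces is what lets chunk start and end indices line up with the source text.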
New Contributors
- @Pratik960 made their first contribution in #26
Full Changelog: v1.0.2...v1.0.3
v1.0.3a1
What's Changed
- Fix: Remove `.find` for indexing in the `RecursiveChunker` for better efficiency by @Pratik960 in #26
- Feat: Add Initial support for `Recipes` and multilingual CHONKs! by @chonknick in #27
- Fix: `RecursiveChunker` whitespace splitting is not reconstructable (missing spaces) + (#26, #27) by @chonknick in #28
- Update version to 1.0.3a1 in pyproject.toml and `__init__.py` for Chonkie by @chonknick in #29
New Contributors
- @Pratik960 made their first contribution in #26
Full Changelog: v1.0.3a0...v1.0.3a1