Releases · chonkie-inc/chonkie · GitHub

Releases: chonkie-inc/chonkie

v1.0.10

07 Jun 19:34
02c2e14

What's Changed

  • Fix: Add mypy stubs, py.typed and fix any remaining typing issues from mypy by @chonknick in #186
  • Bump up the version to v1.0.10 by @chonknick in #187

Full Changelog: v1.0.9...v1.0.10

v1.0.9

04 Jun 22:39
98803b5

What's Changed

  • Fix pickling issue in BaseTokenizer by @chonknick in #153
  • Improve test coverage for OverlapRefinery by @chonknick in #154
  • Feat: Enhance the tests for OverlapRefinery, Tokenizers and SlumberChunker by @chonknick in #155
  • Feat: Enhance the tests for SPDMChunker, SemanticChunker, NeuralChunker and Visualizer by @chonknick in #156
  • Feat: Add cloud CodeChunker + tests by @chonknick in #157
  • Fix: CodeChunker fails multiprocessing due to pickling issues; Run it sequentially by @chonknick in #158
  • Add GeminiEmbeddings Support by @chonknick in #159
  • Fix: OverlapRefinery with mode="recursive" fails with level gets too low by @chonknick in #161
  • Fix: OverlapRefinery is caching too much; revert context_size calculation caching by @chonknick in #162
  • Feat: Create Cython functions for split and merge basic ops for chunking! by @chonknick in #163
  • Feat: Add CPython optimized methods for split and merge ops for performance boosts! by @chonknick in #164
  • Migrate Docs from chonkie-docs repo to chonkie for easier management + maintenance by @chonknick in #149
  • Migrate DOCS.md by @chonknick in #165
  • Update DOCS by @chonknick in #166
  • Update DOCS.md + Add example for JSONPorter by @chonknick in #167
  • Feat: Add support for PsycopgHandshake by @chonknick in #171
  • Feat: Add PsycopgHandshake for initial support of pgvector by @chonknick in #172
  • Update semantic.py to fix check if any of the split token counts are greater than the max chunk size by @geosmart in #175
  • Fix: Refactor PsycopgHandshake to PgvectorHandshake via vecs by @chonknick in #174
  • Update DOCS.md + Fix SemanticChunker bug for all comparison by @chonknick in #176
  • Update version to 1.0.9 in pyproject.toml for the next release. by @chonknick in #177
  • Feat: Add initial support for experimental.CodeChunker — improved code chunking for few languages by @chonknick in #178
  • Fix: cython module build failing during CD — Add cython build info to pyproject.toml by @chonknick in #180
  • Fix: Use cibuildwheel to build and publish wheels to PyPI by @chonknick in #182
  • Fix: Update the CD script to use uv instead in the hopes of finally publishing the v1.0.9 by @chonknick in #183

Full Changelog: v1.0.8...v1.0.9

v1.0.8

22 May 11:18

✨ Highlights

  • Use base_url with OpenAIEmbeddings to use OpenAI API compatible embedding services!
  • You can now pass AutoEmbeddings a URI string whose scheme is a provider alias, making it easy to choose between providers for the same model. Call AutoEmbeddings.get_embedding("model2vec://minishlab/potion-base-8M") or, for the sentence_transformers version, AutoEmbeddings.get_embedding("st://minishlab/potion-base-8M"). This works with all the embeddings Chonkie supports. As Chonkie grows to support more providers, this notation lets you choose between them easily.
  • Added full support for the chonkie.cloud chunkers with updated support for NeuralChunker and SlumberChunker.
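
The provider://organization/model notation splits naturally into a provider alias and a model identifier. The sketch below shows that split using only the standard library. It assumes nothing about how AutoEmbeddings resolves the string internally; parse_embedding_uri and EmbeddingURI are hypothetical names used only for illustration.

```python
from typing import NamedTuple


class EmbeddingURI(NamedTuple):
    """Hypothetical container for the two halves of a provider URI."""

    provider: str
    model: str


def parse_embedding_uri(uri: str) -> EmbeddingURI:
    """Split 'provider://organization/model' into provider and model parts."""
    provider, sep, model = uri.partition("://")
    if not sep or not provider or not model:
        raise ValueError(f"expected 'provider://organization/model', got {uri!r}")
    return EmbeddingURI(provider=provider, model=model)


# The two URIs from the highlight above differ only in the provider alias.
parsed = parse_embedding_uri("model2vec://minishlab/potion-base-8M")
print(parsed.provider)
print(parsed.model)
```

Both example URIs name the same model, so only the provider part changes between them.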

What's Changed

  • Tutorial: Add example to show SlumberChunker with OpenRouter models by @chonknick in #103
  • Tutorial: fix the tutorial to remove the Invalid Notebook error by @chonknick in #104
  • Tutorial: Update the readability of the tutorial with better markdown by @chonknick in #105
  • Feat: update workflow by @not-lain in #94
  • Feat: Enhance the CI/CD Pipeline to have linting run in parallel by @chonknick in #106
  • Fix: Linting errors + Add cloud.SDPMChunker + cloud.LateChunker by @chonknick in #107
  • Feat: Add the base_url to the OpenAIEmbeddings and **kwargs support by @chonknick in #108
  • Add RAGHub by @chonknick in #111
  • Fix: Better AutoEmbeddings model matching + URI support for providers by @chonknick in #146
  • Feat: Add URI support via provider://organization/model style strings in AutoEmbeddings by @chonknick in #148
  • Enhance chunker module by adding NeuralChunker and SlumberChunker to the imports and all list for improved functionality. by @chonknick in #150
  • Minor: fix error messages in cloud.NeuralChunker by @chonknick in #152
  • Fix: Add NeuralChunker and SlumberChunker to the chonkie.cloud by @chonknick in #151

Full Changelog: v1.0.7...v1.0.8

v1.0.7

06 May 00:05
f60b690

✨ Highlights

  • Added initial support for Handshakes and Porters with 1 new Porter (JSONPorter) and 3 new Handshakes (ChromaHandshake, QdrantHandshake and TurbopufferHandshake)! Extremely simple to use: just import, init and call on the List[Chunk] that you get out of your chunkers/refineries, then query them for search!
  • Added support for OpenAIGenie which allows you to use the OpenAI models with the SlumberChunker and many more! Any API that supports the OpenAI API format will be usable with the base_url changed. Try out various LLMs with the SlumberChunker to see which works best for you!
  • Added support for VoyageAIEmbeddings (thanks to @Pratik960): Use VoyageAI models with EmbeddingsRefinery and SemanticChunkers for your ingestion~
  • Added new themes to the Visualizer: Dark mode and a retro tiktokenizer theme (thanks to @Udayk02)
  • Fixes and performance improvements in the NeuralChunker
  • New one-page markdown DOCS.md for easy answers about Chonkie with LLMs.
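
A Porter exports chunks to another format. As a rough illustration of the idea behind JSONPorter, the sketch below serializes chunk records to JSON with the standard library only. SimpleChunk and export_chunks_to_json are hypothetical stand-ins; Chonkie's own Chunk type and the JSONPorter interface may differ.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class SimpleChunk:
    """Hypothetical stand-in for a chunk produced by a chunker."""

    text: str
    start_index: int
    end_index: int
    token_count: int


def export_chunks_to_json(chunks: List[SimpleChunk]) -> str:
    """Serialize chunks to a JSON array string, one object per chunk."""
    return json.dumps([asdict(chunk) for chunk in chunks], indent=2)


chunks = [
    SimpleChunk("Hello, ", 0, 7, 2),
    SimpleChunk("Chonkie!", 7, 15, 2),
]
payload = export_chunks_to_json(chunks)
print(payload)
```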

What's Changed

  • Feat: Add tiktokenizer theme to the Visualizer by @chonknick in #81
  • Feat: Added support for VoyageAI embeddings by @Pratik960 in #67
  • Feat: Add initial support for VoyageAIEmbeddings by @chonknick in #85
  • Fix: Enhance NeuralChunker by adding support for multiple models, custom… by @chonknick in #86
  • Add the local DOCS.md for consistency and ease by @chonknick in #89
  • Update DOCS.md by @chonknick in #90
  • Add BaseHandshake and BasePorter by @chonknick in #91
  • Update cookbook by @chonknick in #92
  • Feat: Add ChromaHandshake + QdrantHandshake + Document + Chomp + More by @chonknick in #93
  • Feat: Add support for OpenAIGenie + TurbopufferHandshake by @chonknick in #95
  • Update README.md by @chonknick in #96
  • Add warning for experimental status of TurbopufferHandshake in turbop… by @chonknick in #97
  • Feat: Viz: Added a new dark theme, separated the light and the dark themes by @Udayk02 in #82
  • Feat: Add dark mode to chonkie's Visualizer by @chonknick in #98
  • Update DOCS.md by @chonknick in #99

Full Changelog: v1.0.6...v1.0.7

v1.0.6

28 Apr 14:44
a6f3631

✨ Highlights

  1. Welcome Chonkie's very own agentic chunker, SlumberChunker! Requires the genie optional install to work. Genies are Chonkie's Generative Inference Engines, which let any generative model or API plug in to Chonkie easily. Currently, the genie optional install installs dependencies for GeminiGenie and requires a GEMINI_API_KEY to work properly.

pip install "chonkie[genie]"

# Import
from chonkie import SlumberChunker

# Initialize
chunker = SlumberChunker(verbose=True)  # set verbose to True, since it takes a while~

# Get some text
text = ...

# CHONK!
chunks = chunker(text)
  1. A fully neural approach to chunking, NeuralChunker! Requires the neural optional install to work. This uses a BERT-like model that's fine-tuned for chunking, making it really fast and high-quality. Second only to SlumberChunker in terms of chunk quality.
pip install "chonkie[neural]"
# import 
from chonkie import NeuralChunker

# initialize
chunker = NeuralChunker()

# CHONK!
chunks = chunker(text) 
  1. Added auto language detection for CodeChunker! Now you can just pass in code without having to specify the language beforehand; it will detect the language by itself. While the latency for detection is minimal (sub-millisecond), it does affect performance, so please specify the language if that matters to you.
# Import
from chonkie import CodeChunker

# Initialize the "auto" CodeChunker
chunker = CodeChunker() # No need to specify, "auto" by default

# CHONK!
chunks = chunker(code) 
  1. Added Genies! Since this version features our first generative feature, the SlumberChunker, we added the Genie to work with it as well as future generative features. Genies are Chonkie's way to handle multiple APIs and model interfaces working together for chunking~ The first Genie to be added is the GeminiGenie, which works with Gemini models. Requires the genie optional install.
pip install "chonkie[genie]"
# Import 
from chonkie import GeminiGenie

# Init
genie = GeminiGenie(api_key=YOUR_API_KEY)

# generate
genie.generate("Hi!") 

# generate JSON
genie.generate_json("Hi", JSON_SCHEMA) 

What's Changed

  • Feat : update test workflow to run on prs even before merging by @not-lain in #69
  • Feat: Run CI/CD on PRs as well by @chonknick in #70
  • Feat: Add support for auto-detecting language in CodeChunker via Magika by @chonknick in #71
  • Feat: Add initial support for SlumberChunker by @chonknick in #73
  • Feat: Add auto language support for CodeChunker + Add initial support for SlumberChunker by @chonknick in #74
  • Update version to 1.0.6a0 in pyproject.toml and init.py for upcom… by @chonknick in #75
  • Fix: Resolve pickling error in BaseTokenizer by replacing lambda by @Harsha-Karimikonda in #76
  • Fix: SlumberChunker's splitting algorithm from RecursiveChunker to custom + Pickling issue in BaseTokenizer by @chonknick in #78
  • Feat: Add initial support for the NeuralChunker by @chonknick in #77
  • Feat: Add support for NeuralChunker + Bump version to v1.0.6 by @chonknick in #79
  • Update README.md by @chonknick in #80
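
The pickling fixes above (#76, #78) concern a general Python limitation: a lambda has no importable name, so pickle cannot serialize it, and any object that holds one fails to pickle too. The sketch below shows the limitation and the usual remedy with a defaultdict. It is a generic illustration, not the actual BaseTokenizer patch.

```python
import pickle
from collections import defaultdict

# A defaultdict whose factory is a lambda cannot be pickled,
# because pickle stores functions by importable name.
counts_with_lambda = defaultdict(lambda: 0)
counts_with_lambda["chonk"] += 1

try:
    pickle.dumps(counts_with_lambda)
    lambda_pickled = True
except (pickle.PicklingError, AttributeError):
    lambda_pickled = False

# Replacing the lambda with a named, importable callable
# (here the built-in int) makes the same object picklable.
counts_with_named = defaultdict(int)
counts_with_named["chonk"] += 1
restored = pickle.loads(pickle.dumps(counts_with_named))

print(lambda_pickled)
print(restored["chonk"])
```

This matters for multiprocessing, which pickles objects to send them to worker processes.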

Full Changelog: v1.0.5...v1.0.6

v1.0.6a0

26 Apr 10:29
f4f0d1a
Pre-release

What's Changed

  • Feat : update test workflow to run on prs even before merging by @not-lain in #69
  • Feat: Run CI/CD on PRs as well by @chonknick in #70
  • Feat: Add support for auto-detecting language in CodeChunker via Magika by @chonknick in #71
  • Feat: Add initial support for SlumberChunker by @chonknick in #73
  • Feat: Add auto language support for CodeChunker + Add initial support for SlumberChunker by @chonknick in #74
  • Update version to 1.0.6a0 in pyproject.toml and init.py for upcom… by @chonknick in #75

Full Changelog: v1.0.5...v1.0.6a0

v1.0.5

22 Apr 01:18
2e3d242

✨ Highlights

  • This is a quick patch release to include CodeChunker in the __init__.py for chonkie so it can be properly accessed via from chonkie import CodeChunker.

What's Changed

  • Fix: Add Code to init files by @shreyash-chonkie in #63
  • Fix: Export CodeChunker properly to the main chonkie.__init__ + Bump up version to v1.0.5 by @chonknick in #64

Full Changelog: v1.0.4...v1.0.5

v1.0.4

21 Apr 11:06
33d3ff5

✨ Highlights

  1. Welcome our newest chunker to the family: CodeChunker! CodeChunker is specialized to handle code files and can gain structural understanding of the code before chunking each portion separately~ Supports 100+ programming languages! Let's check out the usage~

Firstly, install the code chunker dependencies via:

pip install "chonkie[code]"

and then simply run it like any other chunker~

# Import
from chonkie import CodeChunker

# Init
chunker = CodeChunker(language="python")

# Get some python code
code = ...

# CHONK!
chunks = chunker(code) 
  1. Added support for JinaAI embeddings with JinaEmbeddings — allowing for use with SemanticChunker and SDPMChunker!

Install it via the following command:

pip install "chonkie[jina]"

and use it like this~

# Import 
from chonkie import JinaEmbeddings 
from chonkie import SemanticChunker

# Optionally import Visualizer for visualizations
from chonkie import Visualizer

# Init
viz = Visualizer()
embeddings = JinaEmbeddings()
chunker = SemanticChunker(embeddings) 


# Get some text
text = ...

# CHONK!
chunks = chunker(text) 

# Optional: Visualize!
viz(chunks) 
  1. Added support for the OverlapRefinery which allows you to add overlap context to your chunks~ It's available in the default install and can be used with any chunker. Just chunk with a chunker and pass your chunks through the refinery!
# import
from chonkie import RecursiveChunker, OverlapRefinery

# Init
chunker = RecursiveChunker()
refinery = OverlapRefinery("gpt2") # By default initializes with "character" tokenizer, can pass in "gpt2" to match the chunker

# Get some text 
text = ...

# CHONK and Refine!
chunks = chunker(text) 
chunks = refinery(chunks)
  1. Added support for the EmbeddingsRefinery, which lets you run the chunks through an embedding model and have the embeddings available for downstream loading into a vector database. As with the OverlapRefinery, just pass the chunks from a Chunker into an EmbeddingsRefinery object loaded with the appropriate embedding model; each Chunk will then carry an .embedding value that can be used downstream.
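
The idea behind the EmbeddingsRefinery can be sketched without any model at all: take chunks, compute a vector for each, and store it on the chunk. Everything in the block below is a hypothetical stand-in (SimpleChunk, toy_embed, refine_with_embeddings); the toy "embedding" is just two character counts, and the real EmbeddingsRefinery constructor and call signature are not shown here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class SimpleChunk:
    """Hypothetical stand-in for a chunk; embedding is filled in later."""

    text: str
    embedding: Optional[List[float]] = None


def toy_embed(text: str) -> List[float]:
    """Toy 'embedding' (length and vowel count), NOT a real embedding model."""
    vowels = sum(ch in "aeiouAEIOU" for ch in text)
    return [float(len(text)), float(vowels)]


def refine_with_embeddings(
    chunks: List[SimpleChunk], embed: Callable[[str], List[float]]
) -> List[SimpleChunk]:
    """Attach an embedding to every chunk and return the same list."""
    for chunk in chunks:
        chunk.embedding = embed(chunk.text)
    return chunks


chunks = refine_with_embeddings(
    [SimpleChunk("Hello"), SimpleChunk("Chonkie")], toy_embed
)
for chunk in chunks:
    print(chunk.text, chunk.embedding)
```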

What's Changed

  • Add initial support for chunking code via CodeChunker by @chonknick in #53
  • Add Initial support for code chunking via CodeChunker by @chonknick in #54
  • Feat: Add Jina AI Embeddings support by @Harsha-Karimikonda in #35
  • Feat: Add support for JinaEmbeddings + OverlapRefinery + EmbeddingsRefinery by @chonknick in #57
  • Fix: Paths for the Chonkie Cloud chunkers; No module named 'chonkie.cloud.chunkers' error by @chonknick in #58
  • Fix: Attempt fixing chonkie.cloud path bug + update README.md to have integrations by @chonknick in #60

Full Changelog: v1.0.3...v1.0.4

v1.0.3

14 Apr 12:08
6ddeba2

✨ Highlights

  • The new Chonkie Visualizer is here! You can now view your chunks, understand chunk quality and debug your chunker with visual feedback~ Use the print method to print rich text on your terminal or use the save method to save a highlighted HTML file on your device! It's very simple to use, just pass in your chunks~
from chonkie import Visualizer

viz = Visualizer()

# Print the chunks on the terminal with .print or directly call the Visualizer object too
viz.print(chunks) 

# Save the HTML file
viz.save("chonkie.html", chunks)
  • Chonkie now adds support for Recipes which allow you to use multilingual chunking out of the box, as well as document specific chunking methods. Initial support starts with: en, hi, zh, jp and ko, while document type markdown is supported too. Use it via the from_recipe class method with any chunker that takes delimiters or RecursiveRules.
from chonkie import RecursiveChunker

# Initialize the recursive chunker to chunk Markdown
chunker = RecursiveChunker.from_recipe("markdown", lang="en")

# Initialize the recursive chunker to chunk Hindi texts
chunker = RecursiveChunker.from_recipe(lang="hi")
  • Additional fixes for performance enhancements in RecursiveChunker, SentenceChunker, and WordTokenizer

What's Changed

  • Fix: Refactor SentenceChunker to remove estimate + feedback by @chonknick in #23
  • Fix: RecursiveRules.from_dict() showing Key error: 'levels' does not exist because of .pop() by @chonknick in #24
  • Fix: Remove .find for indexing in the RecursiveChunker for better efficiency by @Pratik960 in #26
  • Feat: Add Initial support for Recipes and multilingual CHONKs! by @chonknick in #27
  • Fix: RecursiveChunker whitespace splitting is not reconstructable (missing spaces) + (#26, #27) by @chonknick in #28
  • Update version to 1.0.3a1 in pyproject.toml and init.py for Chonkie by @chonknick in #29
  • Fix: WordTokenizer.count_tokens should use .tokenize instead of .encode by @chonknick in #36
  • Add Chonkie Vizard — easy chunk visualization with Visualizer by @chonknick in #39
  • Feat: Add Chonkie Vizard to main! Visualizer class for easy chunk visualizations by @chonknick in #40
  • Fix: Add chonkie.utils module to package list in pyproject.toml by @chonknick in #41
  • Fix: chonkie.cloud does not contain chunkers error + bump up version to v1.0.3 by @chonknick in #43
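
The KeyError mentioned in #24 is typical of reading a dictionary with .pop(): the call removes the key from the caller's dictionary, so a later read of the same dictionary fails. The sketch below reproduces that pattern with hypothetical helper functions. Whether it matches the exact code path fixed in RecursiveRules.from_dict() is an assumption.

```python
from typing import Any, Dict, List


def levels_from_dict_mutating(data: Dict[str, Any]) -> List[Any]:
    """Reads 'levels' with .pop(), which removes the key from the caller's dict."""
    return data.pop("levels")


def levels_from_dict_safe(data: Dict[str, Any]) -> List[Any]:
    """Reads 'levels' by indexing, leaving the caller's dict untouched."""
    return list(data["levels"])


rules = {"levels": [{"delimiters": ["\n\n"]}, {"delimiters": [". "]}]}

# The non-mutating version can be called repeatedly on the same dict.
levels_from_dict_safe(rules)
safe_levels = levels_from_dict_safe(rules)

# The mutating version works once, then 'levels' is gone.
levels_from_dict_mutating(rules)
try:
    levels_from_dict_mutating(rules)
    second_call_failed = False
except KeyError:
    second_call_failed = True

print(second_call_failed)
```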

Full Changelog: v1.0.2...v1.0.3

v1.0.3a1

10 Apr 11:07
a9efa15
Pre-release

What's Changed

  • Fix: Remove .find for indexing in the RecursiveChunker for better efficiency by @Pratik960 in #26
  • Feat: Add Initial support for Recipes and multilingual CHONKs! by @chonknick in #27
  • Fix: RecursiveChunker whitespace splitting is not reconstructable (missing spaces) + (#26, #27) by @chonknick in #28
  • Update version to 1.0.3a1 in pyproject.toml and init.py for Chonkie by @chonknick in #29
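
The "not reconstructable" problem in #28 can be shown in a few lines: splitting on whitespace with str.split() discards the whitespace itself, so the pieces no longer join back into the original text. Keeping the trailing whitespace attached to each piece avoids this. The sketch below is a generic illustration of the property, not the RecursiveChunker implementation.

```python
import re

text = "Chonkie  chunks text.\nFast!"

# Lossy: str.split() drops the whitespace, so the original spacing is gone.
lossy_parts = text.split()
lossy_roundtrip = " ".join(lossy_parts)

# Reconstructable: each piece keeps its trailing whitespace
# (leading whitespace, if any, becomes its own piece).
parts = re.findall(r"\s+|\S+\s*", text)
roundtrip = "".join(parts)

print(lossy_roundtrip == text)
print(roundtrip == text)
```

With reconstructable pieces, character offsets into the original text stay valid for every chunk.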

Full Changelog: v1.0.3a0...v1.0.3a1
