Chunking - Optimizing Vector Database Data Preparation

A critical component in the RAG ecosystem is document chunking: the process of splitting text into manageable pieces that can be embedded into your vector database. Most people choose one chunking method and stick with it, but what if there's a best method?

Researchers at ChromaDB evaluated many variations of popular chunking methods, and created some new ones, to try to find the best overall method for preparing unstructured text data for downstream RAG applications.

We'll be putting their latest research, "Evaluating Chunking Strategies for Retrieval", to the test to show how each strategy works and to find the best one for us.

This will cover:

Character/Token Based Chunking
Recursive Character/Token Based Chunking
Semantic Chunking
Cluster Semantic Chunking
LLM Semantic Chunking
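
To make the first two strategies in the list concrete, here is a minimal sketch in plain Python. It is not the implementation used in the research or in any particular library: the function names `chunk_by_characters` and `chunk_recursively`, the default sizes, and the separator order are all assumptions chosen for illustration. Sizes are measured in characters. A token-based variant would count tokens from a tokenizer instead, and the semantic variants additionally need an embedding model or an LLM, so they are not shown here.

```python
def chunk_by_characters(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Fixed-size chunking: slide a window of `chunk_size` characters,
    stepping forward by `chunk_size - overlap` each time."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the window has reached the end of the text, so the
        # final chunk is never just a repeat of the previous overlap.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


def chunk_recursively(
    text: str,
    chunk_size: int = 200,
    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " "),  # assumed order
) -> list[str]:
    """Recursive chunking: split on the coarsest separator first, merge
    neighbouring pieces while they fit, and recurse with finer separators
    on any piece that is still too long."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to a hard cut with no overlap.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    chunks = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # piece still fits in the current chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            chunks.extend(chunk_recursively(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = "First paragraph here.\n\nSecond paragraph is a little bit longer."
    print(chunk_by_characters(sample, chunk_size=30, overlap=5))
    print(chunk_recursively(sample, chunk_size=30))
```

The fixed-size version cuts wherever the window ends, even mid-word, while the recursive version prefers paragraph, line, sentence, and word boundaries in that order. That difference is what the comparison between the two strategies is about.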

ALucek/chunking-strategies
