A critical component in the RAG ecosystem is document chunking: the process of splitting text into manageable pieces that can be embedded and stored in your vector database. Most people choose one chunking method and stick with it, but what if there's a best method?
Researchers at ChromaDB evaluated many variations of popular chunking methods, and created some new ones, to find the best overall method for preparing unstructured text data for downstream RAG applications.
We'll put their latest research, "Evaluating Chunking Strategies for Retrieval", to the test to show how each strategy works and find the one that works best for us.
This will cover:
- Character/Token Based Chunking
- Recursive Character/Token Based Chunking
- Semantic Chunking
- Cluster Semantic Chunking
- LLM Semantic Chunking
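
To make the idea of chunking concrete before comparing strategies, here is a minimal sketch of the simplest approach in the list above, fixed-size character chunking with optional overlap. The function name `chunk_by_characters` and its parameters `chunk_size` and `overlap` are illustrative assumptions, not taken from the ChromaDB research or any particular library.

```python
def chunk_by_characters(text: str, chunk_size: int = 200, overlap: int = 0) -> list[str]:
    """Split text into fixed-size character windows (hypothetical helper).

    Consecutive chunks share `overlap` characters, so content that would
    be cut at a boundary also appears at the start of the next chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text,
        # so we do not emit a trailing chunk made only of overlap.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


if __name__ == "__main__":
    for chunk in chunk_by_characters("abcdefghij", chunk_size=4, overlap=2):
        print(chunk)
```

This approach ignores sentence and paragraph boundaries entirely, which is the limitation the recursive and semantic strategies try to address.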