ChNkr is a high-performance NLP library for splitting text into chunks for large language model (LLM) and retrieval-augmented generation (RAG) workflows. Built with speed and flexibility in mind, ChNkr supports multiple chunking styles, configurable overlap, and integration with popular vector and graph stores such as Chroma, Pinecone, and Neo4j.
- Flexible Chunking Styles: Choose from fixed-token, semantic-aware, or custom chunking methods.
- Overlap Support: Ensure context continuity between chunks with adjustable overlap settings.
- Vector Store Integration: Direct support for Chroma, Pinecone, and Neo4j.
- Custom Models: Support for user-provided semantic models for advanced chunking customization.
- Blazing Fast Performance: Built using Cython/C++ for maximum speed.
- Ease of Use: Python-friendly API with detailed documentation and examples.
- Python 3.8+
- GCC/Clang (for building Cython/C++ components)
- Pip or equivalent package manager
Install from PyPI:

```bash
pip install chnkr
```

Or install from source:

```bash
git clone https://github.com/yourusername/chnkr.git
cd chnkr
pip install .
```
Basic usage:

```python
from chnkr import Chunker

# Initialize the Chunker with your preferred style
chunker = Chunker(style="semantic", overlap=50)

# Chunk a sample text
text = "Your large text input here..."
chunks = chunker.chunk(text)
print(chunks)
```
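The custom style is meant to work with a user-provided semantic model. The snippet below is only a rough sketch of that hook, not the documented API: the `model` keyword and the sentence-transformers encoder are assumptions, so check the Chunker signature in your installed version.

```python
from chnkr import Chunker
from sentence_transformers import SentenceTransformer

# Hypothetical sketch: the `model` keyword is an assumption for
# illustration and may differ from the actual ChNkr API.
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim embeddings

chunker = Chunker(
    style="custom",
    overlap=30,
    embedding_dim=384,  # match the encoder's output dimension
    model=encoder,      # assumed hook for a user-provided semantic model
)
chunks = chunker.chunk("Your large text input here...")
```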
Chunk a document and store the chunks in a Chroma collection:

```python
from chnkr import Chunker
import chromadb

# Initialize ChNkr
chunker = Chunker(style="fixed", max_tokens=100, overlap=20)

# Chunk your text
text = "Your document text here..."
chunks = chunker.chunk(text)

# Push to Chroma (each chunk is treated as a plain text string)
db_client = chromadb.Client()
collection = db_client.create_collection("my_collection")
for i, chunk in enumerate(chunks):
    collection.add(
        ids=[f"doc_1_chunk_{i}"],
        documents=[chunk],
        metadatas=[{"source": "doc_1"}],
    )
```
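For the retrieval side of a RAG workflow, the stored chunks can be queried back by similarity. A minimal sketch that continues from the example above and reuses the `collection` object (the query text and filter are placeholders):

```python
# Query the collection for the chunks most similar to a question.
results = collection.query(
    query_texts=["What does the document say about chunking?"],
    n_results=3,
    where={"source": "doc_1"},
)
for doc in results["documents"][0]:
    print(doc)
```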
Chunk a document and upsert the chunk vectors into a Pinecone index:

```python
from chnkr import Chunker
import pinecone

# Initialize Pinecone (pinecone-client v2-style API)
pinecone.init(api_key="your-api-key", environment="us-west1-gcp")
index = pinecone.Index("my-index")

# Initialize ChNkr
chunker = Chunker(style="semantic", overlap=50)

# Chunk your text
text = "Your document text here..."
chunks = chunker.chunk(text)

# Push to Pinecone as (id, vector) tuples
for chunk in chunks:
    index.upsert([(chunk.id, chunk.vector)])
```
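At query time, you can attach the chunk text as metadata and search the index with an embedding of the question. The sketch below continues from the example above; `embed_query` is a hypothetical helper standing in for whatever model produced the chunk vectors, and `chunk.content` is assumed to carry the chunk text as in the Neo4j example.

```python
# Upsert with metadata so the chunk text can be returned at query time.
# Assumes each chunk exposes `id`, `vector`, and `content`.
for chunk in chunks:
    index.upsert([(chunk.id, chunk.vector, {"text": chunk.content})])

# Search with an embedding of the user's question.
# `embed_query` is a hypothetical helper; use the same model that produced
# the chunk vectors so the dimensions match.
query_vector = embed_query("What does the document say about chunking?")
results = index.query(vector=query_vector, top_k=3, include_metadata=True)
print(results)
```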
Chunk a document and store each chunk as a node in Neo4j:

```python
from chnkr import Chunker
from neo4j import GraphDatabase

# Initialize the Neo4j driver
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Initialize ChNkr
chunker = Chunker(style="fixed", max_tokens=200)

# Chunk your text
text = "Your document text here..."
chunks = chunker.chunk(text)

# Push each chunk to Neo4j as a :Chunk node
# (Neo4j property values must be primitives, so chunk.metadata is assumed
# to be a string or other primitive here)
def add_chunk(tx, chunk):
    tx.run(
        "CREATE (c:Chunk {content: $content, metadata: $metadata})",
        content=chunk.content,
        metadata=chunk.metadata,
    )

with driver.session() as session:
    for chunk in chunks:
        session.execute_write(add_chunk, chunk)
```
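Chunks stored this way can be read back with a plain Cypher query. A minimal sketch, reusing the driver from the example above:

```python
# Read the stored chunks back from Neo4j.
def get_chunks(tx):
    result = tx.run("MATCH (c:Chunk) RETURN c.content AS content")
    return [record["content"] for record in result]

with driver.session() as session:
    stored = session.execute_read(get_chunks)
    print(f"{len(stored)} chunks stored")
```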
| Parameter | Description | Default Value |
|---|---|---|
| `style` | Chunking style (`fixed`, `semantic`, `custom`) | `fixed` |
| `max_tokens` | Maximum tokens per chunk (for `fixed` style) | `100` |
| `overlap` | Overlap size between chunks (in tokens) | `0` |
| `embedding_dim` | Embedding size for semantic chunking | `768` |
- Clone the repository:

  ```bash
  git clone https://github.com/yourusername/chnkr.git
  cd chnkr
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Build the Cython/C++ components:

  ```bash
  python setup.py build_ext --inplace
  ```

- Run the test suite:

  ```bash
  pytest tests/
  ```
- Add support for more vector stores (e.g., Weaviate, Redis).
- Implement advanced chunking styles (e.g., topic-based chunking, language-aware chunking).
- Extend support for additional languages beyond English.
- Add CLI support for quick operations from the terminal.
We welcome contributions! Please read our CONTRIBUTING.md for details on how to contribute.
ChNkr is licensed under the MIT License. See LICENSE for details.