ChNkr

ChNkr is a high-performance NLP library designed to enhance text chunking for large language models (LLMs) and retrieval-augmented generation (RAG) workflows. Built with speed and flexibility in mind, ChNkr supports multiple chunking styles, overlap options, and seamless integration with popular vector databases like Chroma, Pinecone, and Neo4j.

Features

Flexible Chunking Styles: Choose from fixed-token, semantic-aware, or custom chunking methods.
Overlap Support: Ensure context continuity between chunks with adjustable overlap settings.
Vector Store Integration: Direct support for Chroma, Pinecone, and Neo4j.
Custom Models: Support for user-provided semantic models for advanced chunking customization.
Blazing Fast Performance: Built using Cython/C++ for maximum speed.
Ease of Use: Python-friendly API with detailed documentation and examples.
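
To make the fixed-token style and the overlap setting concrete, here is a minimal sketch in plain Python. It is not ChNkr's implementation: `fixed_chunks` is a hypothetical helper, and it treats whitespace-separated words as tokens, which a real tokenizer would not.

```python
from typing import List


def fixed_chunks(text: str, max_tokens: int = 100, overlap: int = 0) -> List[str]:
    """Split text into windows of at most max_tokens tokens.

    Hypothetical helper for illustration; tokens are whitespace-separated words.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")

    tokens = text.split()
    stride = max_tokens - overlap  # how far the window advances each step
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # the window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = "a b c d e f g h i j"
    print(fixed_chunks(sample, max_tokens=4, overlap=1))
    # ['a b c d', 'd e f g', 'g h i j']
```

Each chunk repeats the last `overlap` tokens of the previous one, which is what keeps context continuous across chunk boundaries.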

Installation

Prerequisites

Python 3.8+
GCC/Clang (for building Cython/C++ components)
Pip or equivalent package manager

Install from PyPI

pip install chnkr

Build from Source

git clone https://github.com/Swish78/chNkr.git
cd chNkr
pip install .

Quick Start

Basic Usage

from chnkr import Chunker

# Initialize the Chunker with your preferred style
chunker = Chunker(style="semantic", overlap=50)

# Chunk a sample text
text = "Your large text input here..."
chunks = chunker.chunk(text)
print(chunks)
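
The `semantic` style groups text by meaning rather than by a fixed window. One common way to do that, sketched here under assumptions, is to embed each sentence and start a new chunk when similarity to the previous sentence drops below a threshold. The `embed` function is a toy bag-of-words stand-in, not the model ChNkr uses, and `semantic_chunks` and `threshold` are hypothetical names.

```python
import math
import re
from collections import Counter
from typing import List


def embed(sentence: str) -> Counter:
    """Toy embedding: lowercase word counts (a stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(text: str, threshold: float = 0.2) -> List[str]:
    """Group consecutive sentences; split where similarity falls below threshold."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    groups = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(cur)) >= threshold:
            groups[-1].append(cur)  # similar to the previous sentence: extend the chunk
        else:
            groups.append([cur])    # dissimilar: start a new chunk
    return [" ".join(group) for group in groups]


if __name__ == "__main__":
    doc = "Cats purr when happy. Cats also purr when stressed. Stock markets fell today."
    print(semantic_chunks(doc))
    # ['Cats purr when happy. Cats also purr when stressed.', 'Stock markets fell today.']
```

A real semantic chunker would swap `embed` for a dense sentence-embedding model; the grouping loop stays the same.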

Integration with Chroma

from chnkr import Chunker
import chromadb

# Initialize ChNkr
chunker = Chunker(style="fixed", max_tokens=100, overlap=20)

# Chunk your text
text = "Your document text here..."
chunks = chunker.chunk(text)

# Push to Chroma (add() takes lists and requires a unique id per document)
db_client = chromadb.Client()
collection = db_client.create_collection("my_collection")
for i, chunk in enumerate(chunks):
    collection.add(
        ids=[f"doc_1-{i}"],
        documents=[chunk],
        metadatas=[{"source": "doc_1"}],
    )
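
Chroma's `collection.add` accepts parallel lists, so all chunks can go in one call rather than one call per chunk. The sketch below only builds those lists. `build_chroma_records`, the id scheme, and the `position` metadata field are hypothetical, and the helper assumes each chunk is a plain string.

```python
from typing import Dict, List, Sequence


def build_chroma_records(chunks: Sequence[str], source: str) -> Dict[str, List]:
    """Build keyword arguments for a single collection.add(**records) call."""
    return {
        # Chroma requires one unique id per document
        "ids": [f"{source}-{i}" for i in range(len(chunks))],
        "documents": list(chunks),
        # Record where each chunk came from and its position in the document
        "metadatas": [{"source": source, "position": i} for i in range(len(chunks))],
    }


if __name__ == "__main__":
    records = build_chroma_records(["first chunk", "second chunk"], source="doc_1")
    print(records["ids"])
    # ['doc_1-0', 'doc_1-1']
    # With a real collection: collection.add(**records)
```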

Integration with Pinecone

from chnkr import Chunker
from pinecone import Pinecone

# Initialize Pinecone (pinecone client v3 or later)
pc = Pinecone(api_key="your-api-key")
index = pc.Index("my-index")

# Initialize ChNkr
chunker = Chunker(style="semantic", overlap=50)

# Chunk your text
text = "Your document text here..."
chunks = chunker.chunk(text)

# Push to Pinecone
for chunk in chunks:
    index.upsert([(chunk.id, chunk.vector)])
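
Upserting one vector per request is slow for large documents. A common pattern is to send vectors in batches; the sketch below shows only the batching step. `batched` is a hypothetical helper, and the batch size is an illustrative choice, not a ChNkr or Pinecone setting.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

Vector = Tuple[str, List[float]]  # (id, values) pairs, as in the loop above


def batched(items: Iterable[Vector], size: int) -> Iterator[List[Vector]]:
    """Yield successive lists of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch


if __name__ == "__main__":
    vectors = [(f"chunk-{i}", [0.1 * i, 0.2 * i]) for i in range(5)]
    for batch in batched(vectors, size=2):
        # With a real index: index.upsert(vectors=batch)
        print([vec_id for vec_id, _ in batch])
    # ['chunk-0', 'chunk-1']
    # ['chunk-2', 'chunk-3']
    # ['chunk-4']
```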

Integration with Neo4j

from chnkr import Chunker
from neo4j import GraphDatabase

# Initialize Neo4j driver
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Initialize ChNkr
chunker = Chunker(style="fixed", max_tokens=200)

# Chunk your text
text = "Your document text here..."
chunks = chunker.chunk(text)

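# Push to Neo4j
def add_chunk(tx, chunk):
    # Neo4j properties cannot store maps, so chunk.metadata must be a primitive or a list of primitives
    tx.run(
        "CREATE (c:Chunk {content: $content, metadata: $metadata})",
        content=chunk.content,
        metadata=chunk.metadata,
    )

with driver.session() as session:
    for chunk in chunks:
        # execute_write is the Neo4j driver 5.x name for write_transaction
        session.execute_write(add_chunk, chunk)

driver.close()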
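
Creating one node per transaction works but adds a round trip per chunk. A sketch of a batched alternative follows: it prepares one parameter list and a single Cypher `UNWIND` statement. `chunk_rows`, `CREATE_CHUNKS`, and the `position` property are illustrative names, not part of ChNkr, and the helper assumes the chunk texts are available as plain strings.

```python
from typing import Any, Dict, List, Sequence

# One statement creates every node from the $rows parameter
CREATE_CHUNKS = (
    "UNWIND $rows AS row "
    "CREATE (c:Chunk {content: row.content, position: row.position})"
)


def chunk_rows(contents: Sequence[str]) -> List[Dict[str, Any]]:
    """Turn chunk texts into parameter rows with their order preserved."""
    return [{"content": text, "position": i} for i, text in enumerate(contents)]


if __name__ == "__main__":
    rows = chunk_rows(["first chunk", "second chunk"])
    print(rows)
    # [{'content': 'first chunk', 'position': 0}, {'content': 'second chunk', 'position': 1}]
    # With a real session: session.run(CREATE_CHUNKS, rows=rows)
```

Storing `position` on each node keeps the original chunk order recoverable, since node creation order is not something queries should rely on.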

Configuration Options

| Parameter | Description | Default Value |
|---|---|---|
| `style` | Chunking style (`fixed`, `semantic`, `custom`) | `fixed` |
| `max_tokens` | Maximum tokens per chunk (for `fixed` style) | `100` |
| `overlap` | Overlap size between chunks (in tokens) | `0` |
| `embedding_dim` | Embedding size for semantic chunking | `768` |
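
For the `fixed` style, `max_tokens` and `overlap` together determine how many chunks a document yields, assuming each chunk advances by `max_tokens - overlap` tokens. That stride is an assumption about the windowing, not confirmed ChNkr behavior, and `expected_chunk_count` is a hypothetical helper.

```python
import math


def expected_chunk_count(n_tokens: int, max_tokens: int = 100, overlap: int = 0) -> int:
    """Number of windows needed to cover n_tokens with the given overlap."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")
    if n_tokens <= 0:
        return 0
    if n_tokens <= max_tokens:
        return 1
    stride = max_tokens - overlap
    # The first window covers max_tokens; each later one adds `stride` new tokens
    return 1 + math.ceil((n_tokens - max_tokens) / stride)


if __name__ == "__main__":
    print(expected_chunk_count(1000, max_tokens=100, overlap=20))
    # 13
```

Under this assumption, raising `overlap` increases the chunk count and therefore storage and embedding cost, so it is worth estimating before indexing a large corpus.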

Development

Setting Up for Development

Clone the repository:

git clone https://github.com/Swish78/chNkr.git
cd chNkr

Install dependencies:
```
pip install -r requirements.txt
```
Build the Cython/C++ components:
```
python setup.py build_ext --inplace
```

Running Tests

pytest tests/

Roadmap

Add support for more vector stores (e.g., Weaviate, Redis).
Implement advanced chunking styles (e.g., topic-based chunking, language-aware chunking).
Extend support for additional languages beyond English.
Add CLI support for quick operations from the terminal.

Contributing

We welcome contributions! Please read our CONTRIBUTING.md for details on how to contribute.

License

ChNkr is licensed under the MIT License. See LICENSE for details.

Acknowledgements

Faiss for semantic chunking.
Pinecone and Chroma for vector store integration.
The open-source NLP community for inspiration and support.

Swish78/chNkr