Feat: Create Cython functions for `split` and `merge` basic ops for chunking! by chonknick · Pull Request #163 · chonkie-inc/chonkie

Merged · 58 commits · May 25, 2025

Commits (58)
9c848aa · Add compiled `.so` and `.c` autogen files (chonknick, May 16, 2025)
3473565 · Fix pyproject.toml issues + add a token_chunker.pyx file for checking… (chonknick, May 16, 2025)
52cc725 · Update DOCS for OpenAIGenie — as an experiment (chonknick, May 16, 2025)
e4c77de · Add Cython-based text splitting functionality and update chunkers to … (chonknick, May 25, 2025)
0f21614 · Add optimized Cython implementation for merging text splits (chonknick, May 25, 2025)
8ce2a36 · Add optimized Cython function for finding merge indices in SentenceCh… (chonknick, May 25, 2025)
fc0ee01 · Refactor setup.py and remove unused token_chunker Cython extension (chonknick, May 25, 2025)
2d4c549 · Remove deprecated test file for Cython token chunker (chonknick, May 25, 2025)
a7ec5bc · Refactor embedding registration and loading logic in AutoEmbeddings a… (chonknick, May 19, 2025)
c29f307 · Add `RAGHub` (chonknick, May 12, 2025)
99a5ff7 · Update src/chonkie/embeddings/registry.py (chonknick, May 20, 2025)
bcd3a49 · Initialize embedding instance in AutoEmbeddings class and add fallbac… (chonknick, May 20, 2025)
1ef5f3c · Refactor embedding registration in EmbeddingsRegistry to use register… (chonknick, May 20, 2025)
30dd5ee · Refactor variable naming in EmbeddingsRegistry to improve clarity, ch… (chonknick, May 20, 2025)
b6ed0a6 · Fix contributing.md (shreyash-chonkie, May 22, 2025)
7df8d08 · Fix: Improve error message on embedding model loading failiure (chonknick, May 22, 2025)
85d32fd · Fix: Add type ignore comment for embeddings class instantiation to su… (chonknick, May 22, 2025)
10015e6 · Fix: Update string representation in Model2VecEmbeddings for improved… (chonknick, May 22, 2025)
6ccba73 · Enhance chunker module by adding NeuralChunker and SlumberChunker to … (chonknick, May 22, 2025)
d96cf8a · Refactor SlumberChunker error handling and improve test coverage for … (chonknick, May 22, 2025)
3adeb7b · Enhance error handling in NeuralChunker by adding exception handling … (chonknick, May 22, 2025)
f53756a · Remove API_KEY from class variables in NeuralChunker, RecursiveChunke… (chonknick, May 22, 2025)
d9ec7ba · Remove API_KEY from SemanticChunker class variable for improved secur… (chonknick, May 22, 2025)
43d8356 · Update error messages in NeuralChunker to provide clearer guidance on… (chonknick, May 22, 2025)
b4c0961 · Update version to 1.0.8 in pyproject.toml and __init__.py for release. (chonknick, May 22, 2025)
ae335af · Add comprehensive tests for OverlapRefinery to improve test coverage (openhands-agent, May 22, 2025)
ecafc20 · Fix pickling issue in BaseTokenizer by using a named method instead o… (openhands-agent, May 22, 2025)
81c42b3 · Add comprehensive tests for tokenizer functionality (chonknick, May 22, 2025)
c6774d0 · Add comprehensive tests for SlumberChunker (chonknick, May 22, 2025)
58d6295 · Refactor availability checks in embeddings and refinery classes (chonknick, May 22, 2025)
c1d9d37 · Enhance tests for SDPMChunker functionality (chonknick, May 22, 2025)
3f3d79b · Add comprehensive tests for Visualizer functionality (chonknick, May 22, 2025)
f338185 · Enhance CloudChunker module by adding CodeChunker (chonknick, May 23, 2025)
72e2515 · Fix: `tokenizer_or_token_counter` should be just str type (chonknick, May 23, 2025)
d71d75d · Update tests/cloud/test_cloud_code_chunker.py (chonknick, May 23, 2025)
8100f11 · Add comprehensive tests for JSONPorter functionality (chonknick, May 23, 2025)
320b0e4 · Refactor test functions in test_json_porter.py to include return type… (chonknick, May 23, 2025)
77ced23 · Refactor and expand tests for AutoEmbeddings functionality (chonknick, May 23, 2025)
22deed2 · Refactor and expand tests for JinaEmbeddings functionality (chonknick, May 23, 2025)
e6c3d49 · Refactor and expand tests for VoyageAIEmbeddings functionality (chonknick, May 23, 2025)
9b972a4 · Disable multiprocessing in CodeChunker for improved performance and s… (chonknick, May 23, 2025)
07b17bf · Refactor and expand tests for cloud chunkers (chonknick, May 23, 2025)
806e14c · Enhance tests for CohereEmbeddings with mocking and API key handling (chonknick, May 23, 2025)
607f8fa · Refactor test imports and error handling in tokenizer and genie tests (chonknick, May 23, 2025)
838b7a1 · Update .gitignore to exclude temporary files (chonknick, May 23, 2025)
3b015b1 · Refactor import statements across multiple modules (chonknick, May 23, 2025)
22c64de · Enhance type hints in embedding test files (chonknick, May 23, 2025)
337f1a1 · Add GeminiEmbeddings support (chonknick, May 23, 2025)
3cba99c · Update documentation and examples for Gemini embeddings support (chonknick, May 23, 2025)
928d83a · Add Gemini Embeddings tutorial to README (chonknick, May 23, 2025)
9fb2e4b · Remove unused timeout parameter from GeminiEmbeddings initialization (chonknick, May 23, 2025)
7f96d34 · Fix: remove unused parameters (chonknick, May 23, 2025)
5037f11 · Enhance type hints in cloud code chunker tests (chonknick, May 25, 2025)
5de0710 · Optimize performance in OverlapRefinery by implementing caching for t… (chonknick, May 25, 2025)
6991089 · Refactor OverlapRefinery to remove cached context size calculation. U… (chonknick, May 25, 2025)
a74fcd1 · Refactor OverlapRefinery to incorporate effective context size parame… (chonknick, May 25, 2025)
c66deab · Enhance OverlapRefinery performance by implementing LRU caching for t… (chonknick, May 25, 2025)
93e34a6 · Merge branch 'development' into test-support-for-cython (chonknick, May 25, 2025)
.gitignore (4 additions, 0 deletions)

@@ -3,6 +3,10 @@ __pycache__
*.py[cod]
*$py.class

+# Compiled files and C files
+*.so
+*.c

# Distribution / packaging
build/*
dist/
DOCS.md (28 additions, 8 deletions)

@@ -913,23 +913,43 @@ print(json_response)

### `OpenAIGenie`

---

The `OpenAIGenie` class provides an interface to interact with OpenAI's models (like GPT-4) or any LLM provider that offers an OpenAI-compatible API endpoint.

-Requires `pip install "chonkie[openai]"`.
**Installation:**

-**Parameters:**
`OpenAIGenie` requires the `openai` optional dependency, which you can install with:

-- `model (str)`: The specific model identifier to use (e.g., `"gpt-4o"`, `"gpt-3.5-turbo"`). Defaults to `"gpt-4.1"`.
-- `base_url (Optional[str])`: The base URL for the API endpoint. If `None`, it defaults to OpenAI's standard API URL. Use this to connect to custom or self-hosted OpenAI-compatible APIs. Defaults to `None`.
-- `api_key (Optional[str])`: Your API key for the service (OpenAI or the custom provider). If not provided, it will attempt to read from the `OPENAI_API_KEY` environment variable. Defaults to `None`.
```bash
pip install "chonkie[openai]"
```

-**Methods:**
**Class Definition:**

-- `generate(prompt: str) -> str`: Sends the prompt to the specified model via the configured endpoint and returns the generated text response.
-- `generate_json(prompt: str, schema: BaseModel) -> Dict[str, Any]`: Sends the prompt and a Pydantic `BaseModel` schema to the model, requesting a JSON output that conforms to the schema using OpenAI's JSON mode (or compatible feature). Returns the parsed JSON as a Python dictionary.
```python
class OpenAIGenie(BaseGenie):

# Class Attributes
    model: str = "gpt-4.1" # The model identifier to use (e.g., "gpt-4o", "gpt-3.5-turbo").
    base_url: Optional[str] = None # Base URL for the API endpoint; None means OpenAI's standard API URL. Use this to connect to custom or self-hosted OpenAI-compatible APIs.
    api_key: Optional[str] = None # API key for the service (OpenAI or the custom provider); read from the OPENAI_API_KEY env var if not provided.
    client: Optional[OpenAI] = None # The OpenAI client instance; a new client is created if None.

# Class Methods
def generate(self, prompt: str) -> str:
"""Sends the prompt to the specified model via the configured endpoint and returns the generated text response."""
...

def generate_json(self, prompt: str, schema: BaseModel) -> Dict[str, Any]:
"""Sends the prompt and a Pydantic BaseModel schema to the model, requesting a JSON output that conforms to the schema."""
...
```
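
For quick orientation, a minimal usage sketch follows. This is hedged: it assumes `OpenAIGenie` is importable from the top-level `chonkie` package and that `OPENAI_API_KEY` is set in the environment; the collapsible examples below are the fuller walkthroughs.

```python
from chonkie import OpenAIGenie  # assumed top-level import path

# With no api_key argument, the key is read from the OPENAI_API_KEY env var.
genie = OpenAIGenie(model="gpt-4.1")
print(genie.generate("Summarize why chunking matters for RAG in one sentence."))
```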

**Examples:**

Here are some examples of how to use the `OpenAIGenie` class.

<details>
<summary><strong>1. Basic Text Generation with `OpenAIGenie` (OpenAI)</strong></summary>

pyproject.toml (9 additions, 4 deletions)

@@ -1,5 +1,5 @@
[build-system]
-requires = ["setuptools>=45", "wheel"]
+requires = ["setuptools>=45", "wheel", "cython>=3.0.0"]
Reviewer comment (Contributor): "style: Consider pinning Cython to a specific version range (e.g. `cython>=3.0.0,<4.0.0`) to prevent future compatibility issues"

build-backend = "setuptools.build_meta"

[project]
@@ -8,7 +8,7 @@ version = "1.0.8"
description = "🦛 CHONK your texts with Chonkie ✨ - The no-nonsense chunking library"
readme = "README.md"
requires-python = ">=3.9"
-license = { file = "LICENSE" }
+license = "MIT"
keywords = [
"chunking",
"rag",
@@ -31,7 +31,6 @@ maintainers = [
]

classifiers = [
"License :: OSI Approved :: MIT License",
"Programming Language :: Python",
"Programming Language :: Python :: 3",
"Programming Language :: Python :: 3.9",
@@ -53,7 +52,6 @@ Homepage = "https://github.com/chonkie-inc/chonkie"
Documentation = "https://docs.chonkie.ai"
"Bug Tracker" = "https://github.com/chonkie-inc/chonkie/issues"


[project.optional-dependencies]
# Optional dependencies for the utils
hub = ["huggingface-hub>=0.24.0", "jsonschema>=4.23.0"]
@@ -114,6 +112,7 @@ dev = [
"coverage",
"ruff>=0.0.265",
"mypy>=1.11.0",
"cython>=3.0.0",
]

[tool.pytest.ini_options]
@@ -138,6 +137,12 @@ packages = [
"chonkie.cloud.chunker",
]

+[tool.setuptools.package-data]
+"chonkie.chunker.c_extensions" = ["*.pyx", "*.pxd"]
+
+[tool.setuptools.dynamic]
+version = {attr = "chonkie.__version__"}

[tool.ruff]
exclude = ["*.ipynb"]
lint.select = ["F", "I", "D", "DOC"]
setup.py (new file, 22 additions)

@@ -0,0 +1,22 @@
"""Setup script for Chonkie's Cython extensions.

This script configures the Cython extensions used in the Chonkie library.
It builds the split and merge extensions.
"""
from Cython.Build import cythonize
from setuptools import Extension, setup

extensions = [
Extension(
"chonkie.chunker.c_extensions.split",
["src/chonkie/chunker/c_extensions/split.pyx"],
),
Extension(
"chonkie.chunker.c_extensions.merge",
["src/chonkie/chunker/c_extensions/merge.pyx"],
)
]

setup(
ext_modules=cythonize(extensions),
)
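
The extensions are a build-time optimization, so a natural companion pattern is to try the compiled fast path and fall back to pure Python when the `.so` files are absent. A hedged sketch of that pattern (illustrative, not necessarily what this PR ships):

```python
# Hedged sketch: graceful fallback for optional Cython extensions.
# The module path mirrors the Extension names declared in setup.py above.
try:
    from chonkie.chunker.c_extensions.merge import _merge_splits, find_merge_indices
    _CYTHON_AVAILABLE = True
except ImportError:
    _CYTHON_AVAILABLE = False  # chunkers keep working via their pure-Python merge logic
```

For local development, the extensions can be compiled in place with `python setup.py build_ext --inplace`; a regular `pip install .` runs this script through the setuptools build backend.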
src/chonkie/chunker/c_extensions/merge.pyx (new file, 250 additions)

@@ -0,0 +1,250 @@
# cython: language_level=3, boundscheck=False, wraparound=False, cdivision=True, nonecheck=False
"""
Optimized Cython implementation of the _merge_splits function for RecursiveChunker.

This module provides a performance-optimized version of the text chunk merging algorithm
that is critical for the RecursiveChunker's performance. The algorithm combines small
text splits into larger chunks while respecting token count limits.

Key Optimizations:
1. C arrays for cumulative count calculations (48% performance improvement)
2. Inline binary search to eliminate function call overhead
3. Maintains identical logic to the original Python implementation

Performance gains: ~50% faster than the original Python implementation.
"""

import cython
from typing import List, Tuple
from libc.stdlib cimport malloc, free


@cython.boundscheck(False)
@cython.wraparound(False)
def _merge_splits(
list splits,
list token_counts,
int chunk_size,
bint combine_whitespace=False
):
"""
Optimized merge_splits implementation that combines text segments into larger chunks.

This function takes a list of text splits and their corresponding token counts,
then intelligently merges them into larger chunks that respect the chunk_size limit.
The algorithm uses a greedy approach with binary search to find optimal merge points.

The implementation maintains identical logic to the original Python version while
using C arrays for cumulative count calculations and inline binary search for
maximum performance.

Algorithm Overview:
1. Build cumulative token counts using C arrays for fast access
2. For each position, use inline binary search to find the furthest merge point
3. Merge splits within the found range using efficient string operations
4. Continue until all splits are processed

Time Complexity: O(n log n) where n is the number of splits
Space Complexity: O(n) for cumulative counts and result storage

Args:
splits (list): List of text segments to merge
token_counts (list): Token count for each corresponding split
chunk_size (int): Maximum allowed tokens per merged chunk
combine_whitespace (bool): Whether to join segments with whitespace.
If True, adds +1 token per join for whitespace.

Returns:
tuple: (merged_segments, merged_token_counts)
- merged_segments: List of merged text chunks
- merged_token_counts: List of token counts for each merged chunk

Raises:
ValueError: If splits and token_counts have different lengths
MemoryError: If C array allocation fails

Example:
        >>> splits = ["Hello ", "world! ", "How ", "are ", "you?"]
        >>> token_counts = [1, 1, 1, 1, 1]
        >>> merged, counts = _merge_splits(splits, token_counts, 4)
        >>> merged
        ['Hello world! How ', 'are you?']
        >>> counts
        [3, 2]
"""
# Declare all C variables at function start
cdef int splits_len = len(splits)
cdef int token_counts_len = len(token_counts)
cdef int* cumulative_counts
cdef int cumulative_sum = 0
cdef int i, token_count_val
cdef int current_index = 0
cdef int current_token_count, required_token_count, index
cdef int left, right, mid # Inline binary search variables
cdef str merged_text
cdef int merged_token_count

# Early exit conditions
if splits_len == 0 or token_counts_len == 0:
return [], []

if splits_len != token_counts_len:
raise ValueError(
f"Number of splits {splits_len} does not match "
f"number of token counts {token_counts_len}"
)

# If all token counts exceed chunk_size, return as-is
if all(count > chunk_size for count in token_counts):
return splits, token_counts

# OPTIMIZATION 1: Use C array for cumulative counts (48% improvement)
# This eliminates Python list overhead for the performance-critical cumulative sums
cumulative_counts = <int*>malloc((splits_len + 1) * sizeof(int))
if cumulative_counts is NULL:
raise MemoryError("Failed to allocate memory for cumulative_counts")

try:
# Build cumulative counts using C array for fast access
cumulative_counts[0] = 0
for i in range(splits_len):
token_count_val = token_counts[i]
if combine_whitespace:
cumulative_sum += token_count_val + 1 # +1 for whitespace token
else:
cumulative_sum += token_count_val
cumulative_counts[i + 1] = cumulative_sum

# Main merging loop - maintains original algorithm logic
merged = []
combined_token_counts = []

while current_index < splits_len:
current_token_count = cumulative_counts[current_index]
required_token_count = current_token_count + chunk_size

# OPTIMIZATION 2: Inline binary search (eliminates function call overhead)
# Original used: index = min(bisect_left(cumulative_counts, required_token_count, lo=current_index) - 1, len(splits))
left = current_index
right = splits_len + 1

# Binary search to find insertion point for required_token_count
while left < right:
mid = (left + right) // 2
if cumulative_counts[mid] < required_token_count:
left = mid + 1
else:
right = mid

# Apply the same logic as original: bisect_left(...) - 1, then min with len(splits)
index = min(left - 1, splits_len)

            # If current_index == index, we need to move to the next index (same as original)
if index == current_index:
index += 1

# Merge splits using the same logic as original
if combine_whitespace:
merged_text = " ".join(splits[current_index:index])
else:
merged_text = "".join(splits[current_index:index])

# Adjust token count (same as original)
merged_token_count = cumulative_counts[min(index, splits_len)] - cumulative_counts[current_index]

# Add merged result to output lists
merged.append(merged_text)
combined_token_counts.append(merged_token_count)

# Move to next unprocessed split
current_index = index

return merged, combined_token_counts

finally:
# Always free allocated C memory
free(cumulative_counts)


@cython.boundscheck(False)
@cython.wraparound(False)
def find_merge_indices(
list token_counts,
int chunk_size,
int start_pos=0
):
"""
Optimized function to find merge indices using cumulative token counts and binary search.

This generic function can be used by multiple chunkers (SentenceChunker, CodeChunker, etc.)
that need to find optimal merge points based on token counts and chunk size limits.

Args:
token_counts (list): List of token counts for each element to be merged
chunk_size (int): Maximum tokens per merged chunk
start_pos (int): Starting position in the token_counts list

Returns:
list: List of indices where merges should occur

Example:
>>> token_counts = [10, 15, 20, 5, 8, 12]
>>> chunk_size = 30
>>> find_merge_indices(token_counts, chunk_size)
        [2, 4, 6]  # Merge [0:2] (25 tokens), [2:4] (25 tokens), [4:6] (20 tokens)
"""
# Declare all C variables at function start
cdef int counts_len = len(token_counts)
cdef int* cumulative_counts
cdef int cumulative_sum = 0
cdef int i, token_count_val
cdef int current_pos = start_pos
cdef int target_cumulative, index
cdef int left, right, mid # Binary search variables

if counts_len == 0:
return []

# Build cumulative counts using C array for fast access
cumulative_counts = <int*>malloc((counts_len + 1) * sizeof(int))
if cumulative_counts is NULL:
raise MemoryError("Failed to allocate memory for cumulative_counts")

try:
cumulative_counts[0] = 0
for i in range(counts_len):
token_count_val = token_counts[i]
cumulative_sum += token_count_val
cumulative_counts[i + 1] = cumulative_sum

# Find merge indices
merge_indices = []

while current_pos < counts_len:
target_cumulative = cumulative_counts[current_pos] + chunk_size

# Inline binary search for insertion point
left = current_pos
right = counts_len + 1

while left < right:
mid = (left + right) // 2
if cumulative_counts[mid] < target_cumulative:
left = mid + 1
else:
right = mid

# Apply same logic as chunkers: bisect_left(...) - 1, then bounds checking
index = min(left - 1, counts_len)

# Ensure we make progress (at least one element per chunk)
if index <= current_pos:
index = current_pos + 1

merge_indices.append(index)
current_pos = index

return merge_indices

finally:
free(cumulative_counts)
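
For readers comparing the compiled code against the chunkers' original logic, here is a pure-Python sketch of the same greedy merge built on `bisect_left`. It mirrors `_merge_splits` above as a reading aid; it is not code from this PR.

```python
from bisect import bisect_left
from itertools import accumulate

def merge_splits_py(splits, token_counts, chunk_size, combine_whitespace=False):
    """Pure-Python reference for the Cython _merge_splits (illustrative)."""
    step = 1 if combine_whitespace else 0
    # Cumulative token counts, +1 per element when joins cost a whitespace token.
    cumulative = [0] + list(accumulate(c + step for c in token_counts))
    sep = " " if combine_whitespace else ""
    merged, merged_counts = [], []
    i = 0
    while i < len(splits):
        # Furthest split range whose cumulative count stays under the budget.
        j = min(bisect_left(cumulative, cumulative[i] + chunk_size, lo=i) - 1,
                len(splits))
        if j == i:  # always make progress, even past an oversized split
            j += 1
        merged.append(sep.join(splits[i:j]))
        merged_counts.append(cumulative[min(j, len(splits))] - cumulative[i])
        i = j
    return merged, merged_counts
```

On the docstring example, `merge_splits_py(["Hello ", "world! ", "How ", "are ", "you?"], [1, 1, 1, 1, 1], 4)` returns `(['Hello world! How ', 'are you?'], [3, 2])`, matching the compiled `_merge_splits`; the same cumulative-count-plus-binary-search idea underlies `find_merge_indices`.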