Feat: Create Cython functions for split and merge basic ops for chunking! #163
Merged
Commits
Changes from all commits (58 commits)
Add compiled `.so` and `.c` autogen files
chonknick 3473565
Fix pyproject.toml issues + add a token_chunker.pyx file for checking…
chonknick 52cc725
Update DOCS for OpenAIGenie — as an experiment
chonknick e4c77de
Add Cython-based text splitting functionality and update chunkers to …
chonknick 0f21614
Add optimized Cython implementation for merging text splits
chonknick 8ce2a36
Add optimized Cython function for finding merge indices in SentenceCh…
chonknick fc0ee01
Refactor setup.py and remove unused token_chunker Cython extension
chonknick 2d4c549
Remove deprecated test file for Cython token chunker
chonknick a7ec5bc
Refactor embedding registration and loading logic in AutoEmbeddings a…
chonknick c29f307
Add `RAGHub`
chonknick 99a5ff7
Update src/chonkie/embeddings/registry.py
chonknick bcd3a49
Initialize embedding instance in AutoEmbeddings class and add fallbac…
chonknick 1ef5f3c
Refactor embedding registration in EmbeddingsRegistry to use register…
chonknick 30dd5ee
Refactor variable naming in EmbeddingsRegistry to improve clarity, ch…
chonknick b6ed0a6
Fix contributing.md
shreyash-chonkie 7df8d08
Fix: Improve error message on embedding model loading failure
chonknick 85d32fd
Fix: Add type ignore comment for embeddings class instantiation to su…
chonknick 10015e6
Fix: Update string representation in Model2VecEmbeddings for improved…
chonknick 6ccba73
Enhance chunker module by adding NeuralChunker and SlumberChunker to …
chonknick d96cf8a
Refactor SlumberChunker error handling and improve test coverage for …
chonknick 3adeb7b
Enhance error handling in NeuralChunker by adding exception handling …
chonknick f53756a
Remove API_KEY from class variables in NeuralChunker, RecursiveChunke…
chonknick d9ec7ba
Remove API_KEY from SemanticChunker class variable for improved secur…
chonknick 43d8356
Update error messages in NeuralChunker to provide clearer guidance on…
chonknick b4c0961
Update version to 1.0.8 in pyproject.toml and __init__.py for release.
chonknick ae335af
Add comprehensive tests for OverlapRefinery to improve test coverage
openhands-agent ecafc20
Fix pickling issue in BaseTokenizer by using a named method instead o…
openhands-agent 81c42b3
Add comprehensive tests for tokenizer functionality
chonknick c6774d0
Add comprehensive tests for SlumberChunker
chonknick 58d6295
Refactor availability checks in embeddings and refinery classes
chonknick c1d9d37
Enhance tests for SDPMChunker functionality
chonknick 3f3d79b
Add comprehensive tests for Visualizer functionality
chonknick f338185
Enhance CloudChunker module by adding CodeChunker
chonknick 72e2515
Fix: `tokenizer_or_token_counter` should be just str type
chonknick d71d75d
Update tests/cloud/test_cloud_code_chunker.py
chonknick 8100f11
Add comprehensive tests for JSONPorter functionality
chonknick 320b0e4
Refactor test functions in test_json_porter.py to include return type…
chonknick 77ced23
Refactor and expand tests for AutoEmbeddings functionality
chonknick 22deed2
Refactor and expand tests for JinaEmbeddings functionality
chonknick e6c3d49
Refactor and expand tests for VoyageAIEmbeddings functionality
chonknick 9b972a4
Disable multiprocessing in CodeChunker for improved performance and s…
chonknick 07b17bf
Refactor and expand tests for cloud chunkers
chonknick 806e14c
Enhance tests for CohereEmbeddings with mocking and API key handling
chonknick 607f8fa
Refactor test imports and error handling in tokenizer and genie tests
chonknick 838b7a1
Update .gitignore to exclude temporary files
chonknick 3b015b1
Refactor import statements across multiple modules
chonknick 22c64de
Enhance type hints in embedding test files
chonknick 337f1a1
Add GeminiEmbeddings support
chonknick 3cba99c
Update documentation and examples for Gemini embeddings support
chonknick 928d83a
Add Gemini Embeddings tutorial to README
chonknick 9fb2e4b
Remove unused timeout parameter from GeminiEmbeddings initialization
chonknick 7f96d34
Fix: remove unused parameters
chonknick 5037f11
Enhance type hints in cloud code chunker tests
chonknick 5de0710
Optimize performance in OverlapRefinery by implementing caching for t…
chonknick 6991089
Refactor OverlapRefinery to remove cached context size calculation. U…
chonknick a74fcd1
Refactor OverlapRefinery to incorporate effective context size parame…
chonknick c66deab
Enhance OverlapRefinery performance by implementing LRU caching for t…
chonknick 93e34a6
Merge branch 'development' into test-support-for-cython
chonknick 9c848aa
setup.py
@@ -0,0 +1,22 @@
"""Setup script for Chonkie's Cython extensions.

This script configures the Cython extensions used in the Chonkie library.
It includes the split and merge extensions.
"""
from Cython.Build import cythonize
from setuptools import Extension, setup

extensions = [
    Extension(
        "chonkie.chunker.c_extensions.split",
        ["src/chonkie/chunker/c_extensions/split.pyx"],
    ),
    Extension(
        "chonkie.chunker.c_extensions.merge",
        ["src/chonkie/chunker/c_extensions/merge.pyx"],
    ),
]

setup(
    ext_modules=cythonize(extensions),
)
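
As a quick sanity check after a local build (a typical invocation would be `python setup.py build_ext --inplace`; the repo's documented build workflow isn't shown in this diff), the following hypothetical smoke test imports the compiled modules using the Extension names declared above:

# Hypothetical smoke test: verify that the compiled extensions import.
# Module paths are taken from the Extension names in setup.py.
import importlib

for name in (
    "chonkie.chunker.c_extensions.split",
    "chonkie.chunker.c_extensions.merge",
):
    module = importlib.import_module(name)  # raises ImportError if the build failed
    print(f"loaded {name} from {module.__file__}")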
src/chonkie/chunker/c_extensions/merge.pyx
@@ -0,0 +1,250 @@
# cython: language_level=3, boundscheck=False, wraparound=False, cdivision=True, nonecheck=False
"""
Optimized Cython implementation of the _merge_splits function for RecursiveChunker.

This module provides a performance-optimized version of the text chunk merging algorithm
that is critical for the RecursiveChunker's performance. The algorithm combines small
text splits into larger chunks while respecting token count limits.

Key Optimizations:
1. C arrays for cumulative count calculations (48% performance improvement)
2. Inline binary search to eliminate function call overhead
3. Maintains identical logic to the original Python implementation

Performance gains: ~50% faster than the original Python implementation.
"""

import cython
from typing import List, Tuple
from libc.stdlib cimport malloc, free


@cython.boundscheck(False)
@cython.wraparound(False)
def _merge_splits(
    list splits,
    list token_counts,
    int chunk_size,
    bint combine_whitespace=False
):
    """
    Optimized merge_splits implementation that combines text segments into larger chunks.

    This function takes a list of text splits and their corresponding token counts,
    then intelligently merges them into larger chunks that respect the chunk_size limit.
    The algorithm uses a greedy approach with binary search to find optimal merge points.

    The implementation maintains identical logic to the original Python version while
    using C arrays for cumulative count calculations and inline binary search for
    maximum performance.

    Algorithm Overview:
    1. Build cumulative token counts using C arrays for fast access
    2. For each position, use inline binary search to find the furthest merge point
    3. Merge splits within the found range using efficient string operations
    4. Continue until all splits are processed

    Time Complexity: O(n log n) where n is the number of splits
    Space Complexity: O(n) for cumulative counts and result storage

    Args:
        splits (list): List of text segments to merge
        token_counts (list): Token count for each corresponding split
        chunk_size (int): Maximum allowed tokens per merged chunk
        combine_whitespace (bool): Whether to join segments with whitespace.
            If True, adds +1 token per join for whitespace.

    Returns:
        tuple: (merged_segments, merged_token_counts)
            - merged_segments: List of merged text chunks
            - merged_token_counts: List of token counts for each merged chunk

    Raises:
        ValueError: If splits and token_counts have different lengths
        MemoryError: If C array allocation fails

    Example:
        With combine_whitespace=False (the default), segments are joined directly:

        >>> splits = ["Hello", "world", "!", "How", "are", "you", "?"]
        >>> token_counts = [1, 1, 1, 1, 1, 1, 1]
        >>> merged, counts = _merge_splits(splits, token_counts, 3)
        >>> merged
        ['Helloworld', '!How', 'areyou', '?']
        >>> counts
        [2, 2, 2, 1]
    """
    # Declare all C variables at function start
    cdef int splits_len = len(splits)
    cdef int token_counts_len = len(token_counts)
    cdef int* cumulative_counts
    cdef int cumulative_sum = 0
    cdef int i, token_count_val
    cdef int current_index = 0
    cdef int current_token_count, required_token_count, index
    cdef int left, right, mid  # Inline binary search variables
    cdef str merged_text
    cdef int merged_token_count

    # Early exit conditions
    if splits_len == 0 or token_counts_len == 0:
        return [], []

    if splits_len != token_counts_len:
        raise ValueError(
            f"Number of splits {splits_len} does not match "
            f"number of token counts {token_counts_len}"
        )

    # If all token counts exceed chunk_size, return as-is
    if all(count > chunk_size for count in token_counts):
        return splits, token_counts

    # OPTIMIZATION 1: Use C array for cumulative counts (48% improvement)
    # This eliminates Python list overhead for the performance-critical cumulative sums
    cumulative_counts = <int*>malloc((splits_len + 1) * sizeof(int))
    if cumulative_counts is NULL:
        raise MemoryError("Failed to allocate memory for cumulative_counts")

    try:
        # Build cumulative counts using C array for fast access
        cumulative_counts[0] = 0
        for i in range(splits_len):
            token_count_val = token_counts[i]
            if combine_whitespace:
                cumulative_sum += token_count_val + 1  # +1 for whitespace token
            else:
                cumulative_sum += token_count_val
            cumulative_counts[i + 1] = cumulative_sum

        # Main merging loop - maintains original algorithm logic
        merged = []
        combined_token_counts = []

        while current_index < splits_len:
            current_token_count = cumulative_counts[current_index]
            required_token_count = current_token_count + chunk_size

            # OPTIMIZATION 2: Inline binary search (eliminates function call overhead)
            # Original used: index = min(bisect_left(cumulative_counts, required_token_count, lo=current_index) - 1, len(splits))
            left = current_index
            right = splits_len + 1

            # Binary search to find insertion point for required_token_count
            while left < right:
                mid = (left + right) // 2
                if cumulative_counts[mid] < required_token_count:
                    left = mid + 1
                else:
                    right = mid

            # Apply the same logic as original: bisect_left(...) - 1, then min with len(splits)
            index = min(left - 1, splits_len)

            # If current_index == index, move to the next index (same as original)
            if index == current_index:
                index += 1

            # Merge splits using the same logic as original
            if combine_whitespace:
                merged_text = " ".join(splits[current_index:index])
            else:
                merged_text = "".join(splits[current_index:index])

            # Adjust token count (same as original)
            merged_token_count = cumulative_counts[min(index, splits_len)] - cumulative_counts[current_index]

            # Add merged result to output lists
            merged.append(merged_text)
            combined_token_counts.append(merged_token_count)

            # Move to next unprocessed split
            current_index = index

        return merged, combined_token_counts

    finally:
        # Always free allocated C memory
        free(cumulative_counts)


@cython.boundscheck(False)
@cython.wraparound(False)
def find_merge_indices(
    list token_counts,
    int chunk_size,
    int start_pos=0
):
    """
    Optimized function to find merge indices using cumulative token counts and binary search.

    This generic function can be used by multiple chunkers (SentenceChunker, CodeChunker, etc.)
    that need to find optimal merge points based on token counts and chunk size limits.

    Args:
        token_counts (list): List of token counts for each element to be merged
        chunk_size (int): Maximum tokens per merged chunk
        start_pos (int): Starting position in the token_counts list

    Returns:
        list: List of indices where merges should occur

    Example:
        >>> token_counts = [10, 15, 20, 5, 8, 12]
        >>> chunk_size = 30
        >>> find_merge_indices(token_counts, chunk_size)
        [2, 4, 6]  # Merge [0:2], [2:4], [4:6]
    """
    # Declare all C variables at function start
    cdef int counts_len = len(token_counts)
    cdef int* cumulative_counts
    cdef int cumulative_sum = 0
    cdef int i, token_count_val
    cdef int current_pos = start_pos
    cdef int target_cumulative, index
    cdef int left, right, mid  # Binary search variables

    if counts_len == 0:
        return []

    # Build cumulative counts using C array for fast access
    cumulative_counts = <int*>malloc((counts_len + 1) * sizeof(int))
    if cumulative_counts is NULL:
        raise MemoryError("Failed to allocate memory for cumulative_counts")

    try:
        cumulative_counts[0] = 0
        for i in range(counts_len):
            token_count_val = token_counts[i]
            cumulative_sum += token_count_val
            cumulative_counts[i + 1] = cumulative_sum

        # Find merge indices
        merge_indices = []

        while current_pos < counts_len:
            target_cumulative = cumulative_counts[current_pos] + chunk_size

            # Inline binary search for insertion point
            left = current_pos
            right = counts_len + 1

            while left < right:
                mid = (left + right) // 2
                if cumulative_counts[mid] < target_cumulative:
                    left = mid + 1
                else:
                    right = mid

            # Apply same logic as chunkers: bisect_left(...) - 1, then bounds checking
            index = min(left - 1, counts_len)

            # Ensure we make progress (at least one element per chunk)
            if index <= current_pos:
                index = current_pos + 1

            merge_indices.append(index)
            current_pos = index

        return merge_indices

    finally:
        free(cumulative_counts)
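
For illustration, here is a minimal sketch of how the two compiled helpers behave on the docstring inputs, assuming the extension has been built in place (the expected values follow the algorithm as written above; treat this as a sketch, not project-endorsed usage):

# Hypothetical example session against the built extension.
from chonkie.chunker.c_extensions.merge import _merge_splits, find_merge_indices

splits = ["Hello", "world", "!", "How", "are", "you", "?"]
merged, counts = _merge_splits(splits, [1] * len(splits), 3)
print(merged)  # ['Helloworld', '!How', 'areyou', '?']
print(counts)  # [2, 2, 2, 1]

# find_merge_indices reports chunk boundaries instead of merged text:
print(find_merge_indices([10, 15, 20, 5, 8, 12], 30))  # [2, 4, 6]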
style: Consider pinning Cython to a specific version range (e.g. `cython>=3.0.0,<4.0.0`) to prevent future compatibility issues.
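
If the maintainers adopt this suggestion, one plausible place for the pin is the build-system table (an illustrative pyproject.toml excerpt assuming a PEP 517 setuptools build; the repo's actual pyproject.toml contents aren't shown in this diff):

# pyproject.toml (illustrative excerpt, not the repo's actual file)
[build-system]
requires = ["setuptools", "cython>=3.0.0,<4.0.0"]
build-backend = "setuptools.build_meta"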