-
Notifications
You must be signed in to change notification settings - Fork 354
perf: use size prefixing for zstd buffer compressor for better decompressing performance #4029
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
ACTION NEEDED The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error please inspect the "PR Title Check" action. |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #4029 +/- ##
==========================================
+ Coverage 78.68% 78.70% +0.01%
==========================================
Files 285 285
Lines 113471 113590 +119
Branches 113471 113590 +119
==========================================
+ Hits 89289 89398 +109
- Misses 20758 20763 +5
- Partials 3424 3429 +5
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
dc9ed0c
to
034591a
Compare
@@ -3346,7 +3346,15 @@ impl PrimitiveFieldEncoder { | |||
row_number: 0, // legacy encoders do not use | |||
}) | |||
}) | |||
.map(|res_res| res_res.unwrap()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is probably less relevant with this PR, but previously the unwrap
here causing the program to crash since the encoding task error is not handled
} | ||
|
||
#[test] | ||
fn test_compress_zstd_raw_stream_format_and_decompress_with_length_prefixed() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A dedicated test case is added to verify the previously compressed data could be decompressed as well
034591a
to
19e75c8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a few suggestions to avoid breaking the 2.1 path here:
pub fn per_value_decompress<T: ArrowNativeType>( |
In that path we first allocate a buffer and then call decompress
multiple times with the same output buffer.
19e75c8
to
6745f80
Compare
…g performance.
6745f80
to
f1afae9
Compare
d3acab3
to
4818a9a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will cause potential forward compatibility issues. Files written with newer versions of lance will not be readable in older versions of lance. This is not a concern for us (lancedb) as we don't make use of this feature when using 2.0.
@yanghua do you have any concerns? If not, I think we can merge once CI passes.
Summary
This PR addresses lancedb/lance#4028 by introducing a length-prefixed format for the Zstd buffer compressor. The goal is to improve the decompression performance of Zstd-compressed buffers.
Solution
Currently, the Zstd buffer compressor uses the standard compression API and decompresses data using the Zstd streaming API. While functional, this approach is not optimal as the streaming decompression introduces overhead.
In this PR, I switch to a block decompression approach by:
Consistency with Existing Approaches
prepend_size
parameter in thecompress_to_buffer
function.Benchmark Results
In limited benchmarks across several input sizes:
References