perf: use size prefixing for zstd buffer compressor for better decompressing performance #4029

niyue · 2025-06-18T07:46:34Z

Summary

This PR addresses lancedb/lance#4028 by introducing a length-prefixed format for the Zstd buffer compressor. The goal is to improve the decompression performance of Zstd-compressed buffers.

Solution

Currently, the Zstd buffer compressor uses the standard compression API and decompresses data using the Zstd streaming API. While functional, this approach is not optimal as the streaming decompression introduces overhead.

In this PR, I switch to a block decompression approach by:

- Prefixing each compressed buffer with its original (uncompressed) length, encoded as a 64-bit little-endian integer.
- Using Zstd's block decompression API during decoding, which avoids the overhead of stream mode.
- Using the Zstd magic number and a reserved bit in the Zstd frame header descriptor (as defined in RFC 8878) to distinguish a standard Zstd stream from the new length-prefixed format, so that previously written data remains readable.
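
The framing and detection steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in this PR: the names `add_length_prefix`, `classify`, and `Layout` are hypothetical, the compressed payload is treated as opaque bytes (no Zstd call is made), and the exact detection rule used by the PR may differ. The magic number and the position of the reserved bit are taken from RFC 8878.

```rust
/// Zstd frame magic number from RFC 8878, stored little-endian (28 B5 2F FD).
const ZSTD_MAGIC: [u8; 4] = 0xFD2F_B528u32.to_le_bytes();
/// Bit 3 of the frame header descriptor is reserved and must be 0 in a valid frame.
const RESERVED_BIT: u8 = 0b0000_1000;
/// Size of the uncompressed-length prefix (64-bit little-endian).
const PREFIX_LEN: usize = 8;

/// Hypothetical classification of an encoded buffer.
#[derive(Debug, PartialEq)]
enum Layout<'a> {
    /// A plain Zstd frame, as written before the length prefix existed.
    LegacyFrame(&'a [u8]),
    /// The 8-byte uncompressed length followed by the compressed payload.
    LengthPrefixed { uncompressed_len: u64, payload: &'a [u8] },
}

/// Prepend the uncompressed length to an already-compressed payload.
fn add_length_prefix(uncompressed_len: u64, compressed: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(PREFIX_LEN + compressed.len());
    out.extend_from_slice(&uncompressed_len.to_le_bytes());
    out.extend_from_slice(compressed);
    out
}

/// Assumed detection rule: a buffer that starts with the Zstd magic number and
/// whose next byte has the reserved bit cleared is treated as a legacy frame.
/// Under this simplified rule, a length whose low 32 bits equal the magic
/// number would be ambiguous, so a real writer has to rule that case out.
fn classify(buf: &[u8]) -> Option<Layout<'_>> {
    if buf.len() > 4 && buf.starts_with(&ZSTD_MAGIC) && buf[4] & RESERVED_BIT == 0 {
        return Some(Layout::LegacyFrame(buf));
    }
    if buf.len() < PREFIX_LEN {
        return None; // too short to hold the length prefix
    }
    let (prefix, payload) = buf.split_at(PREFIX_LEN);
    let mut len_bytes = [0u8; PREFIX_LEN];
    len_bytes.copy_from_slice(prefix);
    Some(Layout::LengthPrefixed {
        uncompressed_len: u64::from_le_bytes(len_bytes),
        payload,
    })
}
```

With the length known up front, a decoder can allocate the output once and call a one-shot decompression routine instead of driving a streaming decoder.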

Consistency with Existing Approaches

This technique is consistent with the one used by the [Apache Arrow IPC format][1][2], which also embeds the uncompressed size to enable faster, more efficient decompression.
The LZ4 buffer compressor in Lance already follows a similar approach, since its API supports this natively through the `prepend_size` parameter of the `compress_to_buffer` function.

Benchmark Results

In limited benchmarks across several input sizes:

- Decompression speed improved by 30% to 200%, depending on the data.
- Compression speed remains unchanged.
- Compressed size increases by 8 bytes (due to the uncompressed size prefix).

References

[1] Arrow IPC writer: https://github.com/apache/arrow/blob/bd31f83aaa93db8427bc90285658e29d79ce5efd/cpp/src/arrow/ipc/writer.cc#L230-L233
[2] Arrow IPC reader: https://github.com/apache/arrow/blob/bd31f83aaa93db8427bc90285658e29d79ce5efd/cpp/src/arrow/ipc/reader.cc#L398-L407

github-actions · 2025-06-18T07:46:51Z

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

codecov-commenter · 2025-06-18T08:21:35Z

Codecov Report

Attention: Patch coverage is 86.99187% with 16 lines in your changes missing coverage. Please review.

Project coverage is 78.70%. Comparing base (1afdf3f) to head (8b05594).
Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...-encoding/src/encodings/physical/block_compress.rs | 91.30% | 5 Missing and 5 partials ⚠️ |
| .../lance-encoding/src/encodings/logical/primitive.rs | 25.00% | 4 Missing and 2 partials ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #4029      +/-   ##
==========================================
+ Coverage   78.68%   78.70%   +0.01%     
==========================================
  Files         285      285              
  Lines      113471   113590     +119     
  Branches   113471   113590     +119     
==========================================
+ Hits        89289    89398     +109     
- Misses      20758    20763       +5     
- Partials     3424     3429       +5

| Flag | Coverage Δ |
|---|---|
| unittests | `78.70% <86.99%> (+0.01%)` ⬆️ |

niyue · 2025-06-18T14:57:45Z

rust/lance-encoding/src/encodings/logical/primitive.rs

@@ -3346,7 +3346,15 @@ impl PrimitiveFieldEncoder {
                row_number: 0, // legacy encoders do not use
            })
        })
-        .map(|res_res| res_res.unwrap())


This is not strictly related to this PR, but previously the `unwrap` here caused the program to crash because the encoding task error was not handled.
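
As a rough illustration of the idea (not the actual change in `primitive.rs`), a nested task result can be flattened into a single error type instead of being unwrapped. The `EncodeError` type and `flatten_task_result` helper below are hypothetical and use only the standard library; the real code works with Lance's own error and task types.

```rust
use std::fmt::Display;

/// Hypothetical error type standing in for the crate's real error enum.
#[derive(Debug, PartialEq)]
enum EncodeError {
    /// The spawned task itself failed (for example, it panicked).
    TaskFailed(String),
    /// The task ran to completion but encoding returned an error.
    Encoding(String),
}

/// Turn `Result<Result<T, EncodeError>, J>` into `Result<T, EncodeError>`
/// so the caller can propagate the failure rather than panic on `unwrap`.
fn flatten_task_result<T, J: Display>(
    res: Result<Result<T, EncodeError>, J>,
) -> Result<T, EncodeError> {
    match res {
        // Task finished: pass its own result through unchanged.
        Ok(inner) => inner,
        // Task did not finish: convert the join error into an EncodeError.
        Err(join_err) => Err(EncodeError::TaskFailed(join_err.to_string())),
    }
}
```

Used in place of `.map(|res_res| res_res.unwrap())`, a helper of this shape lets the error reach the caller as an `Err` value.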

niyue · 2025-06-18T14:59:22Z

rust/lance-encoding/src/encodings/physical/block_compress.rs

+    }
+
+    #[test]
+    fn test_compress_zstd_raw_stream_format_and_decompress_with_length_prefixed() {


A dedicated test case is added to verify that data compressed with the previous raw stream format can still be decompressed.

westonpace

I have a few suggestions to avoid breaking the 2.1 path here:

lance/rust/lance-encoding/src/encodings/physical/block_compress.rs

Line 267 in a499cfa

pub fn per_value_decompress<T: ArrowNativeType>(

In that path we first allocate a buffer and then call decompress multiple times with the same output buffer.
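
To make the concern concrete, here is a minimal sketch of a decoder that appends to a shared output buffer, so that repeated calls do not overwrite earlier results. It is an assumption-laden illustration: `decompress_append` is a hypothetical name, and the "payload" is stored uncompressed as a stand-in for a real Zstd call, so only the buffer handling is meaningful.

```rust
use std::convert::TryFrom;

/// Size of the uncompressed-length prefix (64-bit little-endian).
const PREFIX_LEN: usize = 8;

/// Decode one length-prefixed buffer and append the result to `output`.
/// Stand-in codec: the payload is copied as-is instead of being
/// Zstd-decompressed, which is why its length must equal the prefix here.
fn decompress_append(framed: &[u8], output: &mut Vec<u8>) -> Result<usize, String> {
    if framed.len() < PREFIX_LEN {
        return Err("buffer shorter than the 8-byte length prefix".to_string());
    }
    let (prefix, payload) = framed.split_at(PREFIX_LEN);
    let mut len_bytes = [0u8; PREFIX_LEN];
    len_bytes.copy_from_slice(prefix);
    let expected = usize::try_from(u64::from_le_bytes(len_bytes))
        .map_err(|e| e.to_string())?;
    if payload.len() != expected {
        return Err("payload length does not match the prefix".to_string());
    }
    // Write after the bytes already in `output` rather than at offset 0,
    // so a caller can reuse one buffer across many values.
    let start = output.len();
    output.resize(start + expected, 0);
    output[start..].copy_from_slice(payload);
    Ok(expected)
}
```

A per-value caller would allocate `output` once and invoke the function for each value, reading the decoded values back in order from the single buffer.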

westonpace

This will cause potential forward compatibility issues. Files written with newer versions of lance will not be readable in older versions of lance. This is not a concern for us (lancedb) as we don't make use of this feature when using 2.0.

@yanghua do you have any concerns? If not, I think we can merge once CI passes.

niyue mentioned this pull request Jun 18, 2025

perf: use size prefixing for zstd buffer compressor for better decompression performance #4028

Open

niyue changed the title ~~Use size prefixing for zstd buffer compressor for better decompressing performance~~ perf: use size prefixing for zstd buffer compressor for better decompressing performance Jun 18, 2025

github-actions bot added the performance label Jun 18, 2025

yanghua reviewed Jun 18, 2025

niyue force-pushed the perf/length-prefixed-zstd branch 2 times, most recently from dc9ed0c to 034591a Compare June 18, 2025 14:51

niyue commented Jun 18, 2025

niyue force-pushed the perf/length-prefixed-zstd branch from 034591a to 19e75c8 Compare June 18, 2025 15:07

westonpace requested changes Jun 18, 2025

niyue force-pushed the perf/length-prefixed-zstd branch from 19e75c8 to 6745f80 Compare June 19, 2025 14:29

Use size prefixing for zstd buffer compressor for better decompressing performance.

f1afae9

niyue force-pushed the perf/length-prefixed-zstd branch from 6745f80 to f1afae9 Compare June 19, 2025 14:30

Make zstd compressor's decompress to work if used multiple times.

4818a9a

niyue force-pushed the perf/length-prefixed-zstd branch from d3acab3 to 4818a9a Compare June 20, 2025 11:06

Merge branch 'main' into perf/length-prefixed-zstd

8b05594

westonpace approved these changes Jun 20, 2025
