[Enhancement] add parquet DELTA_BINARY_PACKED encoding benchmark by dirtysalt · Pull Request #58470 · StarRocks/starrocks · GitHub

[Enhancement] add parquet DELTA_BINARY_PACKED encoding benchmark #58470

Merged
kangkaisen merged 11 commits into StarRocks:main from add-parquet-benchmark
May 7, 2025

Conversation

dirtysalt
Contributor
@dirtysalt dirtysalt commented Apr 27, 2025

Why I'm doing:

To measure the performance of decoding DELTA_BINARY_PACKED-encoded data.

What I'm doing:

I added two benchmarks:

  • delta_decode_bench.cpp: compares different implementations of prefix sum
  • parquet_encoding_bench.cpp: compares decode performance of PLAIN and DELTA_BINARY_PACKED

DeltaDecodeBench

There are good references on computing prefix sums efficiently.

AVX-512 shows very promising results compared to the naive implementation (BM_*_native_prefix_sum) at 4096 elements:

  • on int32, avx512 is about 3.3x faster (505ns vs. 1671ns)
  • on int64, avx512 is about 1.7x faster (980ns vs. 1671ns)
---------------------------------------------------------------------------
Benchmark                                 Time             CPU   Iterations
---------------------------------------------------------------------------
BM_int32_avx512_prefix_sum/4096         505 ns          505 ns      1387154
BM_int32_avx512_prefix_sum/8192        1017 ns         1017 ns       688849
BM_int32_avx512_prefix_sum/16384       2336 ns         2335 ns       299731
BM_int32_avx512_prefix_sum/32768       4665 ns         4665 ns       149991
BM_int32_avx2_prefix_sum/4096           973 ns          973 ns       719452
BM_int32_avx2_prefix_sum/8192          1934 ns         1934 ns       361978
BM_int32_avx2_prefix_sum/16384         4204 ns         4204 ns       166237
BM_int32_avx2_prefix_sum/32768         8399 ns         8398 ns        83391
BM_int32_avx2x_prefix_sum/4096          804 ns          803 ns       871364
BM_int32_avx2x_prefix_sum/8192         1607 ns         1607 ns       435556
BM_int32_avx2x_prefix_sum/16384        3208 ns         3207 ns       218231
BM_int32_avx2x_prefix_sum/32768        6411 ns         6411 ns       109205
BM_int32_native_prefix_sum/4096        1671 ns         1670 ns       419114
BM_int32_native_prefix_sum/8192        3340 ns         3339 ns       209865
BM_int32_native_prefix_sum/16384       6684 ns         6683 ns       104963
BM_int32_native_prefix_sum/32768      13336 ns        13334 ns        52479
BM_int64_avx512_prefix_sum/4096         980 ns          980 ns       712131
BM_int64_avx512_prefix_sum/8192        1942 ns         1942 ns       360513
BM_int64_avx512_prefix_sum/16384       3882 ns         3881 ns       180216
BM_int64_avx512_prefix_sum/32768       7760 ns         7759 ns        90254
BM_int64_native_prefix_sum/4096        1671 ns         1671 ns       418874
BM_int64_native_prefix_sum/8192        3462 ns         3461 ns       202279
BM_int64_native_prefix_sum/16384       6910 ns         6909 ns       101323
BM_int64_native_prefix_sum/32768      13805 ns        13804 ns        50588
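
To illustrate what these benchmarks compare, here is a minimal portable sketch of the two approaches. It is not the code in delta_decode.h: the function names and the 8-lane block size are assumptions made for illustration, and plain arrays stand in for vector registers. The scalar loop is a serial dependency chain, while the blocked version scans each block in log2(lanes) shift-and-add steps, which is the pattern SIMD prefix sums typically implement with a lane shift plus a vector add.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar baseline: every output depends on the previous one.
// Unsigned arithmetic is used so that overflow wraps instead of being UB.
inline void scalar_prefix_sum(uint32_t* data, size_t n, uint32_t base) {
    uint32_t acc = base;
    for (size_t i = 0; i < n; ++i) {
        acc += data[i];
        data[i] = acc;
    }
}

// Hypothetical lane count; 8 x 32-bit matches one 256-bit register.
constexpr size_t kLanes = 8;

// Inclusive scan inside one block in log2(kLanes) steps.
// With intrinsics, each step would be one lane shift and one vector add.
inline void block_scan(std::array<uint32_t, kLanes>& v) {
    for (size_t shift = 1; shift < kLanes; shift <<= 1) {
        std::array<uint32_t, kLanes> shifted{};  // lanes below `shift` stay zero
        for (size_t i = shift; i < kLanes; ++i) shifted[i] = v[i - shift];
        for (size_t i = 0; i < kLanes; ++i) v[i] += shifted[i];
    }
}

// Blocked prefix sum: scan each block, then add the running carry.
inline void blocked_prefix_sum(uint32_t* data, size_t n, uint32_t base) {
    uint32_t carry = base;
    size_t i = 0;
    for (; i + kLanes <= n; i += kLanes) {
        std::array<uint32_t, kLanes> v;
        for (size_t j = 0; j < kLanes; ++j) v[j] = data[i + j];
        block_scan(v);
        for (size_t j = 0; j < kLanes; ++j) data[i + j] = v[j] + carry;
        carry = data[i + kLanes - 1];  // last lane carries into the next block
    }
    for (; i < n; ++i) {  // scalar tail for the remaining elements
        carry += data[i];
        data[i] = carry;
    }
}
```

Both functions produce the same output for any input; the blocked form only changes how the work inside a block is ordered, so that it maps onto wide registers.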

ParquetEncodingBench

The benchmark runs in several modes:

  • DECOMPRESS: latency to decompress a data page compressed with zstd
  • SKIP: latency to skip values
  • RANDOM: latency to decode random values
  • SERIES: latency to decode series (sequential) values
-------------------------------------------------------------------------------------------------------------------
Benchmark                                                                         Time             CPU   Iterations
-------------------------------------------------------------------------------------------------------------------
BMTestValue<tparquet::Type::INT32, DECOMPRESS>/0/4096                           371 ns          371 ns      1747621 enc=PLAIN,mode=DECOMPRESS,rows=4096,sz=16384,cpsz=16393
BMTestValue<tparquet::Type::INT32, DECOMPRESS>/5/4096                           470 ns          470 ns      1487887 enc=DELTA_BINARY_PACKED,mode=DECOMPRESS,rows=4096,sz=16682,cpsz=16691
BMTestValue<tparquet::Type::INT32, DECOMPRESS>/9/4096                           368 ns          368 ns      1811062 enc=BYTE_STREAM_SPLIT,mode=DECOMPRESS,rows=4096,sz=16384,cpsz=16393
BMTestValue<tparquet::Type::INT32, SKIP>/0/4096                                9.91 ns         9.91 ns     70646959 enc=PLAIN,mode=SKIP,rows=4096,sz=16384,cpsz=16393
BMTestValue<tparquet::Type::INT32, SKIP>/5/4096                                4716 ns         4716 ns       147316 enc=DELTA_BINARY_PACKED,mode=SKIP,rows=4096,sz=16682,cpsz=16691
BMTestValue<tparquet::Type::INT32, SKIP>/9/4096                                 512 ns          511 ns      1369719 enc=BYTE_STREAM_SPLIT,mode=SKIP,rows=4096,sz=16384,cpsz=16393
BMTestValue<tparquet::Type::INT32, RANDOM>/0/4096/-2147483648/2147483647        136 ns          136 ns      4627547 enc=PLAIN,mode=RANDOM,rows=4096,sz=16384,cpsz=16393
BMTestValue<tparquet::Type::INT32, RANDOM>/5/4096/-2147483648/2147483647       4715 ns         4714 ns       148474 enc=DELTA_BINARY_PACKED,mode=RANDOM,rows=4096,sz=16682,cpsz=16691
BMTestValue<tparquet::Type::INT32, RANDOM>/9/4096/-2147483648/2147483647        511 ns          511 ns      1370585 enc=BYTE_STREAM_SPLIT,mode=RANDOM,rows=4096,sz=16384,cpsz=16393
BMTestValue<tparquet::Type::INT32, SERIES>/0/4096/0/127                         145 ns          145 ns      4731256 enc=PLAIN,mode=SERIES,rows=4096,sz=16384,cpsz=12419
BMTestValue<tparquet::Type::INT32, SERIES>/5/4096/0/127                        3979 ns         3978 ns       175839 enc=DELTA_BINARY_PACKED,mode=SERIES,rows=4096,sz=198,cpsz=27
BMTestValue<tparquet::Type::INT32, SERIES>/9/4096/0/127                         511 ns          511 ns      1370541 enc=BYTE_STREAM_SPLIT,mode=SERIES,rows=4096,sz=16384,cpsz=873

Several observations from the table above:

  • PLAIN is the fastest encoding to skip and decode, because a skip just advances a pointer and a decode is just a memcpy.
  • for random values, decoding PLAIN vs. DELTA_BINARY_PACKED takes about 136ns vs. 4715ns (~35x).
  • for random values, decoding PLAIN vs. BYTE_STREAM_SPLIT takes about 136ns vs. 511ns (~3.8x).
  • for random values, DELTA_BINARY_PACKED has no advantage: latency is higher, and the compressed size is slightly larger (16691 bytes vs. 16393 bytes).
  • for series values, the compressed size is much smaller (12419 bytes for PLAIN vs. 27 bytes for DELTA_BINARY_PACKED). BYTE_STREAM_SPLIT also compresses well (873 bytes).
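
The gap between PLAIN and BYTE_STREAM_SPLIT can be sketched as follows. This is an illustrative sketch, not the StarRocks decoders: PlainInt32Reader and byte_stream_split_decode are hypothetical names, and a little-endian host is assumed. PLAIN skip only moves a cursor and PLAIN decode is a single memcpy, whereas BYTE_STREAM_SPLIT has to gather one byte from each of four streams for every value.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical cursor over a PLAIN-encoded INT32 page.
struct PlainInt32Reader {
    const uint8_t* cur;
    const uint8_t* end;

    // Skipping only advances the cursor; no value is read.
    bool skip(size_t count) {
        size_t bytes = count * sizeof(int32_t);
        if (bytes > static_cast<size_t>(end - cur)) return false;
        cur += bytes;
        return true;
    }

    // Decoding is one memcpy (assumes a little-endian host, since PLAIN
    // stores INT32 values as 4 little-endian bytes).
    bool decode(size_t count, int32_t* out) {
        size_t bytes = count * sizeof(int32_t);
        if (bytes > static_cast<size_t>(end - cur)) return false;
        std::memcpy(out, cur, bytes);
        cur += bytes;
        return true;
    }
};

// BYTE_STREAM_SPLIT stores byte k of every value in stream k, so decoding
// value i gathers page[k * num_values + i] for k = 0..3.
inline void byte_stream_split_decode(const uint8_t* page, size_t num_values,
                                     int32_t* out) {
    for (size_t i = 0; i < num_values; ++i) {
        uint8_t bytes[sizeof(int32_t)];
        for (size_t k = 0; k < sizeof(int32_t); ++k) {
            bytes[k] = page[k * num_values + i];
        }
        std::memcpy(&out[i], bytes, sizeof(int32_t));
    }
}
```

Grouping the same byte position of all values together is also why BYTE_STREAM_SPLIT compresses well on series values: the high-order byte streams are long runs of identical bytes.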

Decode bottleneck

If I remove the prefix sum step, the latency of DELTA_BINARY_PACKED drops from 4716ns to 2871ns. So we can estimate that

  • 2871ns is spent on decoding the bit-packed data
  • 1845ns is spent on computing the prefix sum.

The perf report no longer shows any obvious hotspot, so I think this is the best we can do right now. The results with the prefix sum removed:

-------------------------------------------------------------------------------------------------------------------
Benchmark                                                                         Time             CPU   Iterations
-------------------------------------------------------------------------------------------------------------------
BMTestValue<tparquet::Type::INT32, DECOMPRESS>/0/4096                           453 ns          453 ns      1683105 enc=PLAIN,mode=DECOMPRESS,rows=4096,sz=16384,cpsz=16393
BMTestValue<tparquet::Type::INT32, DECOMPRESS>/5/4096                           414 ns          414 ns      1808458 enc=DELTA_BINARY_PACKED,mode=DECOMPRESS,rows=4096,sz=16682,cpsz=16691
BMTestValue<tparquet::Type::INT32, DECOMPRESS>/9/4096                           395 ns          395 ns      1883612 enc=BYTE_STREAM_SPLIT,mode=DECOMPRESS,rows=4096,sz=16384,cpsz=16393
BMTestValue<tparquet::Type::INT32, SKIP>/0/4096                                10.3 ns         10.3 ns     68292149 enc=PLAIN,mode=SKIP,rows=4096,sz=16384,cpsz=16393
BMTestValue<tparquet::Type::INT32, SKIP>/5/4096                                2871 ns         2871 ns       242738 enc=DELTA_BINARY_PACKED,mode=SKIP,rows=4096,sz=16682,cpsz=16691
BMTestValue<tparquet::Type::INT32, SKIP>/9/4096                                 520 ns          520 ns      1347691 enc=BYTE_STREAM_SPLIT,mode=SKIP,rows=4096,sz=16384,cpsz=16393
BMTestValue<tparquet::Type::INT32, RANDOM>/0/4096/-2147483648/2147483647        157 ns          157 ns      4482335 enc=PLAIN,mode=RANDOM,rows=4096,sz=16384,cpsz=16393
BMTestValue<tparquet::Type::INT32, RANDOM>/5/4096/-2147483648/2147483647       2867 ns         2866 ns       244109 enc=DELTA_BINARY_PACKED,mode=RANDOM,rows=4096,sz=16682,cpsz=16691
BMTestValue<tparquet::Type::INT32, RANDOM>/9/4096/-2147483648/2147483647        514 ns          514 ns      1362390 enc=BYTE_STREAM_SPLIT,mode=RANDOM,rows=4096,sz=16384,cpsz=16393
BMTestValue<tparquet::Type::INT32, SERIES>/0/4096/0/127                         163 ns          163 ns      4250340 enc=PLAIN,mode=SERIES,rows=4096,sz=16384,cpsz=12419
BMTestValue<tparquet::Type::INT32, SERIES>/5/4096/0/127                        1998 ns         1997 ns       345741 enc=DELTA_BINARY_PACKED,mode=SERIES,rows=4096,sz=198,cpsz=27
BMTestValue<tparquet::Type::INT32, SERIES>/9/4096/0/127                         514 ns          514 ns      1362825 enc=BYTE_STREAM_SPLIT,mode=SERIES,rows=4096,sz=16384,cpsz=873
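
The two phases separated in the decode bottleneck measurement can be sketched like this. It is a simplified, unoptimized illustration and not the code in encoding_delta.h: the function names are hypothetical, values are assumed to be packed LSB-first, and the page and block headers are left out. Phase 1 unpacks fixed-width deltas from a miniblock; phase 2 adds the block's minimum delta and accumulates a running sum to recover the values.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Phase 1: unpack `count` values of `bit_width` bits each (bit_width <= 32),
// assuming LSB-first bit packing. Written bit by bit for clarity, not speed.
inline void unpack_bits(const uint8_t* src, size_t count, unsigned bit_width,
                        uint32_t* out) {
    uint64_t bit_pos = 0;
    for (size_t i = 0; i < count; ++i) {
        uint32_t v = 0;
        for (unsigned b = 0; b < bit_width; ++b, ++bit_pos) {
            uint32_t bit = (src[bit_pos >> 3] >> (bit_pos & 7)) & 1u;
            v |= bit << b;
        }
        out[i] = v;
    }
}

// Phase 2: add the block's min_delta back to each unpacked delta and take
// the prefix sum, starting from the last decoded value. Unsigned arithmetic
// keeps overflow well defined (wrapping).
inline void add_min_delta_and_prefix_sum(uint32_t* deltas, size_t count,
                                         uint32_t min_delta,
                                         uint32_t& last_value) {
    for (size_t i = 0; i < count; ++i) {
        last_value += deltas[i] + min_delta;
        deltas[i] = last_value;
    }
}
```

For example, the byte 0xB1 with bit_width 2 unpacks to 1, 0, 3, 2; with min_delta 5 and a previous value of 100, phase 2 turns that into 106, 111, 119, 126. Removing phase 2 from the benchmark leaves only the cost of phase 1.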

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Does this PR entail a change in behavior?

  • Yes, this PR will result in a change in behavior.
  • No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

  • Interface/UI changes: syntax, type conversion, expression evaluation, display information
  • Parameter changes: default values, similar parameters but with different default values
  • Policy changes: use new policy to replace old one, functionality automatically enabled
  • Feature removed
  • Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This pr needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function
  • This is a backport pr

Bugfix cherry-pick branch check:

  • I have checked the version labels which the pr will be auto-backported to the target branch
    • 3.5
    • 3.4
    • 3.3
    • 3.2
    • 3.1

@dirtysalt dirtysalt enabled auto-merge (squash) April 27, 2025 23:34
@dirtysalt dirtysalt requested review from a team as code owners April 28, 2025 10:51
@dirtysalt dirtysalt requested a review from a team as a code owner April 30, 2025 12:52
@dirtysalt dirtysalt changed the title from "[Enhancement] add parquet encoding benchmark" to "[Enhancement] add parquet DELTA_BINARY_PACKED encoding benchmark" May 1, 2025
dirtysalt added 10 commits May 4, 2025 20:03
Signed-off-by: yan zhang <dirtysalt1987@gmail.com>
@dirtysalt dirtysalt force-pushed the add-parquet-benchmark branch from 22a1104 to 32b1306 Compare May 5, 2025 02:21
Signed-off-by: yan zhang <dirtysalt1987@gmail.com>
github-actions bot commented May 6, 2025

[FE Incremental Coverage Report]

pass : 0 / 0 (0%)

github-actions bot commented May 6, 2025

[Java-Extensions Incremental Coverage Report]

pass : 0 / 0 (0%)

github-actions bot commented May 6, 2025

[BE Incremental Coverage Report]

fail : 78 / 120 (65.00%)

file detail

path covered_line new_line coverage not_covered_line_detail
🔵 be/src/simd/delta_decode.h 57 99 57.58% [22, 23, 25, 26, 27, 28, 30, 31, 32, 34, 35, 40, 42, 43, 46, 47, 48, 49, 50, 51, 52, 53, 56, 57, 58, 59, 60, 63, 64, 66, 67, 69, 70, 71, 74, 77, 78, 79, 81, 194, 202, 257]
🔵 be/src/formats/parquet/encoding_delta.h 21 21 100.00% []

@kangkaisen kangkaisen disabled auto-merge May 7, 2025 01:27
@kangkaisen kangkaisen merged commit 9bcc1e6 into StarRocks:main May 7, 2025
58 of 59 checks passed
@dirtysalt dirtysalt deleted the add-parquet-benchmark branch May 7, 2025 01:29