Improve Parquet writer performance by lnkuiper · Pull Request #16243 · duckdb/duckdb

Improve Parquet writer performance #16243


Merged
Mytherin merged 21 commits into duckdb:main Feb 18, 2025

Conversation

lnkuiper (Contributor)
  1. Slightly improve the performance of hashing long strings
  2. Implement bit-packing for our `RleBpEncoder` (it only did RLE until now; see the sketch after this list)
  3. Implement a custom hash map to build Parquet dictionaries (instead of using `std::unordered_map`)
  4. Add many fast paths, e.g., for handling `NULL` values when we know there are no `NULL`s
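
A minimal sketch of the bit-packed half of the hybrid RLE/BP encoding, to illustrate what item 2 adds. This is not DuckDB's actual `RleBpEncoder`; the helper name and structure are made up, and only the format rules (8 values per group, little-endian bit order) come from the Parquet spec.

```cpp
#include <cstdint>
#include <vector>

// Illustrative Parquet-style bit-packing (not DuckDB's RleBpEncoder):
// packs each value's low 'bit_width' bits back-to-back, little-endian,
// 8 values per group, as the BP half of the RLE/BP hybrid encoding requires.
static void BitPackGroup(const uint32_t *values, uint32_t bit_width,
                         std::vector<uint8_t> &out) {
    uint64_t buffer = 0;        // bit accumulator
    uint32_t bits_in_buffer = 0;
    for (uint32_t i = 0; i < 8; i++) {
        buffer |= static_cast<uint64_t>(values[i]) << bits_in_buffer;
        bits_in_buffer += bit_width;
        while (bits_in_buffer >= 8) { // flush whole bytes as they fill up
            out.push_back(static_cast<uint8_t>(buffer & 0xFF));
            buffer >>= 8;
            bits_in_buffer -= 8;
        }
    }
    // a full group of 8 values is always byte-aligned (8 * bit_width bits),
    // so nothing remains in the accumulator here
}
```

In the format itself, each bit-packed run is preceded by a ULEB128 header `(num_groups << 1) | 1`, where `num_groups` counts these 8-value groups; an RLE run instead uses `run_length << 1` followed by the repeated value, and the encoder switches between the two based on how long the runs are.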

This brings the time it takes to write `lineitem` from TPC-H SF10 to Parquet from ~4.5s to ~2.5s on my laptop, as well as reducing the size from 2.4 GB to 2.0 GB (with the default `PARQUET_VERSION V1`).

When enabling `PARQUET_VERSION V2`, time reduces from ~3.8s to ~2.4s on my laptop, and size reduces from 1.7 GB to 1.6 GB.

It also adds the parameter `string_dictionary_page_size_limit` to the Parquet writer, which makes the page size of dictionary pages configurable. It defaults to 1 MB.
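
The interplay between the custom dictionary hash map (item 3) and this page size limit can be sketched as follows. Everything below is illustrative: the class, the probe scheme, the 4-bytes-per-string page accounting, and the fall-back-on-limit behavior are assumptions, not DuckDB's implementation; only the option name and its 1 MB default come from this PR.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Illustrative open-addressing dictionary builder (not DuckDB's actual code).
// A flat slot table with linear probing avoids std::unordered_map's per-node
// allocations and pointer chasing. 'page_size_limit' mimics the new
// string_dictionary_page_size_limit option (default 1 MB): once the
// dictionary page would grow past it, Insert reports failure so the caller
// can stop dictionary-encoding.
struct StringDictionary {
    explicit StringDictionary(uint64_t page_size_limit)
        : limit(page_size_limit), slots(1024, -1) {}

    // Returns the value's dictionary index, or -1 if adding a new entry
    // would exceed the page size limit.
    int32_t Insert(const std::string &s) {
        uint64_t mask = slots.size() - 1;
        for (uint64_t pos = Hash(s) & mask;; pos = (pos + 1) & mask) {
            int32_t idx = slots[pos];
            if (idx < 0) { // empty slot: this is a new dictionary entry
                // assume a 4-byte length prefix per string on the page
                if (page_bytes + 4 + s.size() > limit) {
                    return -1;
                }
                page_bytes += 4 + s.size();
                values.push_back(s);
                int32_t new_idx = static_cast<int32_t>(values.size() - 1);
                slots[pos] = new_idx;
                if (values.size() * 2 > slots.size()) {
                    Grow(); // keep the load factor below 0.5
                }
                return new_idx;
            }
            if (values[idx] == s) {
                return idx; // string already in the dictionary
            }
        }
    }

private:
    static uint64_t Hash(const std::string &s) { // placeholder FNV-1a hash
        uint64_t h = 14695981039346656037ULL;
        for (unsigned char c : s) {
            h = (h ^ c) * 1099511628211ULL;
        }
        return h;
    }
    void Grow() { // rebuild the slot table at double the size
        std::vector<int32_t> fresh(slots.size() * 2, -1);
        uint64_t mask = fresh.size() - 1;
        for (size_t i = 0; i < values.size(); i++) {
            uint64_t pos = Hash(values[i]) & mask;
            while (fresh[pos] >= 0) {
                pos = (pos + 1) & mask;
            }
            fresh[pos] = static_cast<int32_t>(i);
        }
        slots.swap(fresh);
    }

    uint64_t limit;
    uint64_t page_bytes = 0;
    std::vector<int32_t> slots;      // open-addressing table of value indices
    std::vector<std::string> values; // dictionary entries in insertion order
};
```

Compared to `std::unordered_map`, the flat slot table does one allocation per growth step instead of one per entry, and probing touches contiguous memory.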

duckdb-draftbot marked this pull request as draft February 14, 2025 15:17
lnkuiper marked this pull request as ready for review February 14, 2025 15:17
gaborcsardi

Hah, I just finished implementing a better RLE/BP encoder as well, and was about to submit a PR. 😆
For now it is here: https://github.com/gaborcsardi/duckdb-r/pull/1/files

duckdb-draftbot marked this pull request as draft February 17, 2025 09:28
lnkuiper marked this pull request as ready for review February 17, 2025 09:28
duckdb-draftbot marked this pull request as draft February 17, 2025 12:35
lnkuiper marked this pull request as ready for review February 17, 2025 12:57
lnkuiper (Contributor, Author)

I also optimized the loop of the boolean column writer. Now that it has fewer branches, it's ~40% faster.
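
A hedged sketch of the kind of branch reduction described here: Parquet's PLAIN encoding stores booleans as an LSB-first bitmap, and packing eight values per output byte replaces a data-dependent branch per value with straight-line ORs. The function below is illustrative, not the actual writer loop.

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative low-branch boolean packing loop (not DuckDB's actual code).
// Eight bool values are OR-ed into one output byte, LSB first, so the hot
// loop has no per-value branch and the compiler can unroll or vectorize it.
static void PackBools(const bool *src, size_t count, uint8_t *dst) {
    size_t full_bytes = count / 8;
    for (size_t i = 0; i < full_bytes; i++) {
        const bool *v = src + i * 8;
        dst[i] = static_cast<uint8_t>(v[0] | (v[1] << 1) | (v[2] << 2) |
                                      (v[3] << 3) | (v[4] << 4) | (v[5] << 5) |
                                      (v[6] << 6) | (v[7] << 7));
    }
    if (count % 8 != 0) { // tail: the remaining values share the last byte
        uint8_t byte = 0;
        for (size_t j = 0; j < count % 8; j++) {
            byte |= static_cast<uint8_t>(src[full_bytes * 8 + j]) << j;
        }
        dst[full_bytes] = byte;
    }
}
```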

lnkuiper added the Needs Documentation label Feb 17, 2025
duckdb-draftbot marked this pull request as draft February 17, 2025 19:15
Mytherin marked this pull request as ready for review February 17, 2025 21:16
Mytherin merged commit 8605f80 into duckdb:main Feb 18, 2025
49 checks passed
Mytherin (Collaborator)

Thanks!

Mytherin added a commit that referenced this pull request Feb 18, 2025
Follow-up of #16243. Scraping the bottom of the barrel here, as the
previous PR got many of the biggest performance gains already.

This PR adds some more fast paths for when there are no `NULL`s, and
implements a branchless hash function for `string_t`'s that are inlined.
This required some extra care to make sure that the hash function
returns the same value whether the string is inlined or not.

Overall, the changes reduce the time it takes to write TPC-H SF10
`lineitem` to Parquet from ~2.6s to ~2.4s (with the default
`PARQUET_VERSION V1`, ~2.5s to ~2.3s with `V2`).
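
As a hedged illustration of the branchless hash: assuming (as in DuckDB's `string_t`) that strings of up to 12 bytes are stored inline in a zero-padded buffer, two fixed-width loads can replace a length-dependent byte loop. The function names and mixing constants below are stand-ins, not the actual hash; the consistency trick is to zero-pad short non-inlined strings into the same layout before hashing.

```cpp
#include <cstdint>
#include <cstring>

// Illustrative branchless hash for short strings (not DuckDB's actual code).
// Assumption: inlined string_t data lives in a 12-byte buffer whose bytes
// past 'len' are zeroed. Two fixed 8-byte loads (overlapping in the middle)
// replace a length-dependent byte loop, so there is no branch on 'len'.
static inline uint64_t HashInlined12(const char *buf, uint32_t len) {
    uint64_t a, b;
    std::memcpy(&a, buf, 8);     // bytes 0..7
    std::memcpy(&b, buf + 4, 8); // bytes 4..11 (overlap is harmless for hashing)
    uint64_t h = (a ^ (b + len)) * 0x9E3779B97F4A7C15ULL; // stand-in mixer
    return h ^ (h >> 32);
}

// For a short string that is NOT inlined, copying it into a zeroed 12-byte
// scratch buffer first makes the hash agree with the inlined case:
static inline uint64_t HashShortPointer(const char *data, uint32_t len) {
    char scratch[12] = {0};          // zero padding mirrors the inline layout
    std::memcpy(scratch, data, len); // caller guarantees len <= 12
    return HashInlined12(scratch, len);
}
```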
Antonov548 added a commit to Antonov548/duckdb-r that referenced this pull request Mar 4, 2025
krlmlr pushed a commit to duckdb/duckdb-r that referenced this pull request Mar 5, 2025
theKingEpic pushed a commit to theKingEpic/duckdb that referenced this pull request Apr 9, 2025
theKingEpic added a commit to theKingEpic/duckdb that referenced this pull request Apr 9, 2025
theKingEpic added a commit to theKingEpic/duckdb that referenced this pull request Apr 9, 2025
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 15, 2025
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 15, 2025
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 17, 2025
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 18, 2025