Improve Parquet writer performance by lnkuiper · Pull Request #16243 · duckdb/duckdb

Improve Parquet writer performance #16243


Merged
Mytherin merged 21 commits into duckdb:main Feb 18, 2025

Conversation

lnkuiper (Contributor)
  1. Slightly improve the performance of hashing long strings
  2. Implement bit-packing for our `RleBpEncoder` (it only did RLE until now; see the sketch after this list)
  3. Implement a custom hash map to build Parquet dictionaries (instead of using `std::unordered_map`)
  4. Add many fast paths, e.g., for handling `NULL` values when we know there are no `NULL`s
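
A minimal sketch of the bit-packed half of the hybrid RLE/BP encoding, to illustrate what item 2 adds. This is not DuckDB's actual `RleBpEncoder`; the helper name and structure are made up, and only the format rules (8 values per group, little-endian bit order) come from the Parquet spec.

```cpp
#include <cstdint>
#include <vector>

// Illustrative Parquet-style bit-packing (not DuckDB's RleBpEncoder):
// packs each value's low 'bit_width' bits back-to-back, little-endian,
// 8 values per group, as the BP half of the RLE/BP hybrid encoding requires.
static void BitPackGroup(const uint32_t *values, uint32_t bit_width,
                         std::vector<uint8_t> &out) {
    uint64_t buffer = 0;        // bit accumulator
    uint32_t bits_in_buffer = 0;
    for (uint32_t i = 0; i < 8; i++) {
        buffer |= static_cast<uint64_t>(values[i]) << bits_in_buffer;
        bits_in_buffer += bit_width;
        while (bits_in_buffer >= 8) { // flush whole bytes as they fill up
            out.push_back(static_cast<uint8_t>(buffer & 0xFF));
            buffer >>= 8;
            bits_in_buffer -= 8;
        }
    }
    // a full group of 8 values is always byte-aligned (8 * bit_width bits),
    // so nothing remains in the accumulator here
}
```

In the format itself, each bit-packed run is preceded by a ULEB128 header `(num_groups << 1) | 1`, where `num_groups` counts these 8-value groups; an RLE run instead uses `run_length << 1` followed by the repeated value, and the encoder switches between the two based on how long the runs are.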

This brings the time it takes to write `lineitem` from TPC-H SF10 to Parquet from ~4.5s to ~2.5s on my laptop, as well as reducing the size from 2.4 GB to 2.0 GB (with the default `PARQUET_VERSION V1`).

When enabling `PARQUET_VERSION V2`, time reduces from ~3.8s to ~2.4s on my laptop, and size reduces from 1.7 GB to 1.6 GB.

It also adds the parameter `string_dictionary_page_size_limit` to the Parquet writer, which makes the page size of dictionary pages configurable. It defaults to 1 MB.
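
The interplay between the custom dictionary hash map (item 3) and this page size limit can be sketched as follows. Everything below is illustrative: the class, the probe scheme, the 4-bytes-per-string page accounting, and the fall-back-on-limit behavior are assumptions, not DuckDB's implementation; only the option name and its 1 MB default come from this PR.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Illustrative open-addressing dictionary builder (not DuckDB's actual code).
// A flat slot table with linear probing avoids std::unordered_map's per-node
// allocations and pointer chasing. 'page_size_limit' mimics the new
// string_dictionary_page_size_limit option (default 1 MB): once the
// dictionary page would grow past it, Insert reports failure so the caller
// can stop dictionary-encoding.
struct StringDictionary {
    explicit StringDictionary(uint64_t page_size_limit)
        : limit(page_size_limit), slots(1024, -1) {}

    // Returns the value's dictionary index, or -1 if adding a new entry
    // would exceed the page size limit.
    int32_t Insert(const std::string &s) {
        uint64_t mask = slots.size() - 1;
        for (uint64_t pos = Hash(s) & mask;; pos = (pos + 1) & mask) {
            int32_t idx = slots[pos];
            if (idx < 0) { // empty slot: this is a new dictionary entry
                // assume a 4-byte length prefix per string on the page
                if (page_bytes + 4 + s.size() > limit) {
                    return -1;
                }
                page_bytes += 4 + s.size();
                values.push_back(s);
                int32_t new_idx = static_cast<int32_t>(values.size() - 1);
                slots[pos] = new_idx;
                if (values.size() * 2 > slots.size()) {
                    Grow(); // keep the load factor below 0.5
                }
                return new_idx;
            }
            if (values[idx] == s) {
                return idx; // string already in the dictionary
            }
        }
    }

private:
    static uint64_t Hash(const std::string &s) { // placeholder FNV-1a hash
        uint64_t h = 14695981039346656037ULL;
        for (unsigned char c : s) {
            h = (h ^ c) * 1099511628211ULL;
        }
        return h;
    }
    void Grow() { // rebuild the slot table at double the size
        std::vector<int32_t> fresh(slots.size() * 2, -1);
        uint64_t mask = fresh.size() - 1;
        for (size_t i = 0; i < values.size(); i++) {
            uint64_t pos = Hash(values[i]) & mask;
            while (fresh[pos] >= 0) {
                pos = (pos + 1) & mask;
            }
            fresh[pos] = static_cast<int32_t>(i);
        }
        slots.swap(fresh);
    }

    uint64_t limit;
    uint64_t page_bytes = 0;
    std::vector<int32_t> slots;      // open-addressing table of value indices
    std::vector<std::string> values; // dictionary entries in insertion order
};
```

Compared to `std::unordered_map`, the flat slot table does one allocation per growth step instead of one per entry, and probing touches contiguous memory.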

duckdb-draftbot marked this pull request as draft February 14, 2025 15:17
lnkuiper marked this pull request as ready for review February 14, 2025 15:17
gaborcsardi

Hah, I just finished implementing a better RLE/BP encoder as well, and was about to submit a PR. 😆
For now it is here: https://github.com/gaborcsardi/duckdb-r/pull/1/files

duckdb-draftbot marked this pull request as draft February 17, 2025 09:28
lnkuiper marked this pull request as ready for review February 17, 2025 09:28
duckdb-draftbot marked this pull request as draft February 17, 2025 12:35
lnkuiper marked this pull request as ready for review February 17, 2025 12:57
lnkuiper (Contributor, Author)

I also optimized the loop of the boolean column writer. Now that it has fewer branches, it's ~40% faster.
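
A hedged sketch of the kind of branch reduction described here: Parquet's PLAIN encoding stores booleans as an LSB-first bitmap, and packing eight values per output byte replaces a data-dependent branch per value with straight-line ORs. The function below is illustrative, not the actual writer loop.

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative low-branch boolean packing loop (not DuckDB's actual code).
// Eight bool values are OR-ed into one output byte, LSB first, so the hot
// loop has no per-value branch and the compiler can unroll or vectorize it.
static void PackBools(const bool *src, size_t count, uint8_t *dst) {
    size_t full_bytes = count / 8;
    for (size_t i = 0; i < full_bytes; i++) {
        const bool *v = src + i * 8;
        dst[i] = static_cast<uint8_t>(v[0] | (v[1] << 1) | (v[2] << 2) |
                                      (v[3] << 3) | (v[4] << 4) | (v[5] << 5) |
                                      (v[6] << 6) | (v[7] << 7));
    }
    if (count % 8 != 0) { // tail: the remaining values share the last byte
        uint8_t byte = 0;
        for (size_t j = 0; j < count % 8; j++) {
            byte |= static_cast<uint8_t>(src[full_bytes * 8 + j]) << j;
        }
        dst[full_bytes] = byte;
    }
}
```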

lnkuiper added the Needs Documentation label Feb 17, 2025
duckdb-draftbot marked this pull request as draft February 17, 2025 19:15
Mytherin marked this pull request as ready for review February 17, 2025 21:16
Mytherin merged commit 8605f80 into duckdb:main Feb 18, 2025
49 checks passed
Mytherin (Collaborator)

Thanks!

Mytherin added a commit that referenced this pull request Feb 18, 2025
Follow-up of #16243. Scraping the bottom of the barrel here, as the
previous PR got many of the biggest performance gains already.

This PR adds some more fast paths for when there are no `NULL`s, and
implements a branchless hash function for `string_t`'s that are inlined.
This required some extra care to make sure that the hash function
returns the same value whether the string is inlined or not.

Overall, the changes reduce the time it takes to write TPC-H SF10
`lineitem` to Parquet from ~2.6s to ~2.4s (with the default
`PARQUET_VERSION V1`, ~2.5s to ~2.3s with `V2`).
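
As a hedged illustration of the branchless hash: assuming (as in DuckDB's `string_t`) that strings of up to 12 bytes are stored inline in a zero-padded buffer, two fixed-width loads can replace a length-dependent byte loop. The function names and mixing constants below are stand-ins, not the actual hash; the consistency trick is to zero-pad short non-inlined strings into the same layout before hashing.

```cpp
#include <cstdint>
#include <cstring>

// Illustrative branchless hash for short strings (not DuckDB's actual code).
// Assumption: inlined string_t data lives in a 12-byte buffer whose bytes
// past 'len' are zeroed. Two fixed 8-byte loads (overlapping in the middle)
// replace a length-dependent byte loop, so there is no branch on 'len'.
static inline uint64_t HashInlined12(const char *buf, uint32_t len) {
    uint64_t a, b;
    std::memcpy(&a, buf, 8);     // bytes 0..7
    std::memcpy(&b, buf + 4, 8); // bytes 4..11 (overlap is harmless for hashing)
    uint64_t h = (a ^ (b + len)) * 0x9E3779B97F4A7C15ULL; // stand-in mixer
    return h ^ (h >> 32);
}

// For a short string that is NOT inlined, copying it into a zeroed 12-byte
// scratch buffer first makes the hash agree with the inlined case:
static inline uint64_t HashShortPointer(const char *data, uint32_t len) {
    char scratch[12] = {0};          // zero padding mirrors the inline layout
    std::memcpy(scratch, data, len); // caller guarantees len <= 12
    return HashInlined12(scratch, len);
}
```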
Antonov548 added a commit to Antonov548/duckdb-r that referenced this pull request Mar 4, 2025
krlmlr pushed a commit to duckdb/duckdb-r that referenced this pull request Mar 5, 2025
theKingEpic pushed a commit to theKingEpic/duckdb that referenced this pull request Apr 9, 2025
theKingEpic added a commit to theKingEpic/duckdb that referenced this pull request Apr 9, 2025
theKingEpic added a commit to theKingEpic/duckdb that referenced this pull request Apr 9, 2025
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 15, 2025
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 15, 2025
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 17, 2025
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 18, 2025