Improve Parquet writer performance #16243
Merged
Conversation
…dd fast path to TemplatedWritePlain
… can deal with uuid/interval
Hah, I just finished implementing a better RLE/BP encoder as well, and was about to submit a PR. 😆

I also optimized the loop of the boolean column writer. Now that it has fewer branches, it's ~40% faster.

Thanks!
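Neither comment includes the code in question, but the branchless approach can be illustrated roughly: a boolean column writer packs eight values per byte, and the test-and-set inside that loop can be replaced by unconditionally OR-ing each (0/1) value into position. The function below is a minimal sketch, not DuckDB's actual boolean column writer:

```cpp
#include <cstddef>
#include <cstdint>

// Minimal sketch of a branchless boolean bit-packing loop. Parquet's PLAIN
// boolean encoding stores eight values per byte, LSB-first. Instead of
// testing each value and conditionally setting a bit, each (0/1) bool is
// OR-ed into position, so the loop body has no data-dependent branch.
static void PackBools(const bool *src, size_t count, uint8_t *dst) {
	const size_t full_bytes = count / 8;
	for (size_t i = 0; i < full_bytes; i++) {
		uint8_t byte = 0;
		for (size_t bit = 0; bit < 8; bit++) {
			// bool converts to exactly 0 or 1, so this is branch-free
			byte |= static_cast<uint8_t>(src[i * 8 + bit]) << bit;
		}
		dst[i] = byte;
	}
	if (count % 8 != 0) { // trailing partial byte
		uint8_t byte = 0;
		for (size_t bit = 0; bit < count % 8; bit++) {
			byte |= static_cast<uint8_t>(src[full_bytes * 8 + bit]) << bit;
		}
		dst[full_bytes] = byte;
	}
}
```

With no data-dependent branch in the inner loop, the compiler is free to unroll or vectorize it, which is the kind of change consistent with the ~40% speedup mentioned above.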
Mytherin added a commit that referenced this pull request on Feb 18, 2025:
Follow-up of #16243. Scraping the bottom of the barrel here, as the previous PR got many of the biggest performance gains already. This PR adds some more fast paths for when there are no `NULL`s, and implements a branchless hash function for `string_t`'s that are inlined. This required some extra care to make sure that the hash function returns the same value whether the string is inlined or not. Overall, the changes reduce the time it takes to write TPC-H SF10 `lineitem` to Parquet from ~2.6s to ~2.4s (with the default `PARQUET_VERSION V1`, ~2.5s to ~2.3s with `V2`).
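The commit itself is not shown here, but the idea behind a branchless hash for inlined strings can be sketched. Assume a short string is stored inline in a fixed 16-byte struct, 4 bytes of length plus 12 zero-padded data bytes (a simplification of DuckDB's `string_t`); the zero padding means the payload can be hashed with fixed-size loads rather than a per-byte loop over the length. The struct layout and mixing constants below are illustrative assumptions:

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical 16-byte layout of an inlined short string: a 4-byte length
// followed by up to 12 data bytes, zero-padded. This is a simplification of
// DuckDB's string_t, not its actual definition.
struct InlinedString {
	uint32_t length; // <= 12 in the inlined case
	char data[12];   // zero-padded beyond `length`
};

// Branchless hash of an inlined string: because the unused tail bytes are
// always zero, the fixed-size 12-byte payload can be read with two
// (overlapping) 8-byte loads instead of a length-dependent loop. The mixing
// steps and constants below are illustrative, not DuckDB's.
static uint64_t HashInlined(const InlinedString &str) {
	uint64_t a, b;
	std::memcpy(&a, str.data, sizeof(a));     // bytes 0..7
	std::memcpy(&b, str.data + 4, sizeof(b)); // bytes 4..11
	uint64_t h = a * 0x9E3779B97F4A7C15ULL;
	h ^= h >> 32;
	h += b * 0xC2B2AE3D27D4EB4FULL;
	h ^= h >> 29;
	return h + str.length;
}
```

The zero padding is what makes this valid: equal strings always yield identical input bytes. Matching the requirement described in the commit message, the non-inlined code path (not shown) would have to hash the same logical bytes to the same value.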
Antonov548 added a commit to Antonov548/duckdb-r that referenced this pull request on Mar 4, 2025:
Improve Parquet writer performance (duckdb/duckdb#16243)
krlmlr pushed a commit to duckdb/duckdb-r that referenced this pull request on Mar 5, 2025:
Improve Parquet writer performance (duckdb/duckdb#16243)
theKingEpic pushed a commit to theKingEpic/duckdb that referenced this pull request on Apr 9, 2025:
1. Slightly improve the performance of hashing long strings
2. Implement bit-packing for our `RleBpEncoder` (it only did RLE until now)
3. Implement a custom hash map to build Parquet dictionaries (instead of using `std::unordered_map`)
4. Add many fast paths, e.g., for handling `NULL` values when we know there are no `NULL`s

This brings the time it takes to write `lineitem` from TPC-H SF10 to Parquet from ~4.5s to ~2.5s on my laptop, as well as reducing the size from 2.4 GB to 2.0 GB (with the default `PARQUET_VERSION V1`). When enabling `PARQUET_VERSION V2`, time reduces from ~3.8s to ~2.4s on my laptop, and size reduces from 1.7 GB to 1.6 GB.

It also adds the parameter `string_dictionary_page_size_limit` to the Parquet writer, which makes the page size of dictionary pages configurable. It defaults to 1 MB.
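Item 2 of the description refers to Parquet's hybrid RLE/bit-packing encoding: a bit-packed run stores literal values in groups of eight behind a varint header whose low bit is set (a cleared low bit would mean an RLE run). The following sketch of a bit-packed run writer follows the Parquet spec's framing but is not DuckDB's actual `RleBpEncoder`:

```cpp
#include <cstdint>
#include <vector>

// Sketch of the bit-packed half of Parquet's RLE/BP hybrid encoding.
// Values are packed LSB-first in groups of 8, preceded by a ULEB128 varint
// header of the form (num_groups << 1) | 1.
static void WriteBitPackedRun(std::vector<uint8_t> &out, const uint32_t *values,
                              uint32_t count, uint32_t bit_width) {
	const uint32_t num_groups = (count + 7) / 8; // groups of 8 values each
	uint64_t header = (static_cast<uint64_t>(num_groups) << 1) | 1;
	while (header >= 0x80) { // ULEB128 varint encoding of the header
		out.push_back(static_cast<uint8_t>(header | 0x80));
		header >>= 7;
	}
	out.push_back(static_cast<uint8_t>(header));
	// pack values LSB-first into a bit buffer, flushing full bytes
	uint64_t buffer = 0;
	uint32_t bits_in_buffer = 0;
	for (uint32_t i = 0; i < num_groups * 8; i++) {
		uint32_t value = i < count ? values[i] : 0; // pad the last group with zeros
		buffer |= static_cast<uint64_t>(value) << bits_in_buffer;
		bits_in_buffer += bit_width;
		while (bits_in_buffer >= 8) {
			out.push_back(static_cast<uint8_t>(buffer & 0xFF));
			buffer >>= 8;
			bits_in_buffer -= 8;
		}
	}
	if (bits_in_buffer > 0) { // defensive; group padding makes this a no-op
		out.push_back(static_cast<uint8_t>(buffer & 0xFF));
	}
}
```

A real encoder additionally decides per run whether a stretch of values is cheaper as an RLE run (one repeated value) or as bit-packed literals.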
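Item 3, replacing `std::unordered_map`, typically means a flat open-addressing table: one contiguous allocation, linear probing, and no per-node heap allocations or pointer chasing. A minimal sketch follows (fixed power-of-two capacity, resizing omitted; not DuckDB's actual implementation):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Illustrative open-addressing dictionary map (string -> dictionary index).
// The caller must size the table so it never fills; a real implementation
// would grow it before the load factor gets too high.
struct DictMap {
	struct Slot {
		uint64_t hash = 0;
		int32_t index = -1; // -1 marks an empty slot
	};
	std::vector<Slot> slots;                // flat table, power-of-two size
	std::vector<std::string> dictionary;    // dense list of distinct values

	explicit DictMap(size_t capacity_pow2) : slots(capacity_pow2) {}

	static uint64_t Hash(const std::string &s) {
		uint64_t h = 14695981039346656037ULL; // FNV-1a
		for (unsigned char c : s) {
			h = (h ^ c) * 1099511628211ULL;
		}
		return h;
	}

	// Returns the dictionary index for `s`, inserting it if new.
	int32_t GetOrInsert(const std::string &s) {
		const uint64_t h = Hash(s);
		const size_t mask = slots.size() - 1;
		for (size_t pos = h & mask;; pos = (pos + 1) & mask) { // linear probing
			Slot &slot = slots[pos];
			if (slot.index < 0) { // empty: insert a new dictionary entry
				slot.hash = h;
				slot.index = static_cast<int32_t>(dictionary.size());
				dictionary.push_back(s);
				return slot.index;
			}
			if (slot.hash == h && dictionary[slot.index] == s) {
				return slot.index; // existing entry
			}
		}
	}
};
```

Linear probing keeps lookups cache-friendly; a production version would also store string payloads in an arena rather than as individual `std::string` objects.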
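The new `string_dictionary_page_size_limit` parameter surfaces as an option to the Parquet writer. Below is a usage sketch via the DuckDB C++ API; the option name is taken from the description above, while passing it as a `COPY` option with a plain byte count is an assumption about the exact syntax:

```cpp
#include "duckdb.hpp"

int main() {
	duckdb::DuckDB db(nullptr);
	duckdb::Connection con(db);
	con.Query("CREATE TABLE t AS "
	          "SELECT range AS i, 'val_' || range AS s FROM range(1000000)");
	// Raise the dictionary page size limit from the 1 MB default to 2 MB.
	// The option name comes from the PR description; passing the value as a
	// byte count is an assumption.
	auto result = con.Query("COPY t TO 't.parquet' "
	                        "(FORMAT parquet, STRING_DICTIONARY_PAGE_SIZE_LIMIT 2097152)");
	result->Print();
	return 0;
}
```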
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request on May 15, 2025:
Improve Parquet writer performance (duckdb/duckdb#16243)
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request on May 17, 2025:
Improve Parquet writer performance (duckdb/duckdb#16243)
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request on May 18, 2025:
Improve Parquet writer performance (duckdb/duckdb#16243)