Prefetch Parquet page header by lnkuiper · Pull Request #16507 · duckdb/duckdb · GitHub

Prefetch Parquet page header #16507


Merged 2 commits on Mar 6, 2025


lnkuiper (Contributor) commented Mar 4, 2025

This PR prefetches Parquet page headers so that each header is read with a single FileSystem::Read call. Previously, Thrift called into FileSystem::Read while decoding the page header, which resulted in many tiny reads (sometimes a few bytes at a time). The optimization only applies when no other prefetch has happened (so it only applies to local reads), and the prefetch reads up to 256 bytes (which should cover most Parquet page headers).
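To illustrate the idea (this is not DuckDB's code), here is a minimal Python sketch of serving many tiny reads from one prefetched window. `CountingFile`, `PrefetchedReader`, and `decode_header` are hypothetical names introduced for this example. The byte-at-a-time decoder is an assumption standing in for Thrift's small reads, and the only detail taken from the description above is the 256-byte window.

```python
import io


class CountingFile:
    """Hypothetical stand-in for a file system: counts every read call."""

    def __init__(self, data: bytes):
        self._buf = io.BytesIO(data)
        self.read_calls = 0

    def read_at(self, offset: int, size: int) -> bytes:
        self.read_calls += 1
        self._buf.seek(offset)
        return self._buf.read(size)


class PrefetchedReader:
    """Fetches one window up front and serves small reads from memory."""

    def __init__(self, file: CountingFile, offset: int, prefetch_size: int = 256):
        self._file = file
        self._start = offset
        # Single read that should cover the whole header in most cases.
        self._window = file.read_at(offset, prefetch_size)

    def read_at(self, offset: int, size: int) -> bytes:
        rel = offset - self._start
        if rel >= 0 and rel + size <= len(self._window):
            return self._window[rel:rel + size]
        # Request falls outside the window: fall back to the underlying file.
        return self._file.read_at(offset, size)


def decode_header(reader, offset: int, header_len: int) -> bytes:
    """Assumed decoder behavior: consumes the header one byte at a time."""
    return b"".join(reader.read_at(offset + i, 1) for i in range(header_len))


data = bytes(range(256)) * 4

direct = CountingFile(data)
decode_header(direct, 10, 40)

backing = CountingFile(data)
header = decode_header(PrefetchedReader(backing, 10), 10, 40)

print(direct.read_calls, backing.read_calls)
```

In this sketch, decoding a 40-byte header directly issues one read call per byte, while the prefetched reader issues a single call for the 256-byte window and answers the rest from memory. A header longer than the window would still work, only with extra reads for the part beyond it.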

When reading TPC-H SF10 lineitem using these queries:

copy lineitem to 'lineitem_v1.parquet' (parquet_version v1); -- with V1 encodings
copy lineitem to 'lineitem_v2.parquet' (parquet_version v2); -- with V2 encodings
select any_value(columns(*)) from 'lineitem_v1.parquet';
select any_value(columns(*)) from 'lineitem_v2.parquet';

Performance improves like so:
V1: 0.48s -> 0.43s
V2: 0.36s -> 0.29s

@duckdb-draftbot duckdb-draftbot marked this pull request as draft March 5, 2025 14:02
@lnkuiper lnkuiper marked this pull request as ready for review March 5, 2025 14:03
lnkuiper (Contributor, Author) commented Mar 6, 2025

test/sql/aggregate/aggregates/histogram_table_function.test fails, but the failure is unrelated to this PR.

@lnkuiper lnkuiper requested a review from samansmink March 6, 2025 11:31
samansmink (Contributor) left a comment

LGTM!

@Mytherin Mytherin merged commit cc13a64 into duckdb:main Mar 6, 2025
53 of 54 checks passed
Mytherin (Collaborator) commented Mar 6, 2025

Thanks!

@lnkuiper lnkuiper deleted the parquet_page_header branch April 14, 2025 09:10
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 15, 2025
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 15, 2025
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 16, 2025
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 17, 2025
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 18, 2025
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 18, 2025