8000 Support late materialization in the Parquet reader, and handle `COUNT(*)` directly in the multi file reader by Mytherin · Pull Request #17325 · duckdb/duckdb · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Support late materialization in the Parquet reader, and handle COUNT(*) directly in the multi file reader #17325

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
May 2, 2025

Conversation

Mytherin
Copy link
Collaborator
@Mytherin Mytherin commented May 1, 2025

This PR generalizes the late materialization optimizer introduced in #15692 - allowing it to be used for the Parquet reader.

In particular, the TableFunction is extended with an extra callback that allows specifying the relevant row-id columns:

typedef vector<column_t> (*table_function_get_row_id_columns)(ClientContext &context,
                                                              optional_ptr<FunctionData> bind_data);

This is then used by the Parquet reader to specify the two row-id columns: file_index (#17144) and file_row_number (#16979). Top-N , sample and limit/offset queries are then transformed into a join on the relevant row-id columns. For example:

SELECT * FROM lineitem.parquet ORDER BY l_extendedprice DESC LIMIT 5;

-- becomes

SELECT * FROM lineitem.parquet WHERE (file_index, file_row_number) IN (
    SELECT file_index, file_row_number FROM lineitem.parquet ORDER BY l_extendedprice DESC LIMIT 5)
ORDER BY l_extendedprice DESC;

Performance

SELECT * FROM lineitem.parquet ORDER BY l_extendedprice DESC LIMIT 5;
v1.2.1 main new
0.19s 0.14s 0.06s
SELECT * FROM lineitem.parquet ORDER BY l_orderkey DESC LIMIT 5;
v1.2.1 main new
0.73s 0.53s 0.06s
SELECT * FROM lineitem.parquet LIMIT 1000000 OFFSET 10000000;
v1.2.1 main new
1.6s 1.2s 0.14s

Refactor

I've also moved the ParquetMultiFileInfo to a separate file as part of this PR - which is most of the changes here.

@duckdb-draftbot duckdb-draftbot marked this pull request as draft May 1, 2025 17:23
@Mytherin Mytherin marked this pull request as ready for review May 1, 2025 17:25
@Mytherin Mytherin merged commit ced12aa into duckdb:main May 2, 2025
49 checks passed
@djouallah
Copy link

@Mytherin do you have any plan to extend this to filtering ? for example

from './orders/*.parquet' where o_orderkey = 1

@Mytherin
Copy link
Collaborator Author
Mytherin commented May 3, 2025

Filtering would benefit far less from this - most of the benefits are already gained from pushing filters into the scan (which we already do).

It is also much harder to do for filters since we need to make the decision of whether or not to use late materialization during planning - and it is only beneficial when the result set is very small. That means we need to accurately predict that the result of a filter is small during optimization time which is difficult.

Perhaps something we could do for filtering is e.g. dynamic prefetching, where we choose whether or not to prefetch other columns based on how selective filters are - if filters are very selective we stop pre-emptively prefetching columns. But that's not directly related to this PR of course.

krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 18, 2025
Support late materialization in the Parquet reader, and handle `COUNT(*)` directly in the multi file reader (duckdb/duckdb#17325)
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 18, 2025
Support late materialization in the Parquet reader, and handle `COUNT(*)` directly in the multi file reader (duckdb/duckdb#17325)
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 19, 2025
Support late materialization in the Parquet reader, and handle `COUNT(*)` directly in the multi file reader (duckdb/duckdb#17325)
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 19, 2025
Support late materialization in the Parquet reader, and handle `COUNT(*)` directly in the multi file reader (duckdb/duckdb#17325)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants
0