Support late materialization in the Parquet reader, and handle `COUNT(*)` directly in the multi file reader
#17325
Conversation
@Mytherin do you have any plans to extend this to filtering?
Filtering would benefit far less from this - most of the benefits are already gained from pushing filters into the scan (which we already do). It is also much harder to do for filters, since we need to decide whether or not to use late materialization during planning - and it is only beneficial when the result set is very small. That means we need to accurately predict, at optimization time, that the result of a filter is small, which is difficult. Perhaps something we could do for filtering is e.g. dynamic prefetching, where we choose whether or not to prefetch other columns based on how selective the filters are - if the filters are very selective, we stop pre-emptively prefetching columns. But that's not directly related to this PR, of course.
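The dynamic-prefetching idea mentioned above could be sketched roughly as follows. This is a minimal illustration in plain Python, not DuckDB code; the class, threshold, and method names are all hypothetical. The scanner tracks observed filter selectivity and stops eagerly prefetching payload columns once the filter proves very selective:

```python
# Hypothetical sketch of selectivity-driven prefetching (not DuckDB's actual code).
# While scanning, track how many rows survive the pushed-down filter; once the
# observed selectivity drops below a threshold, stop eagerly prefetching the
# payload columns and fetch them lazily for matching rows only.

PREFETCH_SELECTIVITY_THRESHOLD = 0.1  # assumed tuning knob
MIN_ROWS_FOR_DECISION = 1024          # assumed warm-up before trusting the estimate

class AdaptivePrefetcher:
    def __init__(self):
        self.rows_seen = 0
        self.rows_matched = 0

    def observe(self, batch_size, matched):
        # Called after each scanned batch with its size and match count.
        self.rows_seen += batch_size
        self.rows_matched += matched

    def should_prefetch_payload(self):
        # Keep prefetching until there is enough evidence the filter is selective.
        if self.rows_seen < MIN_ROWS_FOR_DECISION:
            return True
        return (self.rows_matched / self.rows_seen) >= PREFETCH_SELECTIVITY_THRESHOLD
```

A real implementation would also have to cope with selectivity that varies across row groups, which is part of why this is harder than the static rewrite this PR performs.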
This PR generalizes the late materialization optimizer introduced in #15692, allowing it to be used by the Parquet reader.
In particular, the `TableFunction` is extended with an extra callback that allows specifying the relevant row-id columns. This is then used by the Parquet reader to specify its two row-id columns: `file_index` (#17144) and `file_row_number` (#16979). Top-N, sample, and limit/offset queries are then transformed into a join on the relevant row-id columns.

Performance
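The row-id join rewrite described in the description above can be illustrated with a small sketch. This is plain Python over in-memory data, not DuckDB source; the function names and the single integer row id (standing in for the `file_index`/`file_row_number` pair) are assumptions for illustration. A Top-N query first scans only the ORDER BY column together with the row ids, and only then fetches the remaining columns for the few surviving rows:

```python
import heapq

# Illustrative sketch of late materialization for a Top-N query such as
#   SELECT * FROM t ORDER BY score DESC LIMIT n
# Pass 1 reads only the ORDER BY column plus a row id; pass 2 joins the
# surviving row ids back to the table to fetch the remaining columns.

def top_n_late_materialized(read_sort_column, fetch_rows, n):
    # Pass 1: Top-N over (score, row_id) pairs only - cheap, narrow scan.
    top = heapq.nlargest(n, read_sort_column())
    row_ids = [rid for _, rid in top]
    # Pass 2: materialize the full rows for just those row ids.
    return fetch_rows(row_ids)
```

The benefit is that the wide columns are decoded only for the `n` rows that survive, instead of for every row scanned, which is where the speedups reported for the Parquet reader come from.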
Refactor

I've also moved `ParquetMultiFileInfo` to a separate file as part of this PR, which accounts for most of the changes here.