Support late materialization in the Parquet reader, and handle `COUNT(*)` directly in the multi file reader #17325

Mytherin · 2025-05-01T14:13:33Z

This PR generalizes the late materialization optimizer introduced in #15692 - allowing it to be used for the Parquet reader.

In particular, the TableFunction is extended with an extra callback that allows specifying the relevant row-id columns:

typedef vector<column_t> (*table_function_get_row_id_columns)(ClientContext &context,
                                                              optional_ptr<FunctionData> bind_data);
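
To illustrate how such a callback might be wired up, here is a minimal self-contained sketch. It does not use DuckDB's headers: `ClientContext`, `FunctionData` and `column_t` are empty stand-ins, a raw pointer replaces `optional_ptr`, and `ExampleTableFunction`, `ExampleParquetGetRowIdColumns`, `CanLateMaterialize` and the two column identifier values are hypothetical names made up for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for DuckDB types; the real definitions differ.
using column_t = std::uint64_t;
struct ClientContext {};
struct FunctionData {
    virtual ~FunctionData() = default;
};

// Same shape as the callback above, with a raw pointer standing in for optional_ptr.
typedef std::vector<column_t> (*table_function_get_row_id_columns)(ClientContext &context,
                                                                  FunctionData *bind_data);

// Made-up identifiers for the two virtual columns (not DuckDB's real values).
constexpr column_t EXAMPLE_FILE_INDEX = 1000;
constexpr column_t EXAMPLE_FILE_ROW_NUMBER = 1001;

// Hypothetical table function: the callback stays nullptr when rows cannot be identified.
struct ExampleTableFunction {
    table_function_get_row_id_columns get_row_id_columns = nullptr;
};

// A reader that can identify a row by (file index, row number within the file).
std::vector<column_t> ExampleParquetGetRowIdColumns(ClientContext &, FunctionData *) {
    return {EXAMPLE_FILE_INDEX, EXAMPLE_FILE_ROW_NUMBER};
}

// Optimizer-side check: the rewrite is only possible if row-id columns are reported.
bool CanLateMaterialize(const ExampleTableFunction &function, ClientContext &context,
                        FunctionData *bind_data) {
    if (!function.get_row_id_columns) {
        return false;
    }
    return !function.get_row_id_columns(context, bind_data).empty();
}
```

In this sketch a function that leaves the callback unset simply opts out of the rewrite, so existing table functions would need no changes.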

This is then used by the Parquet reader to specify the two row-id columns: file_index (#17144) and file_row_number (#16979). Top-N, sample, and limit/offset queries are then transformed into a join on the relevant row-id columns. For example:

SELECT * FROM lineitem.parquet ORDER BY l_extendedprice DESC LIMIT 5;

-- becomes

SELECT * FROM lineitem.parquet WHERE (file_index, file_row_number) IN (
    SELECT file_index, file_row_number FROM lineitem.parquet ORDER BY l_extendedprice DESC LIMIT 5)
ORDER BY l_extendedprice DESC;
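
The effect of this rewrite can be sketched as a two-phase scan. The following is an illustrative in-memory model, not the reader's actual code: `Files`, `Row`, `RowId`, `TopNRowIds` and `FetchRows` are hypothetical names, and a nested vector stands in for a set of Parquet files. Phase 1 touches only the sort key and the row-id columns; phase 2 reads the remaining columns for the few selected rows.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for a set of Parquet files: files[file_index][file_row_number].
struct Row {
    double sort_key;      // e.g. l_extendedprice
    std::int64_t payload; // stands in for all the other (wide) columns
};
using Files = std::vector<std::vector<Row>>;

struct RowId {
    std::size_t file_index;
    std::size_t file_row_number;
};

// Phase 1: scan only the sort key plus the row-id columns and keep the top n (descending).
std::vector<RowId> TopNRowIds(const Files &files, std::size_t n) {
    std::vector<std::pair<double, RowId>> keys;
    for (std::size_t f = 0; f < files.size(); f++) {
        for (std::size_t r = 0; r < files[f].size(); r++) {
            keys.push_back(std::make_pair(files[f][r].sort_key, RowId{f, r}));
        }
    }
    n = std::min(n, keys.size());
    std::partial_sort(keys.begin(), keys.begin() + static_cast<std::ptrdiff_t>(n), keys.end(),
                      [](const std::pair<double, RowId> &a, const std::pair<double, RowId> &b) {
                          return a.first > b.first;
                      });
    std::vector<RowId> result;
    for (std::size_t i = 0; i < n; i++) {
        result.push_back(keys[i].second);
    }
    return result;
}

// Phase 2: fetch the full rows only for the selected row ids, in the order given.
std::vector<Row> FetchRows(const Files &files, const std::vector<RowId> &ids) {
    std::vector<Row> rows;
    for (const RowId &id : ids) {
        assert(id.file_index < files.size());
        assert(id.file_row_number < files[id.file_index].size());
        rows.push_back(files[id.file_index][id.file_row_number]);
    }
    return rows;
}
```

One difference from the SQL form: the sketch keeps the phase 1 order when fetching, whereas the rewritten query selects rows with a semi-join and therefore needs the outer `ORDER BY` to restore the order.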

Performance

SELECT * FROM lineitem.parquet ORDER BY l_extendedprice DESC LIMIT 5;

| v1.2.1 | main  | new   |
|--------|-------|-------|
| 0.19s  | 0.14s | 0.06s |

SELECT * FROM lineitem.parquet ORDER BY l_orderkey DESC LIMIT 5;

| v1.2.1 | main  | new   |
|--------|-------|-------|
| 0.73s  | 0.53s | 0.06s |

SELECT * FROM lineitem.parquet LIMIT 1000000 OFFSET 10000000;

| v1.2.1 | main | new   |
|--------|------|-------|
| 1.6s   | 1.2s | 0.14s |

Refactor

I've also moved the ParquetMultiFileInfo to a separate file as part of this PR, which accounts for most of the changes here.

djouallah · 2025-05-03T09:09:52Z

@Mytherin do you have any plans to extend this to filtering? For example:

from './orders/*.parquet' where o_orderkey = 1

Mytherin · 2025-05-03T10:30:31Z

Filtering would benefit far less from this - most of the benefits are already gained from pushing filters into the scan (which we already do).

It is also much harder to do for filters since we need to make the decision of whether or not to use late materialization during planning - and it is only beneficial when the result set is very small. That means we need to accurately predict at optimization time that the result of a filter is small, which is difficult.

Perhaps something we could do for filtering is e.g. dynamic prefetching, where we choose whether or not to prefetch other columns based on how selective filters are - if filters are very selective we stop pre-emptively prefetching columns. But that's not directly related to this PR of course.
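
As a rough illustration of the dynamic prefetching idea, here is a hypothetical heuristic that tracks how many rows pass the pushed-down filters and stops prefetching the other columns once the pass rate drops below a threshold. `PrefetchPolicy`, its methods and the threshold are all assumptions made up for this sketch; nothing like this is part of the PR.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical heuristic; the names and the threshold are illustrative, not DuckDB's.
class PrefetchPolicy {
public:
    explicit PrefetchPolicy(double min_pass_rate) : min_pass_rate_(min_pass_rate) {
        assert(min_pass_rate >= 0.0 && min_pass_rate <= 1.0);
    }

    // Record the outcome of evaluating the pushed-down filters on one batch.
    void Observe(std::uint64_t rows_scanned, std::uint64_t rows_passed) {
        assert(rows_passed <= rows_scanned);
        scanned_ += rows_scanned;
        passed_ += rows_passed;
    }

    // Prefetch the non-filter columns only while enough rows survive the filters.
    bool ShouldPrefetchOtherColumns() const {
        if (scanned_ == 0) {
            return true; // no evidence yet: keep the eager default
        }
        return static_cast<double>(passed_) / static_cast<double>(scanned_) >= min_pass_rate_;
    }

private:
    double min_pass_rate_;
    std::uint64_t scanned_ = 0;
    std::uint64_t passed_ = 0;
};
```

Unlike late materialization, a policy like this would be decided at run time from observed selectivity, so it would not depend on predicting the result size during optimization.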

Mytherin added 10 commits May 1, 2025 10:20

Move ParquetMultiFileInfo to a separate file

7dd0386

Fully handle COLUMN_IDENTIFIER_EMPTY virtual column in the MultiFileReader itself

c7880ba

Count star test

9dc0703

Add support for callback to get row id columns from table function - and use that in late materialization function

d748f78

Make late materialization work with parquet reads

43c380b

Pushdown file row number into RowNumberColumnReader - use stats to perform early pruning

ec1f074

More tests and benchmarks

3b7bdb6

Merge branch 'main' into multifilereaderrowid

804f571

Add missing include

fc8a0e2

Remove these virtual columns from json/csv readers

783ff0b

Mytherin mentioned this pull request May 1, 2025

Let LogicalGet::GetAnyColumn get the file row number #17277

Closed

Mytherin added 2 commits May 1, 2025 19:22

Fix for deserialization of virtual columns

f20fbc1

Format

2f0103e

duckdb-draftbot marked this pull request as draft May 1, 2025 17:23

Mytherin marked this pull request as ready for review May 1, 2025 17:25

Mytherin merged commit ced12aa into duckdb:main May 2, 2025
49 checks passed

krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 18, 2025

vendor: Update vendored sources to duckdb/duckdb@ced12aa

233a028

Support late materialization in the Parquet reader, and handle `COUNT(*)` directly in the multi file reader (duckdb/duckdb#17325)

krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 18, 2025

vendor: Update vendored sources to duckdb/duckdb@ced12aa

52b7545

Support late materialization in the Parquet reader, and handle `COUNT(*)` directly in the multi file reader (duckdb/duckdb#17325)

krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 19, 2025

vendor: Update vendored sources to duckdb/duckdb@ced12aa

ac3bb5d

Support late materialization in the Parquet reader, and handle `COUNT(*)` directly in the multi file reader (duckdb/duckdb#17325)

krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 19, 2025

vendor: Update vendored sources to duckdb/duckdb@ced12aa

69ad182

Support late materialization in the Parquet reader, and handle `COUNT(*)` directly in the multi file reader (duckdb/duckdb#17325)
