8000 MultiFileReader Rework (part 17) - remove `MultiFileReaderData` - and move as much as possible out of the file readers by Mytherin · Pull Request #16882 · duckdb/duckdb · GitHub


Merged: 29 commits into duckdb:main on Mar 29, 2025

Conversation

@Mytherin (Collaborator) commented on Mar 28, 2025

This PR is a follow-up of #16838

The above PR laid the groundwork for removing the cast map and the global column definitions from the file reader. This PR finalizes that work and (almost*) removes the need for the base file reader to know about the existence of the multi file reader and its mysterious constructs.

The MultiFileReaderData is now gone, and everything that used to be there has been moved into the BaseFileReader instead:

class BaseFileReader {
public:
    //! The name of the file we are reading
    string file_name;
    //! (Optionally) The file index (generated by the multi file reader)
    optional_idx file_list_idx;
    //! The set of columns for the current file
    vector<MultiFileColumnDefinition> columns;
    //! The column ids to read from the file
    vector<MultiFileLocalColumnId> column_ids;
    //! The column indexes to read from the file
    vector<ColumnIndex> column_indexes;
    //! The set of table filters (adjusted to local indexes)
    unique_ptr<TableFilterSet> filters;
    //! Expression to execute for a given column (BEFORE executing the filter)
    //! NOTE: this is only set when we have filters - it can be ignored for readers that don't have filter pushdown
    unordered_map<column_t, unique_ptr<Expression>> expression_map;
    //! The final types for various columns - this is ONLY used if UseCastMap() is explicitly enabled
    unordered_map<column_t, LogicalType> cast_map;
};

Table Filters

Table filters pose a problem when pushed down into the scanners because of a potential type mismatch. Table filters are defined in the global type, since that is the type that is visible in the query. As such, we cannot push them down onto the file's columns without converting the filter.

What's more - it is not always possible to convert this filter. This applies to comparisons - but becomes more obvious when expressions are involved in filters - e.g. the filter substring(col, 1, 3) = '100' clearly requires col to be of type VARCHAR - we cannot transform this into a filter on an INTEGER column.

This PR attempts to push down table filters where possible, performing the type conversion in the process. This is generally possible with integer types (e.g. we can convert a filter on type INT64 to a filter on type INT32). This happens in this code snippet:

unique_ptr<TableFilter> local_filter;
if (local_type == global_type) {
    // no conversion required - just copy the filter
    local_filter = global_filter.Copy();
} else {
    // types are different - try to convert
    // first check if we can safely convert (i.e. if the conversion is lossless and would not change the result)
    if (StatisticsPropagator::CanPropagateCast(local_type, global_type)) {
        // if we can convert - try to actually convert
        local_filter = TryCastTableFilter(global_filter, map_entry.mapping, local_type);
    }
}

However, we need to have a fallback when the conversion is not possible. What happens in this case is that we push the transformation expression into a (new) expression_map. This expression must be evaluated in the reader prior to executing the filter:

// add the expression to the expression map - we are now evaluating this inside the reader directly
// we need to set the index of the references inside the expression to 0
SetIndexToZero(*reader_data.expressions[local_id]);
reader.expression_map[filter_idx] = std::move(reader_data.expressions[local_id]);

// reset the expression - since we are evaluating it in the reader we can just reference it
reader_data.expressions[local_id] = make_uniq<BoundReferenceExpression>(global_type, local_id);

While this sounds more complex than the cast map, this approach has several advantages:

  • The table filter transformation is required to take advantage of table filters for e.g. row group pruning. Previously this logic existed in the Parquet reader - this is now generalized to the multi file reader.
  • By using an expression map instead of a cast map we can support more complex expressions - this is not required yet in this PR, but will be going forward (e.g. for struct evolution)
  • The expression_map is only used when filters are enabled - this is currently only the case for the Parquet reader. All other readers are simplified and don't need to worry about this at all.

Cast Map

The cast_map has not been removed, but is not used by default. It still exists, and readers can opt in to using it by setting the UseCastMap() flag. This is only used by the CSV reader.

The reason this is relevant for the CSV reader is that, unlike e.g. Parquet, CSV files don't have "strict" types - so we actually want types to be pushed down into the CSV reader directly. This prevents us from inferring the wrong types and throwing errors unnecessarily. In addition, the CSV reader can directly perform casts while reading, which makes this faster.

Re-organization

In order to make reviewing as hard as possible - I've also reorganized the folders and class names of the various multi file reader classes. For many classes, I've made the prefix MultiFile instead of MultiFileReader (e.g. MultiFileReaderOptions -> MultiFileOptions). I've also moved the classes to a separate folder (common/multi_file) and split up files. The column mapping logic has been moved from the MultiFileReader to a separate class - the MultiFileColumnMapper.

MultiFileReaderData is dead. Long live MultiFileReaderData.

The previously named MultiFileReaderData is gone - but the unfortunately named MultiFileFileReaderData has now been renamed to MultiFileReaderData.

Mytherin added 28 commits March 27, 2025 12:08
@duckdb-draftbot duckdb-draftbot marked this pull request as draft March 28, 2025 12:44
@Mytherin Mytherin marked this pull request as ready for review March 28, 2025 13:08
@duckdb-draftbot duckdb-draftbot marked this pull request as draft March 28, 2025 17:33
@Mytherin Mytherin marked this pull request as ready for review March 28, 2025 17:33
@Mytherin Mytherin merged commit 4739c3a into duckdb:main Mar 29, 2025
50 checks passed
@Mytherin Mytherin deleted the multifilereader branch April 2, 2025 09:23
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 15, 2025
MultiFileReader Rework (part 17) - remove `MultiFileReaderData` - and move as much as possible out of the file readers (duckdb/duckdb#16882)