[MultiFileReader] Create "local" filters to hand to underlying readers #16838

Tishj · 2025-03-26T09:15:20Z

This PR is a follow-up to #16630.

The aim of this PR is to let the underlying reader handle pushed-down filters correctly without relying on the complicated mappings created in the MultiFileReaderData.

Previously, the TableFilterSet handed to the underlying reader contained "global" filters, i.e. the column_ids referenced the "global" schema, which is not guaranteed to match the schema of the file that the reader is reading.

To convert these into the "local" schema (that of the file being read), the reader had to look up each filter in the filter_map and then check whether the entry points to a column that is present in the file. If it does not, the entry instead points to the entry in the constant_map that should be used.

After this PR, the underlying reader is handed "local" filters that already contain column_ids referencing the "local" schema.
The reader will never see a filter that references a column that is not present in the file, because those filters have been evaluated beforehand.

These filters (the ones targeting columns that are not in the file) can be safely removed. If the filter doesn't match, we skip the file entirely and it does not have to be scanned. If the filter does match, it holds for every row in the file and can be safely ignored by the reader.
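
The conversion described above can be sketched roughly as follows. This is a simplified illustration only: `Predicate`, `LocalFilterResult` and `CreateLocalFilters` are hypothetical stand-ins invented for this example and are not DuckDB's actual types or functions. It assumes every filter targets a single integer column and that a column missing from the file has one constant value for the whole file.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>

// Hypothetical stand-in for a table filter: a predicate on a single value.
using Predicate = std::function<bool(int64_t)>;

// Hypothetical result type: either "skip the file" or a set of local filters.
struct LocalFilterResult {
	bool skip_file = false;              // a filter on a constant column was false
	std::map<size_t, Predicate> filters; // local column id -> filter
};

// global_to_local: global column id -> local column id (columns present in the file)
// constants:       global column id -> constant value (columns absent from the file)
LocalFilterResult CreateLocalFilters(const std::map<size_t, Predicate> &global_filters,
                                     const std::map<size_t, size_t> &global_to_local,
                                     const std::map<size_t, int64_t> &constants) {
	LocalFilterResult result;
	for (const auto &entry : global_filters) {
		auto local = global_to_local.find(entry.first);
		if (local != global_to_local.end()) {
			// the column exists in the file: re-key the filter to the local column id
			result.filters.emplace(local->second, entry.second);
			continue;
		}
		// the column is not in the file: evaluate the filter on its constant right away
		auto constant = constants.find(entry.first);
		assert(constant != constants.end());
		if (!entry.second(constant->second)) {
			// no row of this file can pass, so the file does not need to be scanned
			result.skip_file = true;
			result.filters.clear();
			return result;
		}
		// the filter passes for every row of this file, so it is dropped
	}
	return result;
}
```

With this shape the reader only ever receives filters keyed by local column ids, and the decision to skip a file is made before the file is opened.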

Open questions / Areas for future work discovered

Copied `child_indexes` issue

The child_indexes of the global ColumnIndex are still copied 1-to-1 when the mapping is created:

			//! FIXME: local fields are not guaranteed to match with the global fields for this struct
			local_index = ColumnIndex(local_id.GetId(), global_id.GetChildIndexes());
		}

As the fixme says, this probably shouldn't happen, because schema evolution could create a divide between the global and local fields for a given struct.
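
To make the risk concrete, here is a small sketch of what remapping a child index could look like instead of copying it. Everything in it is an assumption made for illustration: `RemapChildIndex` and `INVALID_CHILD_INDEX` are hypothetical names, struct fields are reduced to a list of names, and fields are matched by name, which does not cover renamed fields.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sentinel for "this field does not exist in the local struct".
static const size_t INVALID_CHILD_INDEX = static_cast<size_t>(-1);

// Hypothetical helper: translate a child index of the global struct into the
// matching child index of the local (per-file) struct by comparing field names.
size_t RemapChildIndex(const std::vector<std::string> &global_fields,
                       const std::vector<std::string> &local_fields,
                       size_t global_child_idx) {
	assert(global_child_idx < global_fields.size());
	const std::string &name = global_fields[global_child_idx];
	for (size_t i = 0; i < local_fields.size(); i++) {
		if (local_fields[i] == name) {
			return i;
		}
	}
	return INVALID_CHILD_INDEX;
}
```

For a global struct with fields `a, b, c` and a file whose struct only has `c, a`, the global child index 2 (`c`) corresponds to local index 0. Copying the index 1-to-1 would point past the end of the local struct.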

`StructFilter` schema evolution

Besides a child_idx (which suffers from the issue mentioned above), the StructFilter also has a child_name.
Through schema evolution the field could be renamed. This information should probably be added to the MultiFileReaderColumnDefinition to properly handle renames of struct fields.
Alternatively, that information may already be present through the LogicalType attached to the column definition, in which case it needs to be used properly.

`cast_map` casts

The cast_map is still present and must still be respected by the underlying reader, because delaying this cast would complicate the filtering logic.

If we look at the following flow:
MFR <-> READER <-> FILE

Take the Parquet scanner as an example.
Currently, the cast_map is applied in the translation from the FILE to the READER, through a CastColumnReader.
This reader isn't anything special: it just applies a DefaultCastAs to the vector read from the reader it wraps. This can easily be replaced by a BoundCastExpression that is applied in the FinalizeChunk.

The problem with moving this cast up is that it would then be applied in the translation from the READER to the MFR.
This is a problem because the filter pushdown is done inside the READER (after the cast is performed), so the filters use (and expect) the "global" type, the target of the cast.

To move the cast up into the translation from the READER to the MFR, the filters that are sent to the reader would need to be changed to work on the "local" type instead, which is not an easy refactor.
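
As a rough illustration of why this refactor is not trivial, consider a file that stores a column as a 32-bit integer while the global type is a 64-bit integer. A filter written against the global type can only be rewritten for the local type when its constant is representable there. The sketch below is hypothetical: `Int64Filter`, `Int32Filter` and `TryNarrowFilter` are made-up names for this example, not DuckDB's filter classes.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

enum class Comparison { EQUAL, GREATER_THAN, LESS_THAN };

// Hypothetical constant-comparison filters on the "global" and "local" types.
struct Int64Filter {
	Comparison op;
	int64_t constant;
};
struct Int32Filter {
	Comparison op;
	int32_t constant;
};

// Try to rewrite a filter on the global type (INT64) into one on the local type (INT32).
// Returns false when the constant does not fit, in which case the caller needs a
// fallback (for example, casting the column first and filtering afterwards).
bool TryNarrowFilter(const Int64Filter &global, Int32Filter &local) {
	if (global.constant < std::numeric_limits<int32_t>::min() ||
	    global.constant > std::numeric_limits<int32_t>::max()) {
		return false;
	}
	local.op = global.op;
	local.constant = static_cast<int32_t>(global.constant);
	// the narrowed constant must round-trip to the original value
	assert(static_cast<int64_t>(local.constant) == global.constant);
	return true;
}
```

A filter such as `x > 100` narrows without loss, while `x > 3000000000` cannot be expressed with a 32-bit constant and has to take the fallback path.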


Mytherin

Thanks! LGTM - one minor comment

src/common/multi_file_reader.cpp


Mytherin · 2025-03-27T08:41:00Z

Thanks! LGTM - failures are unrelated

… move as much as possible out of the file readers (#16882)

This PR is a follow-up of #16838

The above PR laid the groundwork for removing the cast map and global column definitions from the file reader. This PR finalizes that work and (almost*) removes the need for the base file reader to know about the existence of the multi file reader and its mysterious constructs.

The `MultiFileReaderData` is now gone, and everything that used to be there has been moved into the `BaseFileReader` instead:

```cpp
class BaseFileReader {
public:
	//! The name of the file we are reading
	string file_name;
	//! (Optionally) The file index (generated by the multi file reader)
	optional_idx file_list_idx;
	//! The set of columns for the current file
	vector<MultiFileColumnDefinition> columns;
	//! The column ids to read from the file
	MultiFileLocalColumnIds<MultiFileLocalColumnId> column_ids;
	//! The column indexes to read from the file
	vector<ColumnIndex> column_indexes;
	//! The set of table filters (adjusted to local indexes)
	unique_ptr<TableFilterSet> filters;
	//! Expression to execute for a given column (BEFORE executing the filter)
	//! NOTE: this is only set when we have filters - it can be ignored for readers that don't have filter pushdown
	unordered_map<column_t, unique_ptr<Expression>> expression_map;
	//! The final types for various expressions - this is ONLY used if UseCastMap() is explicitly enabled
	unordered_map<column_t, LogicalType> cast_map;
};
```

### Table Filters

Table filters form a problem when pushing down into the scanners, because of a potential type mismatch. Table filters are defined in the *global* type, since that is the type that is visible in the query. As such, we cannot push them down into the file's columns without converting the filter. What's more - it is not always possible to convert this filter. This applies to comparisons - but becomes more obvious when expressions are involved in filters - e.g. the filter `substring(col, 1, 3) = '100'` clearly requires `col` to be of type `VARCHAR` - we cannot transform this into a filter on an `INTEGER` column.

This PR attempts to push down table filters when possible, performing the type conversion where needed. This is generally possible with integer types (i.e. we can convert a filter on type `INT64` to a filter on type `INT32`). This happens in this code snippet:

```cpp
unique_ptr<TableFilter> local_filter;
if (local_type == global_type) {
	// no conversion required - just copy the filter
	local_filter = global_filter.Copy();
} else {
	// types are different - try to convert
	// first check if we can safely convert (i.e. if the conversion is lossless would not change the result)
	if (StatisticsPropagator::CanPropagateCast(local_type, global_type)) {
		// if we can convert - try to actually convert
		local_filter = TryCastTableFilter(global_filter, map_entry.mapping, local_type);
	}
}
```

However, we need to have a fallback when the conversion is not possible. What happens in this case is that we push the transformation *expression* into a (new) **expression_map**. This expression must be evaluated in the reader prior to executing the filter:

```cpp
// add the expression to the expression map - we are now evaluating this inside the reader directly
// we need to set the index of the references inside the expression to 0
SetIndexToZero(*reader_data.expressions[local_id]);
reader.expression_map[filter_idx] = std::move(reader_data.expressions[local_id]);
// reset the expression - since we are evaluating it in the reader we can just reference it
reader_data.expressions[local_id] = make_uniq<BoundReferenceExpression>(global_type, local_id);
```

While this sounds more complex than the cast map - this approach has several advantages:

* The table filter transformation is required to take advantage of table filters for e.g. row group pruning. Previously this logic existed in the Parquet reader - this is now generalized to the multi file reader.
* By using an expression map instead of a cast map we can support more complex expressions - this is not required yet in this PR but will be required going forward (for e.g. struct evolution)
* The expression_map is only used when filters are enabled - this is currently only the case for the Parquet reader. All other readers are simplified and don't need to worry about this at all.

### Cast Map

The cast_map has **not** been removed, but is not used by default. It still exists, and readers can opt in to using it by setting the `UseCastMap()` flag. This is only used by the CSV reader.

The reason this is relevant for the CSV reader is that, unlike e.g. Parquet, CSV files don't have "strict" types - so we actually want types to be pushed down into the CSV reader directly. This prevents us from inferring the wrong types and throwing errors unnecessarily. In addition, the CSV reader can also directly perform casts while reading which makes this faster.

### Re-organization

In order to make reviewing as hard as possible - I've also reorganized the folders and class names of the various multi file reader classes. For many classes, I've made the prefix `MultiFile` instead of `MultiFileReader` (e.g. `MultiFileReaderOptions` -> `MultiFileOptions`). I've also moved the classes to a separate folder (`common/multi_file`) and split up files.

The column mapping logic has been moved from the `MultiFileReader` to a separate class - the `MultiFileColumnMapper`.

### MultiFileReaderData is dead. Long live MultiFileReaderData.

The previously named `MultiFileReaderData` is gone - but the unfortunately named `MultiFileFileReaderData` has now been renamed to `MultiFileReaderData`.

[MultiFileReader] Create "local" filters to hand to underlying readers (duckdb/duckdb#16838) Fix Python docstrings for unique (duckdb/duckdb#16845)

Tishj added 27 commits March 19, 2025 13:14

WIP, filters for non-existant columns (not present in the file) need to be evaluated here, skipping the file in its entirety if the filter is false

385940d

add a clarifying comment to TableFilterSet

0e4b932

Merge branch 'multi_file_reader_finalize_chunk' into multi_file_reader_create_filters

c4c8c94

Merge remote-tracking branch 'upstream/main' into multi_file_reader_create_filters

9d1c486

add some semantics and assertions to the file get/opening logic in the multi file reader function

6144376

rewrite the method, its not unopened files, its files that can still be read

7966778

move to CloseFile method, to be used when we skip a file based on the filters on constants failing

07c9969

this might work?

cb09fbf

clean up some logic, attempted to map the nested fields but this information is missing?

ce7ecc7

add some fixmes for nested type support

96b8877

undo changes to TryInitializeNextBatch

841e62c

add SKIPPED enum to indicate that the file does not need to be opened

84bf7cb

might want to reinstate this definitely critical piece of code that I accidentally removed

1fae223

Merge branch 'multi_file_reader_finalize_chunk' into multi_file_reader_create_filters

3b8d64c

initialize the adaptive filter and parquet scan filters once per file - per local state

53d3e61

use the constant_map to find the constant for the column

c053ffe

skip optional filters, dynamic filters are often (always?) wrapped in an optional filter, dynamic filters cant be copied correctly, so this circumvents the errors that arise from that

09ea421

comment out the stub I started to work on mapping child fields of structs

0bdef5d

regenerate enum utils

8274c76

on linux, unordered_map's V has to be fully defined when it's used, so we need to wrap our child_mappings in unique_ptrs

b647e10

add missing file

cb5be65

Merge branch 'multi_file_reader_finalize_chunk' into multi_file_reader_create_filters

069ba94

deal with the fact that the method can return null in the remaining places

91606e5

fix

52c63bc

this needs a move, because ConjunctionAndFilter doesn't directly inherit from TableFilter..

88424ae

Merge remote-tracking branch 'upstream/main' into multi_file_reader_create_filters

39162a0

Merge remote-tracking branch 'upstream/main' into multi_file_reader_create_filters

5599914

Mytherin reviewed Mar 26, 2025

src/common/multi_file_reader.cpp (outdated, resolved)

Merge remote-tracking branch 'upstream/main' into multi_file_reader_create_filters

a9ab735

duckdb-draftbot marked this pull request as draft March 26, 2025 15:54

Tishj added 2 commits March 26, 2025 20:22

adjust dynamic filter, it can only be a constant filter, add the optional and dynamic filter conversion logic back for the MultiFileReader

13fd65c

evaluate the optional and dynamic filter for the constant columns, if possible

3d77806

Mytherin marked this pull request as ready for review March 26, 2025 21:55

Mytherin merged commit edeb947 into duckdb:main Mar 27, 2025
46 of 50 checks passed

Mytherin mentioned this pull request Mar 28, 2025

MultiFileReader Rework (part 17) - remove MultiFileReaderData - and move as much as possible out of the file readers #16882

Merged
