-
Notifications
You must be signed in to change notification settings - Fork 2.4k
[MultiFileReader] Create "local" filters to hand to underlying readers #16838
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Mytherin
merged 30 commits into
duckdb:main
from
Tishj:multi_file_reader_create_filters
Mar 27, 2025
Merged
[MultiFileReader] Create "local" filters to hand to underlying readers #16838
Mytherin
merged 30 commits into
duckdb:main
from
Tishj:multi_file_reader_create_filters
Mar 27, 2025
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
…to be evaluated here, skipping the file in its entirety if the filter is false
…e multi file reader function
… filters on constants failing
…rmation is missing?
… accidentally removed
… - per local state
… an optional filter, dynamic filters cant be copied correctly, so this circumvents the errors that arise from that
…o we need to wrap our child_mappings in unique_ptrs
…rit from TableFilter..
Mytherin
reviewed
Mar 26, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! LGTM - one minor comment
…onal and dynamic filter conversion logic back for the MultiFileReader
Thanks! LGTM - failures are unrelated |
Mytherin
added a commit
that referenced
this pull request
Mar 29, 2025
… move as much as possible out of the file readers (#16882) This PR is a follow-up of #16838 The above PR laid the ground work for removing the cast map and global column definitions from the file reader. This PR finalizes that work and (almost*) removes the need for the base file reader to know about the existence of the multi file reader and its mysterious constructs. The `MultiFileReaderData` is now gone, and everything that used to be there has been moved into the `BaseFileReader` instead: ```cpp class BaseFileReader { public: //! The name of the file we are reading string file_name; //! (Optionally) The file index (generated by the multi file reader) optional_idx file_list_idx; //! The set of columns for the current file vector<MultiFileColumnDefinition> columns; //! The column ids to read from the file MultiFileLocalColumnIds<MultiFileLocalColumnId> column_ids; //! The column indexes to read from the file vector<ColumnIndex> column_indexes; //! The set of table filters (adjusted to local indexes) unique_ptr<TableFilterSet> filters; //! Expression to execute for a given column (BEFORE executing the filter) //! NOTE: this is only set when we have filters - it can be ignored for readers that don't have filter pushdown unordered_map<column_t, unique_ptr<Expression>> expression_map; //! The final types for various expressions - this is ONLY used if UseCastMap() is explicitly enabled unordered_map<column_t, LogicalType> cast_map; }; ``` ### Table Filters Table filters form a problem when pushing down into the scanners, because of a potential type mismatch. Table filters are defined in the *global* type, since that is the type that is visible in the query. As such, we cannot push them down into the files columns without converting the filter. What's more - it is not always possible to convert this filter. This applies to comparisons - but becomes more obvious when expressions are involved in filters - e.g. the filter `substring(col, 1, 3) = '100'` clearly requires `col` to be of type `VARCHAR` - we cannot transform this into a filter on an `INTEGER` column. This PR attempts to push down table filters when possible and performing the type conversion. This is generally possible with integer types (i.e. we can convert a filter on type `INT64` to a filter on type `INT32`). This happens in this code snippet: ```cpp unique_ptr<TableFilter> local_filter; if (local_type == global_type) { // no conversion required - just copy the filter local_filter = global_filter.Copy(); } else { // types are different - try to convert // first check if we can safely convert (i.e. if the conversion is lossless would not change the result) if (StatisticsPropagator::CanPropagateCast(local_type, global_type)) { // if we can convert - try to actually convert local_filter = TryCastTableFilter(global_filter, map_entry.mapping, local_type); } } ``` However, we need to have a fallback when the conversion is not possible. What happens in this case is that we push the transformation *expression* into a (new) **expression_map**. This expression must be evaluated in the reader prior to executing the filter: ```cpp // add the expression to the expression map - we are now evaluating this inside the reader directly // we need to set the index of the references inside the expression to 0 SetIndexToZero(*reader_data.expressions[local_id]); reader.expression_map[filter_idx] = std::move(reader_data.expressions[local_id]); // reset the expression - since we are evaluating it in the reader we can just reference it reader_data.expressions[local_id] = make_uniq<BoundReferenceExpression>(global_type, local_id); ``` While this sounds more complex than the cast map - this approach has several advantages: * The table filter transformation is required to take advantage of table filters for e.g. row group pruning. Previously this logic existed in the Parquet reader - this is now generalized to the multi file reader. * By using an expression map instead of a cast map we can support more complex expressions - this is not required yet in this PR but will be required going forward (for e.g. struct evolution) * The expression_map is only used when filters are enabled - this is currently only the case for the Parquet reader. All other readers are simplified and don't need to worry about this at all. ### Cast Map The cast_map has **not** been removed, but is not used by default. It still exists, and readers can opt in to using it by setting the `UseCastMap()` flag. This is only used by the CSV reader. The reason this is relevant for the CSV reader is that, unlike e.g. Parquet, CSV files don't have "strict" types - so we actually want types to be pushed down into the CSV reader directly. This prevents us from inferring the wrong types and throwing errors unnecessarily. In addition, the CSV reader can also directly perform casts while reading which makes this faster. ### Re-organization In order to make reviewing as hard as possible - I've also reorganized the folders and class names of the various multi file reader classes. For many classes, I've made the prefix `MultiFile` instead of `MultiFileReader` (e.g. `MultiFileReaderOptions` -> `MultiFileOptions`). I've also moved the classes to a separate folder (`common/multi_file`) and split up files. The column mapping logic has been moved from the `MultiFileReader` to a separate class - the `MultiFileColumnMapper`. ### MultiFileReaderData is dead. Long live MultiFileReaderData. The previously named `MultiFileReaderData` is gone - but the unfortunately named `MultiFileFileReaderData` has now been renamed to `MultiFileReaderData`.
krlmlr
added a commit
to duckdb/duckdb-r
that referenced
this pull request
May 15, 2025
[MultiFileReader] Create "local" filters to hand to underlying readers (duckdb/duckdb#16838) Fix Python docstrings for unique (duckdb/duckdb#16845)
krlmlr
added a commit
to duckdb/duckdb-r
that referenced
this pull request
May 15, 2025
[MultiFileReader] Create "local" filters to hand to underlying readers (duckdb/duckdb#16838) Fix Python docstrings for unique (duckdb/duckdb#16845)
krlmlr
added a commit
to duckdb/duckdb-r
that referenced
this pull request
May 16, 2025
[MultiFileReader] Create "local" filters to hand to underlying readers (duckdb/duckdb#16838) Fix Python docstrings for unique (duckdb/duckdb#16845)
krlmlr
added a commit
to duckdb/duckdb-r
that referenced
this pull request
May 16, 2025
[MultiFileReader] Create "local" filters to hand to underlying readers (duckdb/duckdb#16838) Fix Python docstrings for unique (duckdb/duckdb#16845)
krlmlr
added a commit
to duckdb/duckdb-r
that referenced
this pull request
May 16, 2025
[MultiFileReader] Create "local" filters to hand to underlying readers (duckdb/duckdb#16838) Fix Python docstrings for unique (duckdb/duckdb#16845)
krlmlr
added a commit
to duckdb/duckdb-r
that referenced
this pull request
May 17, 2025
[MultiFileReader] Create "local" filters to hand to underlying readers (duckdb/duckdb#16838) Fix Python docstrings for unique (duckdb/duckdb#16845)
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This PR is a follow up to #16630
The aim of this PR is to not require the underlying reader to make use of complicated mappings created in the
MultiFileReaderData
to correctly handle filters pushed into the reader.Previously the
TableFilterSet
handed to the underlying reader contained "global" filters, i.e thecolumn_ids
referenced the "global" schema, which is not guaranteed to match the schema of the file that the reader is reading.To convert these into the "local" schema (that of the file that is being read), the
filter_map
had to be used, then the filter entry had to be checked to see if the filter points to a column that is present in the file being read, if it's not then the filter points to the entry in theconstant_map
that should be used instead.After this PR, the underlying reader is handed "local" filters, that already contain
column_ids
that reference the "local" schema.No filter will be found that references columns that are not present in the file, as they have been evaluated beforehand.
These filters (the ones targeting columns that are not in the file) can be safely removed because if the filter doesn't match we skip the file entirely, it does not have to be scanned. If the filter does match, it can be safely ignored by the reader.
Open questions / Areas for future work discovered
Copied
child_indexes
issueThe
child_indexes
of the globalColumnIndex
are copied 1-to-1 in the creation of the mapping still://! FIXME: local fields are not guaranteed to match with the global fields for this struct local_index = ColumnIndex(local_id.GetId(), global_id.GetChildIndexes()); }
As the fixme says, this probably shouldn't happen, because schema evolution could create a divide between the global and local fields for a given struct.
StructFilter
schema evolutionThe StructFilter besides a
child_idx
(which suffers from the issue mentioned above) also has achild_name
.Through schema evolution the field could be renamed, this information should probably be added to the
MultiFileReaderColumnDefinition
to properly handle renames of struct fields.Or that information is already present, through the LogicalType attached to the column definition, but it needs to be used properly.
cast_map
castsThe
cast_map
is still present, and necessary to be respected by the underlying reader, because delaying this cast would complicate the filtering logic.If we look at the following flow:
MFR <-> READER <-> FILE
Taking the parquet scanner as an example;
Currently, the
cast_map
is applied in the translation from theFILE
to theREADER
, through a CastColumnReader.This reader isn't anything special, it just applies a DefaultCastAs to the vector read from the reader it wraps, this can easily be replaced by a BoundCastExpression that is applied in the
FinalizeChunk
.The problem with moving this cast up, is that it would be applied in the translation from
READER
to theMFR
.This is a problem because the filter pushdown is done inside the READER (after the cast is performed), and so the filters are using (and expecting) the "global" type, the target of the cast.
To move the cast up into the translation from
READER
to theMFR
the filters that are sent to the reader would need to be changed to work on the "local" type instead, which is not an easy refactor.