8000 [MultiFileReader] Create "local" filters to hand to underlying readers by Tishj · Pull Request #16838 · duckdb/duckdb · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

[MultiFileReader] Create "local" filters to hand to underlying readers #16838

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 30 commits into from
Mar 27, 2025

Conversation

Tishj
Copy link
Contributor
@Tishj Tishj commented Mar 26, 2025

This PR is a follow up to #16630

The aim of this PR is to not require the underlying reader to make use of complicated mappings created in the MultiFileReaderData to correctly handle filters pushed into the reader.

Previously the TableFilterSet handed to the underlying reader contained "global" filters, i.e the column_ids referenced the "global" schema, which is not guaranteed to match the schema of the file that the reader is reading.

To convert these into the "local" schema (that of the file that is being read), the filter_map had to be used, then the filter entry had to be checked to see if the filter points to a column that is present in the file being read, if it's not then the filter points to the entry in the constant_map that should be used instead.


After this PR, the underlying reader is handed "local" filters, that already contain column_ids that reference the "local" schema.
No filter will be found that references columns that are not present in the file, as they have been evaluated beforehand.

These filters (the ones targeting columns that are not in the file) can be safely removed because if the filter doesn't match we skip the file entirely, it does not have to be scanned. If the filter does match, it can be safely ignored by the reader.


Open questions / Areas for future work discovered

Copied child_indexes issue

The child_indexes of the global ColumnIndex are copied 1-to-1 in the creation of the mapping still:

			//! FIXME: local fields are not guaranteed to match with the global fields for this struct
			local_index = ColumnIndex(local_id.GetId(), global_id.GetChildIndexes());
		}

As the fixme says, this probably shouldn't happen, because schema evolution could create a divide between the global and local fields for a given struct.

StructFilter schema evolution

The StructFilter besides a child_idx (which suffers from the issue mentioned above) also has a child_name.
Through schema evolution the field could be renamed, this information should probably be added to the MultiFileReaderColumnDefinition to properly handle renames of struct fields.
Or that information is already present, through the LogicalType attached to the column definition, but it needs to be used properly.

cast_map casts

The cast_map is still present, and necessary to be respected by the underlying reader, because delaying this cast would complicate the filtering logic.

If we look at the following flow:
MFR <-> READER <-> FILE

Taking the parquet scanner as an example;
Currently, the cast_map is applied in the translation from the FILE to the READER, through a CastColumnReader.
This reader isn't anything special, it just applies a DefaultCastAs to the vector read from the reader it wraps, this can easily be replaced by a BoundCastExpression that is applied in the FinalizeChunk.

The problem with moving this cast up, is that it would be applied in the translation from READER to the MFR.
This is a problem because the filter pushdown is done inside the READER (after the cast is performed), and so the filters are using (and expecting) the "global" type, the target of the cast.

To move the cast up into the translation from READER to the MFR the filters that are sent to the reader would need to be changed to work on the "local" type instead, which is not an easy refactor.

Tishj added 27 commits March 19, 2025 13:14
…to be evaluated here, skipping the file in its entirety if the filter is false
… an optional filter, dynamic filters cant be copied correctly, so this circumvents the errors that arise from that
…o we need to wrap our child_mappings in unique_ptrs
Copy link
Collaborator
@Mytherin Mytherin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! LGTM - one minor comment

@duckdb-draftbot duckdb-draftbot marked this pull request as draft March 26, 2025 15:54
@Mytherin Mytherin marked this pull request as ready for review March 26, 2025 21:55
@Mytherin Mytherin merged commit edeb947 into duckdb:main Mar 27, 2025
46 of 50 checks passed
@Mytherin
Copy link
Collaborator

Thanks! LGTM - failures are unrelated

Mytherin added a commit that referenced this pull request Mar 29, 2025
… move as much as possible out of the file readers (#16882)

This PR is a follow-up of #16838

The above PR laid the ground work for removing the cast map and global
column definitions from the file reader. This PR finalizes that work and
(almost*) removes the need for the base file reader to know about the
existence of the multi file reader and its mysterious constructs.


The `MultiFileReaderData` is now gone, and everything that used to be
there has been moved into the `BaseFileReader` instead:

```cpp
class BaseFileReader {
public:
    //! The name of the file we are reading
    string file_name;
    //! (Optionally) The file index (generated by the multi file reader)
    optional_idx file_list_idx;
    //! The set of columns for the current file
    vector<MultiFileColumnDefinition> columns;
    //! The column ids to read from the file
    MultiFileLocalColumnIds<MultiFileLocalColumnId> column_ids;
    //! The column indexes to read from the file
    vector<ColumnIndex> column_indexes;
    //! The set of table filters (adjusted to local indexes)
    unique_ptr<TableFilterSet> filters;
    //! Expression to execute for a given column (BEFORE executing the filter)
    //! NOTE: this is only set when we have filters - it can be ignored for readers that don't have filter pushdown
    unordered_map<column_t, unique_ptr<Expression>> expression_map;
    //! The final types for various expressions - this is ONLY used if UseCastMap() is explicitly enabled
    unordered_map<column_t, LogicalType> cast_map;
};
```

### Table Filters

Table filters form a problem when pushing down into the scanners,
because of a potential type mismatch. Table filters are defined in the
*global* type, since that is the type that is visible in the query. As
such, we cannot push them down into the files columns without converting
the filter.

What's more - it is not always possible to convert this filter. This
applies to comparisons - but becomes more obvious when expressions are
involved in filters - e.g. the filter `substring(col, 1, 3) = '100'`
clearly requires `col` to be of type `VARCHAR` - we cannot transform
this into a filter on an `INTEGER` column.

This PR attempts to push down table filters when possible and performing
the type conversion. This is generally possible with integer types (i.e.
we can convert a filter on type `INT64` to a filter on type `INT32`).
This happens in this code snippet:

```cpp
unique_ptr<TableFilter> local_filter;
if (local_type == global_type) {
    // no conversion required - just copy the filter
    local_filter = global_filter.Copy();
} else {
    // types are different - try to convert
    // first check if we can safely convert (i.e. if the conversion is lossless would not change the result)
    if (StatisticsPropagator::CanPropagateCast(local_type, global_type)) {
        // if we can convert - try to actually convert
        local_filter = TryCastTableFilter(global_filter, map_entry.mapping, local_type);
    }
}
```

However, we need to have a fallback when the conversion is not possible.
What happens in this case is that we push the transformation
*expression* into a (new) **expression_map**. This expression must be
evaluated in the reader prior to executing the filter:

```cpp
// add the expression to the expression map - we are now evaluating this inside the reader directly
// we need to set the index of the references inside the expression to 0
SetIndexToZero(*reader_data.expressions[local_id]);
reader.expression_map[filter_idx] = std::move(reader_data.expressions[local_id]);

// reset the expression - since we are evaluating it in the reader we can just reference it
reader_data.expressions[local_id] = make_uniq<BoundReferenceExpression>(global_type, local_id);
```

While this sounds more complex than the cast map - this approach has
several advantages:

* The table filter transformation is required to take advantage of table
filters for e.g. row group pruning. Previously this logic existed in the
Parquet reader - this is now generalized to the multi file reader.
* By using an expression map instead of a cast map we can support more
complex expressions - this is not required yet in this PR but will be
required going forward (for e.g. struct evolution)
* The expression_map is only used when filters are enabled - this is
currently only the case for the Parquet reader. All other readers are
simplified and don't need to worry about this at all.

### Cast Map
 
The cast_map has **not** been removed, but is not used by default. It
still exists, and readers can opt in to using it by setting the
`UseCastMap()` flag. This is only used by the CSV reader.

The reason this is relevant for the CSV reader is that, unlike e.g.
Parquet, CSV files don't have "strict" types - so we actually want types
to be pushed down into the CSV reader directly. This prevents us from
inferring the wrong types and throwing errors unnecessarily. In
addition, the CSV reader can also directly perform casts while reading
which makes this faster.

### Re-organization

In order to make reviewing as hard as possible - I've also reorganized
the folders and class names of the various multi file reader classes.
For many classes, I've made the prefix `MultiFile` instead of
`MultiFileReader` (e.g. `MultiFileReaderOptions` -> `MultiFileOptions`).
I've also moved the classes to a separate folder (`common/multi_file`)
and split up files. The column mapping logic has been moved from the
`MultiFileReader` to a separate class - the `MultiFileColumnMapper`.

### MultiFileReaderData is dead. Long live MultiFileReaderData.

The previously named `MultiFileReaderData` is gone - but the
unfortunately named `MultiFileFileReaderData` has now been renamed to
`MultiFileReaderData`.
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 15, 2025
[MultiFileReader] Create "local" filters to hand to underlying readers (duckdb/duckdb#16838)
 Fix Python docstrings for unique (duckdb/duckdb#16845)
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 15, 2025
[MultiFileReader] Create "local" filters to hand to underlying readers (duckdb/duckdb#16838)
 Fix Python docstrings for unique (duckdb/duckdb#16845)
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 16, 2025
[MultiFileReader] Create "local" filters to hand to underlying readers (duckdb/duckdb#16838)
 Fix Python docstrings for unique (duckdb/duckdb#16845)
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 16, 2025
[MultiFileReader] Create "local" filters to hand to underlying readers (duckdb/duckdb#16838)
 Fix Python docstrings for unique (duckdb/duckdb#16845)
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 16, 2025
[MultiFileReader] Create "local" filters to hand to underlying readers (duckdb/duckdb#16838)
 Fix Python docstrings for unique (duckdb/duckdb#16845)
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 17, 2025
[MultiFileReader] Create "local" filters to hand to underlying readers (duckdb/duckdb#16838)
 Fix Python docstrings for unique (duckdb/duckdb#16845)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3BAF
Development

Successfully merging this pull request may close these issues.

2 participants
0