8000 MultiFileReader Rework (part 17) - remove `MultiFileReaderData` - and move as much as possible out of the file readers by Mytherin · Pull Request #16882 · duckdb/duckdb · GitHub


Merged: 29 commits into duckdb:main on Mar 29, 2025

Conversation

@Mytherin (Collaborator) commented on Mar 28, 2025

This PR is a follow-up of #16838

The above PR laid the groundwork for removing the cast map and the global column definitions from the file reader. This PR finalizes that work and (almost*) removes the need for the base file reader to know about the existence of the multi file reader and its mysterious constructs.

The MultiFileReaderData is now gone, and everything that used to be there has been moved into the BaseFileReader instead:

class BaseFileReader {
public:
    //! The name of the file we are reading
    string file_name;
    //! (Optionally) The file index (generated by the multi file reader)
    optional_idx file_list_idx;
    //! The set of columns for the current file
    vector<MultiFileColumnDefinition> columns;
    //! The column ids to read from the file
    vector<MultiFileLocalColumnId> column_ids;
    //! The column indexes to read from the file
    vector<ColumnIndex> column_indexes;
    //! The set of table filters (adjusted to local indexes)
    unique_ptr<TableFilterSet> filters;
    //! Expression to execute for a given column (BEFORE executing the filter)
    //! NOTE: this is only set when we have filters - it can be ignored for readers that don't have filter pushdown
    unordered_map<column_t, unique_ptr<Expression>> expression_map;
    //! The final types for various columns - this is ONLY used if UseCastMap() is explicitly enabled
    unordered_map<column_t, LogicalType> cast_map;
};

Table Filters

Table filters pose a problem when pushed down into the scanners because of a potential type mismatch. Table filters are defined in the global type, since that is the type that is visible in the query. As such, we cannot push them down onto the file's columns without converting the filter.

What's more - it is not always possible to convert this filter. This applies to comparisons - but becomes more obvious when expressions are involved in filters - e.g. the filter substring(col, 1, 3) = '100' clearly requires col to be of type VARCHAR - we cannot transform this into a filter on an INTEGER column.

This PR attempts to push down table filters where possible, performing the type conversion in the process. This is generally possible with integer types (e.g. we can convert a filter on type INT64 to a filter on type INT32). This happens in this code snippet:

unique_ptr<TableFilter> local_filter;
if (local_type == global_type) {
    // no conversion required - just copy the filter
    local_filter = global_filter.Copy();
} else {
    // types are different - try to convert
    // first check if we can safely convert (i.e. if the conversion is lossless and would not change the result)
    if (StatisticsPropagator::CanPropagateCast(local_type, global_type)) {
        // if we can convert - try to actually convert
        local_filter = TryCastTableFilter(global_filter, map_entry.mapping, local_type);
    }
}

However, we need to have a fallback when the conversion is not possible. What happens in this case is that we push the transformation expression into a (new) expression_map. This expression must be evaluated in the reader prior to executing the filter:

// add the expression to the expression map - we are now evaluating this inside the reader directly
// we need to set the index of the references inside the expression to 0
SetIndexToZero(*reader_data.expressions[local_id]);
reader.expression_map[filter_idx] = std::move(reader_data.expressions[local_id]);

// reset the expression - since we are evaluating it in the reader we can just reference it
reader_data.expressions[local_id] = make_uniq<BoundReferenceExpression>(global_type, local_id);

While this sounds more complex than the cast map, this approach has several advantages:

  • The table filter transformation is required to take advantage of table filters for e.g. row group pruning. Previously this logic existed in the Parquet reader - this is now generalized to the multi file reader.
  • By using an expression map instead of a cast map we can support more complex expressions - this is not required yet in this PR, but will be going forward (e.g. for struct evolution)
  • The expression_map is only used when filters are enabled - this is currently only the case for the Parquet reader. All other readers are simplified and don't need to worry about this at all.

Cast Map

The cast_map has not been removed, but is not used by default. It still exists, and readers can opt in to using it by setting the UseCastMap() flag. This is only used by the CSV reader.

The reason this is relevant for the CSV reader is that, unlike e.g. Parquet, CSV files don't have "strict" types - so we actually want types to be pushed down into the CSV reader directly. This prevents us from inferring the wrong types and throwing errors unnecessarily. In addition, the CSV reader can directly perform casts while reading, which makes this faster.

Re-organization

In order to make reviewing as hard as possible - I've also reorganized the folders and class names of the various multi file reader classes. For many classes, I've made the prefix MultiFile instead of MultiFileReader (e.g. MultiFileReaderOptions -> MultiFileOptions). I've also moved the classes to a separate folder (common/multi_file) and split up files. The column mapping logic has been moved from the MultiFileReader to a separate class - the MultiFileColumnMapper.

MultiFileReaderData is dead. Long live MultiFileReaderData.

The previously named MultiFileReaderData is gone - but the unfortunately named MultiFileFileReaderData has now been renamed to MultiFileReaderData.

Mytherin added 28 commits March 27, 2025 12:08
@duckdb-draftbot duckdb-draftbot marked this pull request as draft March 28, 2025 12:44
@Mytherin Mytherin marked this pull request as ready for review March 28, 2025 13:08
@duckdb-draftbot duckdb-draftbot marked this pull request as draft March 28, 2025 17:33
@Mytherin Mytherin marked this pull request as ready for review March 28, 2025 17:33
@Mytherin Mytherin merged commit 4739c3a into duckdb:main Mar 29, 2025
50 checks passed
@Mytherin Mytherin deleted the multifilereader branch April 2, 2025 09:23
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 15, 2025
MultiFileReader Rework (part 17) - remove `MultiFileReaderData` - and move as much as possible out of the file readers (duckdb/duckdb#16882)