-
Notifications
You must be signed in to change notification settings - Fork 2.4k
MultiFileReader Rework (part 17) - remove MultiFileReaderData
- and move as much as possible out of the file readers
#16882
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
…nstead of MultiFileReader
krlmlr
added a commit
to duckdb/duckdb-r
that referenced
this pull request
May 15, 2025
MultiFileReader Rework (part 17) - remove `MultiFileReaderData` - and move as much as possible out of the file readers (duckdb/duckdb#16882)
krlmlr
added a commit
to duckdb/duckdb-r
that referenced
this pull request
May 15, 2025
MultiFileReader Rework (part 17) - remove `MultiFileReaderData` - and move as much as possible out of the file readers (duckdb/duckdb#16882)
krlmlr
added a commit
to duckdb/duckdb-r
that referenced
this pull request
May 16, 2025
MultiFileReader Rework (part 17) - remove `MultiFileReaderData` - and move as much as possible out of the file readers (duckdb/duckdb#16882)
krlmlr
added a commit
to duckdb/duckdb-r
that referenced
this pull request
May 16, 2025
MultiFileReader Rework (part 17) - remove `MultiFileReaderData` - and move as much as possible out of the file readers (duckdb/duckdb#16882)
krlmlr
added a commit
to duckdb/duckdb-r
that referenced
this pull request
May 17, 2025
MultiFileReader Rework (part 17) - remove `MultiFileReaderData` - and move as much as possible out of the file readers (duckdb/duckdb#16882)
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This PR is a follow-up of #16838
The above PR laid the ground work for removing the cast map and global column definitions from the file reader. This PR finalizes that work and (almost*) removes the need for the base file reader to know about the existence of the multi file reader and its mysterious constructs.
The
MultiFileReaderData
is now gone, and everything that used to be there has been moved into theBaseFileReader
instead:Table Filters
Table filters form a problem when pushing down into the scanners, because of a potential type mismatch. Table filters are defined in the global type, since that is the type that is visible in the query. As such, we cannot push them down into the files columns without converting the filter.
What's more - it is not always possible to convert this filter. This applies to comparisons - but becomes more obvious when expressions are involved in filters - e.g. the filter
substring(col, 1, 3) = '100'
clearly requirescol
to be of typeVARCHAR
- we cannot transform this into a filter on anINTEGER
column.This PR attempts to push down table filters when possible and performing the type conversion. This is generally possible with integer types (i.e. we can convert a filter on type
INT64
to a filter on typeINT32
). This happens in this code snippet:However, we need to have a fallback when the conversion is not possible. What happens in this case is that we push the transformation expression into a (new) expression_map. This expression must be evaluated in the reader prior to executing the filter:
While this sounds more complex than the cast map - this approach has several advantages:
Cast Map
The cast_map has not been removed, but is not used by default. It still exists, and readers can opt in to using it by setting the
UseCastMap()
flag. This is only used by the CSV reader.The reason this is relevant for the CSV reader is that, unlike e.g. Parquet, CSV files don't have "strict" types - so we actually want types to be pushed down into the CSV reader directly. This prevents us from inferring the wrong types and throwing errors unnecessarily. In addition, the CSV reader can also directly perform casts while reading which makes this faster.
Re-organization
In order to make reviewing as hard as possible - I've also reorganized the folders and class names of the various multi file reader classes. For many classes, I've made the prefix
MultiFile
instead ofMultiFileReader
(e.g.MultiFileReaderOptions
->MultiFileOptions
). I've also moved the classes to a separate folder (common/multi_file
) and split up files. The column mapping logic has been moved from theMultiFileReader
to a separate class - theMultiFileColumnMapper
.MultiFileReaderData is dead. Long live MultiFileReaderData.
The previously named
MultiFileReaderData
is gone - but the unfortunately namedMultiFileFileReaderData
has now been renamed toMultiFileReaderData
.