Parquet Reader: emit partition stats for any files that have cached metadata, and implement `ListFilesExtended` that adds extra info to files globbed #17344
Merged
…s extra information through the glob (last_modified_time, etc)
CC @Tishj @samansmink @Tmonster I think the SET parquet_metadata_cache=true;
SELECT COUNT(*) FROM table_with_deletions;
SELECT COUNT(*) FROM table_with_deletions; |
Mytherin added a commit that referenced this pull request on May 5, 2025
#17365) Follow-up from #17344 When emitting partition stats based on cached Parquet files, we check the `OpenFileInfo` for a boolean `has_deletes`. If that is set to true - we skip emitting partition stats. This should allow the partition stats from cached Parquet files to be used also for Lakehouse formats when there are no deletes for a given file. CC @Tishj @samansmink @Tmonster
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request on May 18, 2025
Parquet Reader: emit partition stats for any files that have cached metadata, and implement `ListFilesExtended` that adds extra info to files globbed (duckdb/duckdb#17344)
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request on May 19, 2025
Parquet Reader: emit partition stats for any files that have cached metadata, and implement `ListFilesExtended` that adds extra info to files globbed (duckdb/duckdb#17344)
This PR modifies the Parquet reader to support `GetPartitionStats` when either (1) reading a single file, or (2) all files that are read have their metadata cached. This allows us to emit partition data from Parquet files, also when globbing many files.

In order to avoid the cache invalidation from being expensive during this check, we perform the cache invalidation only using the `OpenFileInfo` objects that we have received from the multi file list, i.e. no additional file system calls are performed. This builds on the work in e.g. duckdb/duckdb-httpfs#45, which allows us to find the relevant information for performing cache invalidation during globbing and to push that through all the relevant layers.

ListFilesExtended
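As a rough, non-authoritative illustration of the opt-in listing pattern covered in this section, the sketch below shows a file system interface with an older name-only callback, an optional extended callback, and a helper that falls back to the older one. `SketchFileInfo` (a stand-in for `OpenFileInfo`), `SketchFileSystem`, `ListAll`, the two example file systems, and all signatures are assumptions made for illustration; they are not DuckDB's actual API.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for OpenFileInfo: a path plus optional extra metadata.
struct SketchFileInfo {
    std::string path;
    bool has_extended_info;     // true only if the listing supplied the fields below
    std::uint64_t file_size;    // bytes
    std::int64_t last_modified; // seconds since the Unix epoch
};

// Hypothetical file system interface; signatures are assumed, not DuckDB's.
class SketchFileSystem {
public:
    virtual ~SketchFileSystem() {}

    // Older callback: reports file names only.
    virtual void ListFiles(const std::string &directory,
                           const std::function<void(const std::string &)> &callback) = 0;

    // File systems opt in to the extended callback by overriding this to return true.
    virtual bool SupportsListFilesExtended() const { return false; }

    // Extended callback: only invoked when SupportsListFilesExtended() returns true.
    virtual void ListFilesExtended(const std::string &,
                                   const std::function<void(const SketchFileInfo &)> &) {}

    // Caller-facing helper: always yields SketchFileInfo, whichever callback exists.
    std::vector<SketchFileInfo> ListAll(const std::string &directory) {
        std::vector<SketchFileInfo> result;
        if (SupportsListFilesExtended()) {
            ListFilesExtended(directory,
                              [&result](const SketchFileInfo &info) { result.push_back(info); });
        } else {
            // Fallback: the older method gives us names but no size or timestamp.
            ListFiles(directory, [&result](const std::string &path) {
                result.push_back(SketchFileInfo{path, false, 0, 0});
            });
        }
        return result;
    }
};

// Example file system that only implements the older, name-only listing.
class LegacyOnlyFs : public SketchFileSystem {
public:
    void ListFiles(const std::string &directory,
                   const std::function<void(const std::string &)> &callback) override {
        callback(directory + "/a.parquet");
        callback(directory + "/b.parquet");
    }
};

// Example file system that opts in and reports size and last-modified time.
class ExtendedFs : public LegacyOnlyFs {
public:
    bool SupportsListFilesExtended() const override { return true; }

    void ListFilesExtended(const std::string &directory,
                           const std::function<void(const SketchFileInfo &)> &callback) override {
        // Made-up values for illustration only.
        callback(SketchFileInfo{directory + "/a.parquet", true, 1024, 1746400000});
    }
};
```

With this shape, callers only ever deal with `ListAll`, and a file system that has not been updated keeps working; it simply produces entries without size or timestamp information.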
In order to make this work also for local file systems, a new `ListFiles` method is introduced that returns `OpenFileInfo` objects instead of `string` objects.

For backwards compatibility with other file systems, implementing this method is optional: a file system provides it through the `ListFilesExtended` method and overrides `SupportsListFilesExtended` to return true. This allows file systems to implement either the extended callback or the older method.

This PR then implements the `ListFilesExtended` callback for the `LocalFileSystem`:

- On POSIX systems, a `stat` call per file was already made to find out if it was a directory; we now also read the file size (`st_size`) and last modified time (`st_mtime`) from the result.
- On Windows, the `WIN32_FIND_DATAW` struct for each file contains the relevant information.

Performance
`GetPartitionStats` is currently only used for `COUNT(*)` (see #15301), although there are plans to have this used in more scenarios.

Single File COUNT(*)
When reading a single file, we always emit `GetPartitionStats`, even without metadata caching enabled, since we have already read the metadata as part of the initial schema detection step. With these changes, `COUNT(*)` is always faster in this scenario.

Cached COUNT(*)
When metadata caching is enabled, the cached metadata is used to accelerate `COUNT(*)`.
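The cached path can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's implementation: it assumes a cached entry is considered valid when the file size and last-modified time carried through the glob match the values stored with the cached metadata, and it skips the fast path for any file flagged `has_deletes` (as described in the follow-up #17365). `GlobbedFile`, `CachedMetadata`, `MetadataCache`, and `TryCountFromCache` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the per-file information carried through the glob.
struct GlobbedFile {
    std::string path;
    bool has_extended_info;     // false if the listing gave us no size/timestamp
    std::uint64_t file_size;
    std::int64_t last_modified;
    bool has_deletes;           // set by e.g. lakehouse formats when rows were deleted
};

// Hypothetical cache entry: what was observed when the metadata was cached.
struct CachedMetadata {
    std::uint64_t file_size;
    std::int64_t last_modified;
    std::uint64_t row_count;
};

typedef std::unordered_map<std::string, CachedMetadata> MetadataCache;

// Tries to answer a row count purely from cached metadata.
// Returns false (leaving total_rows untouched) as soon as any file cannot be
// validated from the globbed information alone; no file system call is made.
bool TryCountFromCache(const std::vector<GlobbedFile> &files, const MetadataCache &cache,
                       std::uint64_t &total_rows) {
    std::uint64_t total = 0;
    for (const GlobbedFile &file : files) {
        if (file.has_deletes) {
            return false; // cached row count would overcount
        }
        if (!file.has_extended_info) {
            return false; // cannot validate without an extra file system call
        }
        MetadataCache::const_iterator entry = cache.find(file.path);
        if (entry == cache.end()) {
            return false; // metadata for this file is not cached
        }
        const CachedMetadata &cached = entry->second;
        if (cached.file_size != file.file_size || cached.last_modified != file.last_modified) {
            return false; // file changed since it was cached
        }
        total += cached.row_count;
    }
    total_rows = total;
    return true;
}
```

When the function returns false, the caller would fall back to the regular scan, so a stale or missing cache entry only costs the fast path, never correctness.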