Parquet Reader: emit partition stats for any files that have cached metadata, and implement `ListFilesExtended` that adds extra info to files globbed by Mytherin · Pull Request #17344 · duckdb/duckdb · GitHub

Parquet Reader: emit partition stats for any files that have cached metadata, and implement ListFilesExtended that adds extra info to files globbed #17344

Merged 10 commits on May 3, 2025


Mytherin (Collaborator) commented May 3, 2025

This PR modifies the Parquet reader to support GetPartitionStats when either (1) a single file is read, or (2) all files that are read have their metadata cached. This allows us to emit partition data from Parquet files even when globbing many files.
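The all-or-nothing rule for the cached case can be pictured with a minimal sketch. The types and function names below (`CachedMetadata`, `MetadataCache`, `TryGetPartitionRowCounts`, `CountStar`) are hypothetical stand-ins for illustration and are not DuckDB's actual classes; the sketch only assumes that each cached footer exposes a row count.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-ins for illustration; not DuckDB's actual types.
struct CachedMetadata {
    uint64_t num_rows; // row count taken from the cached Parquet footer
};
using MetadataCache = std::unordered_map<std::string, CachedMetadata>;

// Fills `counts` with one row count per file and returns true only if every
// file has cached metadata. On any miss, `counts` is cleared and false is
// returned, so the caller falls back to a regular scan.
bool TryGetPartitionRowCounts(const std::vector<std::string> &files, const MetadataCache &cache,
                              std::vector<uint64_t> &counts) {
    counts.clear();
    for (const auto &file : files) {
        auto entry = cache.find(file);
        if (entry == cache.end()) {
            counts.clear();
            return false; // a single miss disables the optimization
        }
        counts.push_back(entry->second.num_rows);
    }
    assert(counts.size() == files.size());
    return true;
}

// COUNT(*) is then the sum of the per-file row counts.
uint64_t CountStar(const std::vector<uint64_t> &counts) {
    uint64_t total = 0;
    for (auto count : counts) {
        total += count;
    }
    return total;
}
```

The single-file case is not modeled here: there the metadata has already been read during schema detection, so no cache lookup is needed.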

To keep cache invalidation cheap during this check, we perform it using only the OpenFileInfo objects that we have received from the multi file list, i.e. no additional file system calls are performed. This builds on the work in e.g. duckdb/duckdb-httpfs#45, which allows us to find the information needed for cache invalidation during globbing and push it through all the relevant layers.
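As a rough illustration of validating a cache entry without touching the file system, the sketch below compares what the listing reported against what was stored with the cached footer. It assumes the comparison is based on file size and last modified time; the struct and function names (`ListedFileInfo`, `CachedFooter`, `CacheEntryIsValid`) are hypothetical and do not reflect the actual fields of OpenFileInfo.

```cpp
#include <cassert>
#include <cstdint>
#include <ctime>
#include <string>

// Hypothetical, simplified stand-in for the info attached to a globbed file.
struct ListedFileInfo {
    std::string path;
    bool has_size_and_mtime; // false if the file system did not report them
    uint64_t file_size;
    std::time_t last_modified;
};

// Hypothetical cached footer, remembering the file state it was read from.
struct CachedFooter {
    uint64_t file_size;
    std::time_t last_modified;
    uint64_t num_rows;
};

// The entry is trusted only if the listing carried size and mtime and both
// match the cached values. No stat or open call is made here.
bool CacheEntryIsValid(const ListedFileInfo &listed, const CachedFooter &cached) {
    if (!listed.has_size_and_mtime) {
        return false; // nothing to compare against, so do not trust the cache
    }
    return listed.file_size == cached.file_size && listed.last_modified == cached.last_modified;
}
```

Rejecting entries when the listing carries no extra info is the conservative choice in this sketch: the optimization is simply skipped rather than risking stale statistics.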

ListFilesExtended

In order to make this work also for local file systems, a new ListFiles method is introduced that returns OpenFileInfo objects instead of string objects:

bool ListFiles(const string &directory, const std::function<void(OpenFileInfo &info)> &callback, 
               optional_ptr<FileOpener> opener = nullptr);

For backwards compatibility with other file systems, implementing this method is optional: a file system opts in by implementing ListFilesExtended and overriding SupportsListFilesExtended to return true. This allows file systems to implement either the extended callback or the older method.

virtual bool ListFilesExtended(const string &directory,
                               const std::function<void(OpenFileInfo &info)> &callback,
                               optional_ptr<FileOpener> opener);
virtual bool SupportsListFilesExtended() const;
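One way to picture the opt-in dispatch is the sketch below. It is a simplified model under stated assumptions, not DuckDB's FileSystem class: `FileInfo`, `SketchFileSystem`, `ListFilesLegacy`, and the two example subclasses are hypothetical, the legacy callback is reduced to a path string, and the opener parameter is omitted.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>

// Hypothetical, simplified stand-in for OpenFileInfo.
struct FileInfo {
    std::string path;
    bool has_extended_info;
    uint64_t file_size;
};

using ExtendedCallback = std::function<void(FileInfo &)>;
using LegacyCallback = std::function<void(const std::string &)>;

class SketchFileSystem {
public:
    virtual ~SketchFileSystem() = default;

    // New-style entry point: always hands FileInfo objects to the caller.
    bool ListFiles(const std::string &directory, const ExtendedCallback &callback) {
        if (SupportsListFilesExtended()) {
            return ListFilesExtended(directory, callback);
        }
        // Fallback: wrap each path from the older method in a bare FileInfo.
        return ListFilesLegacy(directory, [&](const std::string &path) {
            FileInfo info{path, false, 0};
            callback(info);
        });
    }

    virtual bool SupportsListFilesExtended() const { return false; }
    virtual bool ListFilesExtended(const std::string &, const ExtendedCallback &) { return false; }
    virtual bool ListFilesLegacy(const std::string &, const LegacyCallback &) { return false; }
};

// A file system that only knows the older, string-based listing.
class LegacyOnlyFileSystem : public SketchFileSystem {
public:
    bool ListFilesLegacy(const std::string &directory, const LegacyCallback &callback) override {
        callback(directory + "/legacy.parquet");
        return true;
    }
};

// A file system that opts in and reports extra info per file.
class ExtendedFileSystem : public SketchFileSystem {
public:
    bool SupportsListFilesExtended() const override { return true; }
    bool ListFilesExtended(const std::string &directory, const ExtendedCallback &callback) override {
        FileInfo info{directory + "/extended.parquet", true, 1024};
        callback(info);
        return true;
    }
};
```

In this model, callers only ever use the new `ListFiles` entry point, and file systems that have not been updated keep working unchanged; their files simply arrive without size or modification time.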

This PR then implements the ListFilesExtended callback for the LocalFileSystem:

  • On Unix, we already did a stat call per file to find out if it was a directory - we now also read the file size (st_size) and last modified time (st_mtime) from the result.
  • On Windows, globbing already returns a WIN32_FIND_DATAW struct for each file that contains the relevant information.
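The Unix case can be sketched with plain POSIX calls. This is an illustrative example rather than the LocalFileSystem code: `LocalFileInfo` and `ListLocalFiles` are hypothetical names, and the sketch assumes a POSIX system where `opendir`, `readdir`, and `stat` are available.

```cpp
#include <dirent.h>
#include <sys/stat.h>

#include <cassert>
#include <cstdint>
#include <ctime>
#include <functional>
#include <string>

// Hypothetical, simplified stand-in for the info reported per listed file.
struct LocalFileInfo {
    std::string path;
    bool is_directory;
    uint64_t file_size;
    std::time_t last_modified;
};

// Lists the entries of `directory`, issuing one stat call per entry and
// reusing its result for the directory check, the size, and the mtime.
bool ListLocalFiles(const std::string &directory, const std::function<void(LocalFileInfo &)> &callback) {
    DIR *dir = opendir(directory.c_str());
    if (!dir) {
        return false;
    }
    struct dirent *entry;
    while ((entry = readdir(dir)) != nullptr) {
        std::string name = entry->d_name;
        if (name == "." || name == "..") {
            continue;
        }
        std::string full_path = directory + "/" + name;
        struct stat status;
        if (stat(full_path.c_str(), &status) != 0) {
            continue; // entry vanished or is inaccessible; skip it
        }
        LocalFileInfo info;
        info.path = full_path;
        info.is_directory = S_ISDIR(status.st_mode);
        info.file_size = static_cast<uint64_t>(status.st_size);
        info.last_modified = status.st_mtime;
        callback(info);
    }
    closedir(dir);
    return true;
}
```

Since the stat call is needed anyway for the directory check, reading `st_size` and `st_mtime` from the same result adds no extra system calls.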

Performance

GetPartitionStats is currently only used for COUNT(*) (see #15301) - although there are plans to have this used in more scenarios.

Single File COUNT(*)

When reading a single file, we always emit GetPartitionStats, even without metadata caching enabled, since we have already read the metadata as part of the initial schema detection step. With these changes, COUNT(*) is always faster in this scenario:

SELECT COUNT(*) FROM hits.parquet;
main    new
0.054s  0.035s

Cached COUNT(*)

When metadata caching is enabled, the cached metadata is used to accelerate COUNT(*):

SET parquet_metadata_cache=true;
SELECT COUNT(*) FROM 'hits_partitioned/*.parquet';
main    new
0.048s  0.003s

Mytherin (Collaborator, Author) commented May 3, 2025

CC @Tishj @samansmink @Tmonster I think the get_partition_stats added in this PR might have to be set to nullptr for Iceberg/Delta (or rewritten/modified), since it does not take deletions into account. After this is merged I would at least add a test like the following to verify that this does not produce incorrect results:

SET parquet_metadata_cache=true;
SELECT COUNT(*) FROM table_with_deletions;
SELECT COUNT(*) FROM table_with_deletions;

@Mytherin Mytherin merged commit a35d8b6 into duckdb:main May 3, 2025
47 checks passed
Mytherin added a commit that referenced this pull request May 5, 2025
#17365)

Follow-up from #17344

When emitting partition stats based on cached Parquet files, we check
the `OpenFileInfo` for a boolean `has_deletes`. If that is set to true -
we skip emitting partition stats. This should allow the partition stats
from cached Parquet files to be used also for Lakehouse formats when
there are no deletes for a given file.

CC @Tishj @samansmink @Tmonster
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 18, 2025
Parquet Reader: emit partition stats for any files that have cached metadata, and implement `ListFilesExtended` that adds extra info to files globbed (duckdb/duckdb#17344)
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 19, 2025
Parquet Reader: emit partition stats for any files that have cached metadata, and implement `ListFilesExtended` that adds extra info to files globbed (duckdb/duckdb#17344)