Grow string dictionary dynamically in Parquet writer #17061

Merged: 3 commits, Apr 12, 2025

lnkuiper
Contributor

The PrimitiveDictionary that we added for dictionary compression in the Parquet writer (currently only on main, to be released in v1.3.0) statically allocates STRING_DICTIONARY_PAGE_SIZE_LIMIT bytes for the strings. This limit defaults to 1 MB. If the total string size exceeds it, DuckDB abandons dictionary compression.

This is restrictive and does not work well when a column contains many long strings that repeat. With this PR, the dictionary is initialized at 1 MB and grows dynamically up to STRING_DICTIONARY_PAGE_SIZE_LIMIT, which now defaults to 1 GB. This allows large strings to still be dictionary-compressed by our writer.
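
To illustrate the idea of starting small and growing up to a hard limit, here is a minimal, self-contained sketch. It is not the code from this PR and not DuckDB's PrimitiveDictionary: the class name `GrowingStringDictionary`, its methods, and the doubling growth policy are all assumptions made for illustration. The only behavior taken from the description above is that the buffer starts at an initial size, grows on demand, and dictionary compression is given up once the configured limit would be exceeded.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>
#include <unordered_map>

// Hypothetical stand-in for a string dictionary; NOT DuckDB's PrimitiveDictionary.
class GrowingStringDictionary {
public:
    GrowingStringDictionary(std::size_t initial_capacity, std::size_t size_limit)
        : capacity_(std::min(initial_capacity, size_limit)), size_limit_(size_limit) {
        buffer_.reserve(capacity_);
    }

    // Returns the dictionary index of `value`, or std::nullopt if storing it
    // would exceed the size limit (the caller would then stop using the dictionary).
    std::optional<std::uint32_t> Insert(std::string_view value) {
        std::string key(value);
        auto it = index_.find(key);
        if (it != index_.end()) {
            return it->second; // repeated string: no additional bytes needed
        }
        const std::size_t required = buffer_.size() + value.size();
        if (required > size_limit_) {
            return std::nullopt; // over the hard limit
        }
        if (required > capacity_) {
            // Assumed policy: double the capacity, clamped to the limit.
            std::size_t new_capacity = std::max<std::size_t>(capacity_, 1);
            while (new_capacity < required) {
                new_capacity = new_capacity > size_limit_ / 2 ? size_limit_ : new_capacity * 2;
            }
            buffer_.reserve(new_capacity);
            capacity_ = new_capacity;
        }
        buffer_.append(value);
        const auto id = static_cast<std::uint32_t>(index_.size());
        index_.emplace(std::move(key), id);
        assert(buffer_.size() <= capacity_ && capacity_ <= size_limit_);
        return id;
    }

    std::size_t Capacity() const { return capacity_; }      // logical capacity in bytes
    std::size_t UsedBytes() const { return buffer_.size(); } // bytes of unique strings stored
    std::size_t Size() const { return index_.size(); }       // number of unique strings

private:
    std::string buffer_; // concatenated unique strings
    std::unordered_map<std::string, std::uint32_t> index_;
    std::size_t capacity_;
    std::size_t size_limit_;
};
```

In this sketch, constructing `GrowingStringDictionary(1 << 20, 1 << 30)` would mirror the 1 MB starting size and 1 GB default limit mentioned above. Repeated long strings cost their bytes only once, so a column with few distinct but large values can stay dictionary-encoded without reserving the full limit up front.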

@duckdb-draftbot duckdb-draftbot marked this pull request as draft April 10, 2025 12:55
@lnkuiper lnkuiper marked this pull request as ready for review April 10, 2025 12:55
@duckdb-draftbot duckdb-draftbot marked this pull request as draft April 11, 2025 07:38
@lnkuiper lnkuiper marked this pull request as ready for review April 11, 2025 07:38
@Mytherin Mytherin merged commit 5538f96 into duckdb:main Apr 12, 2025
54 checks passed
@Mytherin
Collaborator

Thanks!

krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 18, 2025
Grow string dictionary dynamically in Parquet writer (duckdb/duckdb#17061)
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 19, 2025
Grow string dictionary dynamically in Parquet writer (duckdb/duckdb#17061)