Leverage `VectorType` in `ColumnDataCollection` by lnkuiper · Pull Request #17881 · duckdb/duckdb

Merged · 6 commits merged into duckdb:main on Jun 12, 2025

Conversation

lnkuiper (Contributor)

When materializing data in columnar format in the ColumnDataCollection, we (almost) always go to the FLAT representation, i.e., CONSTANT and DICTIONARY Vectors are flattened. For most data types this is OK, but for strings it can cause the size of the intermediates to increase drastically.

Keeping the VectorType the same would be ideal, but when we scan, we expect the Vectors to be FLAT, so preserving it would be a big (and bug-prone) effort; it's not really feasible (for now). Instead, this PR copies over just the unique strings and creates a FLAT Vector of strings that point into them. This solves most of the problem while keeping a very low code footprint.

This optimization always triggers for CONSTANT Vectors, but for DICTIONARY Vectors only if the dictionary size is less than half of our STANDARD_VECTOR_SIZE. The dictionary is currently copied once per Vector; this can be improved in the future by copying the dictionary just once. That is entirely feasible, but more effort to implement. What I've implemented here are the first steps in that direction, but I've chosen to keep this PR simple for now, as I think this solves 80% of the problem with just 20% of the effort.

I did a quick benchmark by copying the l_shipinstruct column from lineitem from TPC-H at SF100 from Parquet to Parquet, and this reduced the peak RSS from 5.1 GB to 3.5 GB. The speed was pretty much the same. I can come up with a lot more degenerate examples where memory usage would be reduced by orders of magnitude.

@duckdb-draftbot duckdb-draftbot marked this pull request as draft June 12, 2025 07:15
@lnkuiper lnkuiper marked this pull request as ready for review June 12, 2025 07:15
@Mytherin Mytherin merged commit fd1b726 into duckdb:main Jun 12, 2025
52 checks passed
@Mytherin (Collaborator)

Thanks!
