Leverage VectorType in ColumnDataCollection #17881
Merged
When materializing data in columnar format in the `ColumnDataCollection`, we (almost) always go to the `FLAT` representation, i.e., `CONSTANT` and `DICTIONARY` `Vector`s are flattened. For most data types this is fine, but for strings it can cause the size of the intermediates to increase drastically.

Keeping the `VectorType` the same would be ideal, but when we scan, we expect the `Vector`s to be `FLAT`, so that would be a big (and bug-prone) effort and is not really feasible (for now). Instead, this PR copies over just the unique strings and creates a `FLAT` `Vector` of strings that point into that copy. This solves most of the problem while keeping a very small code footprint.
This optimization always triggers for `CONSTANT` `Vector`s, but for `DICTIONARY` `Vector`s only if the dictionary size is less than half of our `STANDARD_VECTOR_SIZE`. The dictionary is copied once per `Vector`. This can be improved in the future by copying the dictionary just once; that is entirely feasible, but more effort to implement. What I've implemented here are the first steps in that direction, but I've chosen to keep this PR simple for now, as I think it solves 80% of the problem with 20% of the effort.
I did a quick benchmark by copying the `l_shipinstruct` column from `lineitem` from TPC-H at SF100 from Parquet to Parquet, and this reduced the peak RSS from 5.1 GB to 3.5 GB. The speed was pretty much the same. I can come up with a lot more degenerate examples where memory usage would be reduced by orders of magnitude.