Leverage VectorType in ColumnDataCollection #17881
Merged
When materializing data in columnar format in the `ColumnDataCollection`, we (almost) always go to the `FLAT` representation, i.e., `CONSTANT` and `DICTIONARY` `Vector`s are flattened. For most data types this is fine, but for strings it can cause the size of the intermediates to increase drastically.

Keeping the `VectorType` the same would be ideal, but when we scan, we expect the `Vector`s to be `FLAT`, so that would be a big (and bug-prone) effort and is not really feasible (for now). Instead, this PR copies over just the unique strings and creates a `FLAT` `Vector` of strings that point into that copy. This solves most of the problem while keeping a very small code footprint.
This optimization always triggers for `CONSTANT` `Vector`s, but for `DICTIONARY` `Vector`s only if the dictionary size is less than half of our `STANDARD_VECTOR_SIZE`. The dictionary is copied once per `Vector`. This can be improved in the future by copying the dictionary just once; that is entirely feasible, but more effort to implement. What I've implemented here are the first steps in that direction, but I've chosen to keep this PR simple for now, as I think it solves 80% of the problem with 20% of the effort.
I did a quick benchmark by copying the `l_shipinstruct` column from `lineitem` from TPC-H at SF100 from Parquet to Parquet, and this reduced the peak RSS from 5.1 GB to 3.5 GB. The speed was pretty much the same. I can come up with a lot more degenerate examples where memory usage would be reduced by orders of magnitude.