8000 Bitpacking mode info by arjenpdevries · Pull Request #15623 · duckdb/duckdb · GitHub

Bitpacking mode info #15623


Merged
merged 23 commits into from
Feb 6, 2025

Conversation

arjenpdevries
Contributor

Version of #15622 without (I think) merge conflicts, which was a continuation of #15544

This implements the suggested serialize/deserialize, in line with the suggestions by @Tishj (and building on his initial steps).

More detailed info in #15544

…Santa did not bring it;

so I tried myself ;-)

This pull request does the following:

- Creates a BitpackingSegmentState that overrides GetSegmentInfo to return mode info
- Adds BitpackingInitSegment, which returns the segment info (used in ColumnData)
- Adds BitpackingInitSegment / init_segment to the CompressionFunction

(I also renamed BitpackingCompressState to BitpackingCompressionState for
consistency with other compression code.)

The approach implemented seems to work correctly and passes tests locally.

However, the new information is only shown after you close the database and reopen it.
If you only run `CHECKPOINT`, `init_segment` is not called and the modes are not reported.
I am stuck on where to fix this; it seems to involve a step related to `ConvertToPersistent`
that I overlook, but I do not understand what I need to do to make it work.

I have tested using the following SQL (not yet added as a test because of this incomplete behaviour):

```sql
PRAGMA force_compression = 'BitPacking';
CREATE OR REPLACE TABLE test (a INTEGER, b INTEGER);
INSERT INTO test VALUES (10,12), (11,12), (12,11), (NULL,NULL);
INSERT INTO test VALUES (10,12), (33,33), (33,33), (10,12);
SELECT segment_id, row_group_id, block_id, block_offset, compression, segment_info FROM pragma_storage_info('test') ORDER BY segment_id, row_group_id ASC;
CHECKPOINT;
SELECT segment_id, row_group_id, block_id, block_offset, compression, segment_info FROM pragma_storage_info('test') ORDER BY segment_id, row_group_id ASC;
```

After closing the CLI and restarting it on the same database, and then doing
```sql
SELECT segment_id, row_group_id, block_id, block_offset, compression, segment_info FROM pragma_storage_info('test') ORDER BY segment_id, row_group_id ASC;
```

the `pragma_storage_info` reports `segment_info` that includes the bitpacking mode.

Hoping for some DuckDB compression guru advice, A.

```
D select segment_id, row_group_id, block_id, block_offset, compression, segment_info FROM pragma_storage_info('test') order by segment_id, row_group_id  asc;
┌────────────┬──────────────┬──────────┬──────────────┬──────────────┬──────────────┐
│ segment_id │ row_group_id │ block_id │ block_offset │ compression  │ segment_info │
│   int64    │    int64     │  int64   │    int64     │   varchar    │   varchar    │
├────────────┼──────────────┼──────────┼──────────────┼──────────────┼──────────────┤
│          0 │            0 │        1 │            0 │ BitPacking   │ Mode: for    │
│          0 │            0 │        1 │           48 │ Uncompressed │              │
│          0 │            0 │        1 │          304 │ BitPacking   │ Mode: for    │
│          0 │            0 │        1 │          352 │ Uncompressed │              │
└────────────┴──────────────┴──────────┴──────────────┴──────────────┴──────────────┘
```
@duckdb-draftbot duckdb-draftbot marked this pull request as draft January 8, 2025 19:33
@arjenpdevries arjenpdevries marked this pull request as ready for review January 8, 2025 20:17
@duckdb-draftbot duckdb-draftbot marked this pull request as draft January 8, 2025 21:52
@arjenpdevries arjenpdevries marked this pull request as ready for review January 8, 2025 21:53
```cpp
	return std::move(result);
}

void CleanupState(ColumnSegment &segment) {
```

@Tishj (Contributor) commented Jan 8, 2025


I don't think we need this. A cleanup state is necessary for state whose lifetime extends outside the scope of the segment, such as additional blocks the segment has created, which is what UncompressedStringSegmentState uses it for.

UncompressedStringStorage can create additional blocks to hold "overflow" strings that are bigger than a certain threshold (I think it's 4096 by default).
When the column segment that these additional blocks are a part of gets dropped, we need to inform the block manager that these additional blocks can now also be safely reused.

@arjenpdevries (Contributor, Author)


Of course, should have realized. Will remove.

@arjenpdevries (Contributor, Author)


Just out of curiosity: we now write extra information; could this require an extra block (in the case where everything was filled completely with data and I want to write out the bit-mode histogram), or does this extra serialization data get written to a separate place in storage?

@Tishj (Contributor) left a comment


Thanks, this looks like it'll work from reading the code 👍
I do, however, want to see at least one test making sure this is working properly.

Looking at:

```cpp
static constexpr const idx_t BITPACKING_METADATA_GROUP_SIZE = STANDARD_VECTOR_SIZE > 512 ? STANDARD_VECTOR_SIZE : 2048;
```

We should be deciding which mode to use per 2048 values, so given 10000 tuples to compress, this should result in a total of 5 groups.
(Also, your test should probably use `require vector_size 2048`, because the vector size impacts the group size.)

@duckdb-draftbot duckdb-draftbot marked this pull request as draft January 9, 2025 07:56
@arjenpdevries arjenpdevries marked this pull request as ready for review January 9, 2025 07:58
@Tishj (Contributor) left a comment


Thanks, looks great! 👍

@Mytherin (Collaborator) left a comment


Thanks for the PR!

Unless I'm misunderstanding the PR, this seems to store more information in the database only to provide the segment information during the storage_info callback. The new serialize/deserialize logic will also hurt forward compatibility of database files.

Given this information exists elsewhere already in the blocks, perhaps we can instead introduce a separate get_segment_info callback to the compression method so that we don't need to introduce the new compressed segment state?

@Tishj (Contributor) commented Jan 9, 2025

The PR adds segment_info for the bitpacked column segments

```
query I
SELECT
	segment_info
FROM
	pragma_storage_info('test_bp')
WHERE segment_type NOT IN ('VALIDITY')
----
CONSTANT: 9, DELTA_FOR: 1
```

I'm not sure about the serialization / forwards-compatibility complications, but this sounds like a welcome change for observability?
But now that I think about it, I guess this does get complicated by the fact that old versions don't produce this information, and we need to handle that safely.

I also feel like logging this information during checkpointing would be a nicer way to provide this observability; waiting patiently on #15119 🙏


I think I just understood this part:

Given this information exists elsewhere already in the blocks, perhaps we can instead introduce a separate get_segment_info callback to the compression method so that we don't need to introduce the new compressed segment state?

This is the existing code to get segment info.
Because we have access to the segment, we could scan the segment here instead.
I thought that was too expensive, but perhaps introducing serialization is more expensive in terms of backwards compatibility.
(As shown in later comments, scanning the block for this information is pretty trivial.)

```cpp
		auto segment_state = segment->GetSegmentState();
		if (segment_state) {
			column_info.segment_info = segment_state->GetSegmentInfo();
			column_info.additional_blocks = segment_state->GetAdditionalBlocks();
		}
```

We can add a get_segment_info method to the CompressionFunction; when it's set, we can call that instead of getting the segment state.

@Mytherin (Collaborator) commented Jan 9, 2025

The change itself is welcome for sure - my point is that this information is already stored in the (compressed) data itself. We should be able to just read it from the blocks. That then (1) also works with older database files, (2) does not require storing additional data in the database file, increasing the size, and (3) does not break the ability of older DuckDB versions to read this compression due to the introduction of new information they do not understand.

@arjenpdevries (Contributor, Author)

Thanks Mark!

My initial approach used only init-segment, but didn't work at the moment of checkpointing; in discussion we switched to this design.

I guess I can alternatively scan all segments in init-segment to recreate this information upon the column being loaded from disk. Have to look into the exact mechanics but will try.

(It may imply a longer load time?)

@Tishj (Contributor) commented Jan 9, 2025

```cpp
	void LoadNextGroup() {
		D_ASSERT(bitpacking_metadata_ptr > handle.Ptr() &&
		         bitpacking_metadata_ptr < handle.Ptr() + current_segment.GetBlockManager().GetBlockSize());
		current_group_offset = 0;
		current_group = DecodeMeta(reinterpret_cast<bitpacking_metadata_encoded_t *>(bitpacking_metadata_ptr));

		bitpacking_metadata_ptr -= sizeof(bitpacking_metadata_encoded_t);
```

All the group metadata is neatly packed together, so scanning just that portion should be cheap and efficient

The ColumnSegment has a count that tells us how many tuples are stored on it.
As discussed before, BITPACKING_METADATA_GROUP_SIZE is the number of tuples stored in a group.

```cpp
map<BitpackingMode, idx_t> counts;
auto tuple_count = segment.count.load();
BitpackingScanState<T> scan_state(segment); // the constructor loads the first group
for (idx_t i = 0; i < tuple_count; i += BITPACKING_METADATA_GROUP_SIZE) {
	counts[scan_state.current_group.mode]++;
	if (i + BITPACKING_METADATA_GROUP_SIZE < tuple_count) {
		scan_state.LoadNextGroup(); // advance to the next group's metadata
	}
}

// stringify the counts and return
```

should be all you need 👍

@Mytherin (Collaborator) commented Jan 9, 2025

> Thanks Mark!
>
> My initial approach used only init-segment, but didn't work at the moment of checkpointing; in discussion we switched to this design.
>
> I guess I can alternatively scan all segments in init-segment to recreate this information upon the column being loaded from disk. Have to look into the exact mechanics but will try.
>
> (It may imply a longer load time?)

GetSegmentInfo is only called very rarely (when the storage_info function is called) - and is not very performance sensitive. Ideally the code introduced in this PR is only ever run when the user calls storage_info - and does not impact any other (much more common) code paths (i.e. does not slow down compression, load, etc).

@Tishj (Contributor) commented Jan 9, 2025

> Thanks Mark!
>
> My initial approach used only init-segment, but didn't work at the moment of checkpointing; in discussion we switched to this design.
>
> I guess I can alternatively scan all segments in init-segment to recreate this information upon the column being loaded from disk. Have to look into the exact mechanics but will try.
>
> (It may imply a longer load time?)

Yes, so you were mostly on the right track to do what Mark is suggesting, but it was being done in the wrong place.
You were doing this in InitSegment, which receives both fresh and existing blocks.
For an existing block this information is available, but for a fresh one it doesn't exist yet.

See the second part of #15623 (comment)

@duckdb-draftbot duckdb-draftbot marked this pull request as draft January 16, 2025 14:56
@arjenpdevries arjenpdevries marked this pull request as ready for review January 18, 2025 17:29
@arjenpdevries (Contributor, Author)

So I followed the suggestion to use the map, but kept segment_info a VARCHAR column for now. LMK if you'd rather see that changed. Otherwise it seems good to go.

@Tishj (Contributor) left a comment


Thanks, I think it's fine to keep it as VARCHAR for now 👍

@duckdb-draftbot duckdb-draftbot marked this pull request as draft January 22, 2025 17:58
@arjenpdevries arjenpdevries marked this pull request as ready for review January 22, 2025 17:58
@arjenpdevries (Contributor, Author)

(only removed two includes, it should pass tests as before)

@Mytherin (Collaborator)

Looks like they were not unnecessary :)

Turn the return type into an InsertionOrderPreservingMap<string>, similar to PhysicalOperator::ParamsToString (following a suggestion by Mark).
Keep the segment_info column of type VARCHAR to limit changes.
@arjenpdevries (Contributor, Author)

OK don't believe the LSP without checking twice...

@duckdb-draftbot duckdb-draftbot marked this pull request as draft January 23, 2025 01:50
@arjenpdevries arjenpdevries marked this pull request as ready for review January 23, 2025 07:58
@arjenpdevries (Contributor, Author)

Will not touch it any more! 😀

@Mytherin Mytherin merged commit 670e905 into duckdb:main Feb 6, 2025
48 checks passed
@Mytherin (Collaborator) commented Feb 6, 2025

Thanks!

@arjenpdevries arjenpdevries deleted the bitpacking-mode-info branch February 21, 2025 08:26
Antonov548 added a commit to Antonov548/duckdb-r that referenced this pull request Feb 26, 2025
krlmlr pushed a commit to duckdb/duckdb-r that referenced this pull request Mar 5, 2025
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 15, 2025
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 15, 2025
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 17, 2025
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 18, 2025