Support explicit aggregate state export, re-combination and finalisation #2998

hannes · 2022-01-27T08:59:42Z

This PR adds a feature to explicitly control aggregate state combination and finalization. Aggregates can be asked to export their internal state instead of the final result by using the EXPORT_STATE modifier on the aggregate function.

For example, while SELECT SUM(42) returns - predictably - 42, SELECT SUM(42) EXPORT_STATE will return a weird sequence of bytes. This byte sequence is not directly useful, because it is a representation of the internal state of the SUM aggregate function after looking at all values that were passed in (only 42 in this case). This state can be saved in a table, passed around, and later used in the new functions COMBINE and FINALIZE.

FINALIZE converts the aggregate state back into the result of the aggregation. For example, SELECT FINALIZE(SUM(42) EXPORT_STATE) is a not very elegant way of getting the same result as SELECT SUM(42).

COMBINE can combine aggregate states from two aggregations with the same function and data types. So we could say SELECT COMBINE(SUM(42) EXPORT_STATE, SUM(24) EXPORT_STATE), which would take the aggregate states of both SUM aggregations and merge them. The result of COMBINE is the combined aggregate state, which can then be FINALIZEd. As one might expect, SELECT FINALIZE(COMBINE(SUM(42) EXPORT_STATE, SUM(24) EXPORT_STATE)) returns the same as SELECT SUM(42) + SUM(24), namely 66. As the result of COMBINE is just another aggregate state, it can be chained, e.g. SELECT FINALIZE(COMBINE(COMBINE(SUM(42) EXPORT_STATE, SUM(24) EXPORT_STATE), SUM(12) EXPORT_STATE)).
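
To make the export / combine / finalize round trip concrete, here is a minimal conceptual model in Python. It is not how DuckDB stores a SUM state: the 8-byte little-endian layout and the helper names export_state, combine and finalize are assumptions made up for illustration only.

```python
import struct

# Hypothetical state layout: one signed 64-bit running total.
# DuckDB's real SUM state is an internal detail and may look different.
STATE_FORMAT = "<q"

def export_state(values):
    """Aggregate the inputs but return the opaque state bytes, not the result."""
    return struct.pack(STATE_FORMAT, sum(values))

def combine(state_a, state_b):
    """Merge two states of the same aggregate into a new state."""
    (a,) = struct.unpack(STATE_FORMAT, state_a)
    (b,) = struct.unpack(STATE_FORMAT, state_b)
    return struct.pack(STATE_FORMAT, a + b)

def finalize(state):
    """Turn a state back into the result of the aggregation."""
    (total,) = struct.unpack(STATE_FORMAT, state)
    return total

# COMBINE returns another state, so calls can be nested before finalizing.
two_way = combine(export_state([42]), export_state([24]))
three_way = combine(two_way, export_state([12]))
print(finalize(two_way))    # 66
print(finalize(three_way))  # 78
```

The point of the model is only that the intermediate value is an opaque byte string that stays combinable until FINALIZE is applied.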

It is not allowed to combine states of different aggregates, e.g. SELECT COMBINE(SUM(42) EXPORT_STATE, AVG(24) EXPORT_STATE) will throw an error. It is also not allowed to combine aggregate states operating on different types, since the aggregate will have a different internal state depending on the input type. For example, if we create a simple table:

CREATE TABLE test (a INTEGER, b double);
INSERT INTO test VALUES (42, 4.2);

We cannot combine a SUM of a with a SUM of b. SELECT COMBINE(SUM(a) EXPORT_STATE, SUM(b) EXPORT_STATE) FROM test will throw this error: Cannot COMBINE aggregate states from different functions, sum(INTEGER)::HUGEINT <> sum(DOUBLE)::DOUBLE. Isn't that helpful?

What is of course allowed is to combine states if you cast the argument types first, e.g. SELECT FINALIZE(COMBINE(SUM(a::DOUBLE) EXPORT_STATE, SUM(b::DOUBLE) EXPORT_STATE)) FROM test will return 46.2 as it should.
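
The compatibility rule can be pictured as each state carrying the signature of the aggregate that produced it. The sketch below is a rough Python model of that idea, not DuckDB's code: the State class, its fields and the cast_to_double helper are hypothetical, and only the error text is taken from the message quoted above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class State:
    signature: str  # e.g. "sum(INTEGER)::HUGEINT" (illustrative tag)
    total: float    # simplified payload; the real state is opaque

def combine(a, b):
    """Merge two states, refusing states with different signatures."""
    if a.signature != b.signature:
        raise TypeError(
            "Cannot COMBINE aggregate states from different functions, "
            f"{a.signature} <> {b.signature}"
        )
    return State(a.signature, a.total + b.total)

def cast_to_double(value):
    """Model SUM(a::DOUBLE): the cast happens before the state is built."""
    return State("sum(DOUBLE)::DOUBLE", float(value))

int_state = State("sum(INTEGER)::HUGEINT", 42)   # SUM(a)
dbl_state = State("sum(DOUBLE)::DOUBLE", 4.2)    # SUM(b)

try:
    combine(int_state, dbl_state)
except TypeError as err:
    print(err)

# After casting the argument, both states share a signature and combine fine.
merged = combine(cast_to_double(42), dbl_state)
print(round(merged.total, 1))
```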

COMBINE has special NULL handling. If one of the arguments to COMBINE is NULL and the other is not, the result will be the non-NULL argument. If both arguments are NULL, the result is also going to be NULL. This non-standard behavior is chosen to allow chaining of aggregates without a lot of CASE expressions.
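
A small Python model of this NULL rule, with None standing in for NULL and plain integers standing in for SUM states (both are simplifications, and combine_nullable is a hypothetical name), shows why it makes chaining painless:

```python
from functools import reduce
from typing import Optional

def combine_nullable(a: Optional[int], b: Optional[int]) -> Optional[int]:
    """COMBINE-style merge: a NULL side is ignored, not propagated."""
    if a is None:
        return b          # also covers the case where both are NULL
    if b is None:
        return a
    return a + b          # both present: merge the two SUM states

# Some inputs produced no state at all; no CASE-like guard is needed.
states = [None, 42, None, 24]
print(reduce(combine_nullable, states))       # 66
print(combine_nullable(None, None))           # None
```

With ordinary NULL propagation the first None would have poisoned the whole chain.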

Some say this feature can be used to jury-rig a fleet of DuckDB instances to compute distributed aggregation results without repartitioning the data.
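
As a sketch of that distributed use, the Python below pretends three instances each aggregate their own partition and ship only a state to a coordinator. The (sum, count) tuple for AVG and all function names are assumptions for illustration; nothing here calls DuckDB.

```python
from functools import reduce

def export_avg_state(values):
    """Per-instance step: reduce a local partition to an AVG state."""
    return (sum(values), len(values))

def combine(a, b):
    """Coordinator step: merge two AVG states component-wise."""
    return (a[0] + b[0], a[1] + b[1])

def finalize(state):
    """Turn the merged state into the final average (None if no rows)."""
    total, count = state
    return total / count if count else None

# Three hypothetical instances, each holding its own slice of the data.
partitions = [[1, 2, 3], [4, 5], [6]]
states = [export_avg_state(p) for p in partitions]

global_avg = finalize(reduce(combine, states))
naive_avg = sum(finalize(s) for s in states) / len(states)

print(global_avg)  # 3.5, same as averaging all six rows in one place
print(naive_avg)   # differs: averaging per-partition averages is wrong
```

Shipping states rather than finalized results is what keeps the answer correct when partitions have different sizes.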

CC @Y-- @dforsber

Mytherin

Looks excellent! Exciting stuff. Some comments:

src/function/scalar/system/aggregate_export.cpp

test/sql/aggregate/aggregates/test_state_export.test

hawkfish

Some more feedback, but getting there. Sorry I don't have more time.

…med 'state' which should be numerous and also occur in TPCDS

hannes · 2022-01-30T05:33:16Z

Renamed EXPORT STATE to EXPORT_STATE after all, because otherwise the unquoted column name state stops working, and it's too common to justify that IMHO.

Alex-Monahan · 2022-01-30T05:42:08Z

All the Geospatial folks would have had a rough time... ;-)

Mytherin

Looks great! Ready to merge after feature freeze ends.

hannes added 12 commits January 24, 2022 13:38

added EXPORT STATE keyword to aggregate parsing

eca0b38

transform EXPORT STATE to new flag in function expression

c3e1def

binding the export modifier and adding the FINALIZE() scalar function

3d4d174

added combine function, aggregate export survives optimizer and query…

992d89a

… verification now

added COMBINE function and initial test case

4592c1a

merge

625b5fe

merge

2d92489

fix optimizer problem, we cannot do statistics propagation for those …

cab5fb4

…since it might lead to different state types depending on where aggregate is run

minor cleanup, orrification next

29a420c

merge

7dd5ec7

vectorized finalize and orrification for finalize and combine

4f7cd5c

add local state for combine and finalize to stop leaking

b290e49

hannes requested a review from Mytherin January 27, 2022 08:59

hannes added 2 commits January 27, 2022 11:50

fixing deserialization

c8a299c

added nicer stringification for AGGREGATE_STATE

f56c868

Mytherin reviewed Jan 27, 2022

hannes added 2 commits January 28, 2022 08:51

adding error when trying to export window function state

d2856be

checking for presence of custom bind functions and throw error if pre…

da7a04b

…sent for now