-
Notifications
You must be signed in to change notification settings - Fork 2.4k
Support explicit aggregate state export, re-combination and finalisation #2998
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
… verification now
…since it might lead to different state types depending on where aggregate is run
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks excellent! Exciting stuff. Some comments:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some more feedback, but getting there. Sorry I don't have more time.
…med 'state' which should be numerous and also occur in TPCDS
Renamed |
All the Geospatial folks would have had a rough time... ;-) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! Ready to merge after feature freeze ends.
This PR adds a feature to explicitly control aggregate state combination and finalization. Aggregate can be requested to export their internal state instead of the final result using the
EXPORT_STATE
modifier on the aggregate function.For example, while
SELECT SUM(42)
returns - predictably -42
,SELECT SUM(42) EXPORT_STATE
will return a weird sequence of bytes. This byte sequence is not directly useful, because it is a representation of the internal state of theSUM
aggregate function after looking at all values that were passed in (only42
in this case). This state can be saved in a table, passed around, and later used in the new functionsCOMBINE
andFINALIZE
.FINALIZE
converts the aggregate state back into the result of the aggregation. For example,SELECT FINALIZE(SUM(42) EXPORT_STATE)
is a not very elegant way of getting the same result asSELECT SUM(42)
.COMBINE
can combine aggregate states from two aggregations with the same function and data types. So we could saySELECT COMBINE(SUM(42) EXPORT_STATE, SUM(24) EXPORT_STATE)
, which would take the aggregate state of bothSUM
aggregations and combine the result. The result of combine is the combined aggregate state, which can then beFINALIZE
d. As one might expect,SELECT FINALIZE(COMBINE(SUM(42) EXPORT_STATE, SUM(24) EXPORT_STATE))
returns the same asSELECT SUM(42) + SUM(24)
,66
. As the result ofCOMBINE
is just another aggregate state, it can be chained, e.g.SELECT FINALIZE(COMBINE(COMBINE(SUM(42) EXPORT_STATE, SUM(24) EXPORT_STATE), SUM(12) EXPORT_STATE))
.It is not allowed to combine states of different aggregates, e.g.
SELECT COMBINE(SUM(42) EXPORT_STATE, AVG(24) EXPORT_STATE)
will throw an error. It is also not allowed to combine aggregate states operating on different types, since depending on the input type the aggregate will have different internal states. For example, if we create a simple tableWe cannot combine a
SUM
ofa
andb
, this will throw an error:SELECT COMBINE(SUM(a) EXPORT_STATE, SUM(b) EXPORT_STATE) FROM test
,Cannot COMBINE aggregate states from different functions, sum(INTEGER)::HUGEINT <> sum(DOUBLE)::DOUBLE
. Isn't that helpful?What is of course allowed is to combine states if you cast the argument types first, e.g.
SELECT FINALIZE(COMBINE(SUM(a::DOUBLE) EXPORT_STATE, SUM(b::DOUBLE) EXPORT_STATE)) FROM test
will return46.2
as it should.COMBINE
has specialNULL
handling. If one of the arguments toCOMBINE
isNULL
and the other is not, the result will be the non-NULL
argument. If both arguments areNULL
, the result is also going to beNULL
. This non-standard behavior is chosen to allow chaining of aggregates without a lot ofCASE
expressions.Some say this feature can be used to jury-rig a float of DuckDB instances to compute distributed aggregation results without repartitioning the data.
CC @Y-- @dforsber