8000 Support explicit aggregate state export, re-combination and finalisation by hannes · Pull Request #2998 · duckdb/duckdb · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Support explicit aggregate state export, re-combination and finalisation #2998

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 23 commits into from
Feb 10, 2022

Conversation

hannes
Copy link
Member
@hannes hannes commented Jan 27, 2022

This PR adds a feature to explicitly control aggregate state combination and finalization. Aggregate can be requested to export their internal state instead of the final result using the EXPORT_STATE modifier on the aggregate function.

For example, while SELECT SUM(42) returns - predictably - 42, SELECT SUM(42) EXPORT_STATE will return a weird sequence of bytes. This byte sequence is not directly useful, because it is a representation of the internal state of the SUM aggregate function after looking at all values that were passed in (only 42 in this case). This state can be saved in a table, passed around, and later used in the new functions COMBINE and FINALIZE.

FINALIZE converts the aggregate state back into the result of the aggregation. For example, SELECT FINALIZE(SUM(42) EXPORT_STATE) is a not very elegant way of getting the same result as SELECT SUM(42).

COMBINE can combine aggregate states from two aggregations with the same function and data types. So we could say SELECT COMBINE(SUM(42) EXPORT_STATE, SUM(24) EXPORT_STATE), which would take the aggregate state of both SUM aggregations and combine the result. The result of combine is the combined aggregate state, which can then be FINALIZEd. As one might expect, SELECT FINALIZE(COMBINE(SUM(42) EXPORT_STATE, SUM(24) EXPORT_STATE)) returns the same as SELECT SUM(42) + SUM(24), 66. As the result of COMBINE is just another aggregate state, it can be chained, e.g. SELECT FINALIZE(COMBINE(COMBINE(SUM(42) EXPORT_STATE, SUM(24) EXPORT_STATE), SUM(12) EXPORT_STATE)).

It is not allowed to combine states of different aggregates, e.g. SELECT COMBINE(SUM(42) EXPORT_STATE, AVG(24) EXPORT_STATE) will throw an error. It is also not allowed to combine aggregate states operating on different types, since depending on the input type the aggregate will have different internal states. For example, if we create a simple table

CREATE TABLE test (a INTEGER, b double);
INSERT INTO test VALUES (42, 4.2);

We cannot combine a SUM of a and b, this will throw an error: SELECT COMBINE(SUM(a) EXPORT_STATE, SUM(b) EXPORT_STATE) FROM test, Cannot COMBINE aggregate states from different functions, sum(INTEGER)::HUGEINT <> sum(DOUBLE)::DOUBLE. Isn't that helpful?

What is of course allowed is to combine states if you cast the argument types first, e.g. SELECT FINALIZE(COMBINE(SUM(a::DOUBLE) EXPORT_STATE, SUM(b::DOUBLE) EXPORT_STATE)) FROM test will return 46.2 as it should.

COMBINE has special NULL handling. If one of the arguments to COMBINE is NULL and the other is not, the result will be the non-NULL argument. If both arguments are NULL, the result is also going to be NULL. This non-standard behavior is chosen to allow chaining of aggregates without a lot of CASE expressions.

Some say this feature can be used to jury-rig a float of DuckDB instances to compute distributed aggregation results without repartitioning the data.

CC @Y-- @dforsber

@hannes hannes requested a review from Mytherin January 27, 2022 08:59
Copy link
Collaborator
@Mytherin Mytherin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks excellent! Exciting stuff. Some comments:

Copy link
Contributor
@hawkfish hawkfish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more feedback, but getting there. Sorry I don't have more time.

…med 'state' which should be numerous and also occur in TPCDS
@hannes
Copy link
Member Author
hannes commented Jan 30, 2022

Renamed EXPORT STATE to EXPORT_STATE after all because otherwise the unquoted column name state stops working and its too common to justify that IMHO

@Alex-Monahan
Copy link
Contributor

All the Geospatial folks would have had a rough time... ;-)

@hannes hannes added the feature label Jan 31, 2022
Copy link
Collaborator
@Mytherin Mytherin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Ready to merge after feature freeze ends.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants
0