Adding WITH ORDINALITY to DuckDB #16581
base: main
Conversation
…arting to lose my mind over here)
Thanks for the PR! I think there are two distinct steps that need to happen to have
Ideally this is generated during the bind phase already - i.e. when binding a scanner table function that has ordinality specified, the
…ess with, i discarded these changes)
I'd like to note that when using any version higher than 24.1.0 of the Python module black, the following files get changed when running "make format-fix":
Just so you know, it seems like this package has changed its formatting rules: it now inserts blank lines after certain comments.
…al get is not informed anymore & correct the tests (typos)
Hey @Mytherin,
Regarding my previous comment: now it is really ready for review 😅 Please let me know your thoughts 🙏
Thanks for the PR! Great that this is being picked up again. Looks good - some comments below.
It would be great to get some more tests going as well (a sketch of one of these follows the list):
- WITH ORDINALITY on larger tables, e.g. you could use the TPC-H data generator and then create tests with those
- WITH ORDINALITY used in a view definition
- WITH ORDINALITY on a view (should this even work?)
- WITH ORDINALITY on a CTE
- WITH ORDINALITY on a recursive CTE
- WITH ORDINALITY used within a recursive CTE
- WITH ORDINALITY where the table already contains a column called "ordinality"
- Filters on the ORDINALITY column
- Projecting out all columns except for the ORDINALITY column
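A sketch of what one of these could look like in the sqllogictest format used elsewhere in this PR, here for the "filters on the ORDINALITY column" case (the table name nums, the default column name ordinality, and the expected output are assumptions, since the feature is still in progress):

# hypothetical test: filter on the generated ORDINALITY column
statement ok
CREATE TABLE nums AS SELECT range AS x FROM range(10);

query II
SELECT x, ordinality FROM nums WITH ORDINALITY WHERE ordinality <= 2 ORDER BY ordinality;
----
0	1
1	2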
namespace duckdb {

enum class Ordinality_request_t : uint8_t { NOT_REQUESTED = 0, REQUESTED = 1 };
Can we use consistent naming with the rest of the code base, i.e. call this OrdinalityType? And instead of NOT_REQUESTED, maybe use WITH_ORDINALITY and WITHOUT_ORDINALITY, i.e.: OrdinalityType::WITH_ORDINALITY
enum class Ordinality_request_t : uint8_t { NOT_REQUESTED = 0, REQUESTED = 1 };

struct ordinality_data_t {
OrdinalityData
@@ -87,6 +96,8 @@ OperatorResultType PhysicalTableInOutFunction::Execute(ExecutionContext &context
	state.input_chunk.SetCardinality(1);
	state.row_index++;
	state.new_row = false;
	lock_guard<mutex> guard(gstate.ordinality_lock);
	gstate.current_ordinality_idx = 1;
Given that the other scan state is all located in the local state, I suspect the global state is the wrong place for current_ordinality_idx. Do we have a test with a larger UNNEST or similar table function that triggers multi-threading?
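A sketch of the kind of larger-scale test that might exercise this, building on the LATERAL range() test already in this PR (the table name big_test and the expected values are assumptions, and it assumes the ordinality counter restarts at 1 for every lateral invocation):

# hypothetical larger in-out function test that may run multi-threaded
statement ok
CREATE TABLE big_test AS SELECT range + 1 AS a FROM range(1000);

query II
SELECT count(*), max(ordinality)
FROM big_test AS t, LATERAL range(t.a) WITH ORDINALITY AS _(r, ordinality);
----
500500	1000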
	auto result = function.in_out_function(context, data, input, chunk);
	if (function.ordinality_data.ordinality_request == Ordinality_request_t::REQUESTED) {
		const idx_t ordinality = chunk.size();
		function.ordinality_data.SetOrdinality(chunk, gstate.current_ordinality_idx, ordinality);
We are reading gstate.current_ordinality_idx without a lock, whereas we are modifying it below with a lock - one of these is wrong. I suspect we need to move this to the thread-local state anyway, in which case we don't need a lock at all.
@@ -120,6 +123,12 @@ SourceResultType PhysicalTableScan::GetData(ExecutionContext &context, DataChunk
		break;
	}

	if (function.ordinality_data.ordinality_request == Ordinality_request_t::REQUESTED) {
		idx_t ordinality = chunk.size();
		function.ordinality_data.SetOrdinality(chunk, g_state.ordinality_idx, ordinality);
For PhysicalTableScan - I think we can instead perform this using a window function. This will currently be incorrect when multi-threading.
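One way to probe the multi-threading concern in a test would be to check that the generated column is still a permutation of 1..N on a table large enough to be scanned by multiple threads (the table name big, the row count, and the default column name ordinality are assumptions):

# hypothetical check that ordinality covers exactly 1..N even with a parallel scan
statement ok
CREATE TABLE big AS SELECT range AS i FROM range(1000000);

query III
SELECT count(DISTINCT ordinality), min(ordinality), max(ordinality) FROM big WITH ORDINALITY;
----
1000000	1	1000000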
We should not need to re-generate the serialized plans here. If this is necessary, something is going wrong with serialization, which can impact forward and/or backward compatibility. Can you revert this change and instead fix what caused this to be necessary?
Same here - we should not be re-generating the storage version
PRAGMA enable_verification

query II
SELECT o,range FROM range(1) WITH ORDINALITY AS _(range,o);
Can we unify some of these test files? We are adding a lot of test files that contain almost no tests in them. Perhaps we can have two files (one for table in-out functions, one for "standard" table functions like read_csv, read_parquet and table scans).
query III
SELECT * FROM read_csv('test/sql/ordinality/a1.csv') WITH ORDINALITY ORDER BY Numbers,Chars,ordinality;
----
7800 values hashing to d86e8a5e8df33cec28e0ed7b6c386419
In general we prefer to avoid using the hash comparison feature - we can verify that two results are the same using labeled results instead, e.g.:
query III nosort read_csv_result
SELECT * FROM read_csv('test/sql/ordinality/a1.csv') WITH ORDINALITY ORDER BY Numbers,Chars,ordinality;
query III nosort read_csv_result
SELECT *, row_number() OVER () AS ordinality FROM read_csv('test/sql/ordinality/a1.csv') ORDER BY Numbers,Chars,ordinality;
query III
SELECT * FROM test AS t, LATERAL range(t.a) WITH ORDINALITY AS _(range,ordinality) ORDER BY range;
Can we alias ordinality to something else and explicitly select it to ensure the aliasing works? e.g. SELECT my_ord FROM range(t.a) WITH ORDINALITY AS _(range, my_ord)
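A sketch of such a test, mirroring the range() syntax used elsewhere in this PR (the expected output is an assumption):

# hypothetical aliasing test: the ordinality column is renamed and selected explicitly
query I
SELECT my_ord FROM range(3) WITH ORDINALITY AS _(r, my_ord) ORDER BY my_ord;
----
1
2
3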
This is a continuation of #9014
I am resuming work on this functionality. If I understand correctly, you still want me to implement this feature by internally producing a ROW_NUMBER() window function instead of producing the column manually.
My first step is to take another look at how these operators work and then try to find a solution.
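As a sketch of that rewrite on a hypothetical table tbl with a single column col (it mirrors the row_number() comparison suggested in the review above and is not the final implementation):

-- user-facing form
SELECT * FROM tbl WITH ORDINALITY AS t(col, ord);

-- what the binder would effectively plan it as
SELECT *, row_number() OVER () AS ord FROM tbl;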