8000 Adding WITH ORDINALITY to DuckDB by niykko · Pull Request #16581 · duckdb/duckdb · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Adding WITH ORDINALITY to DuckDB #16581

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 63 commits into
base: main
Choose a base branch
from

Conversation

niykko
Copy link
Contributor
@niykko niykko commented Mar 10, 2025

This is a continuation of #9014

I am resuming work on this function. If I understand correctly, you still want me to implement this feature by internally producing a ROW_NUMBER() window function instead of producing the column manually.

My first step is to take a look again at how these operators work and then try to find a solution.

@Mytherin
Copy link
Collaborator

Thanks for the PR!

I think there are two distinct steps that need to happen to have WITH ORDINALITY work:

  • The scanner table functions (e.g. read_parquet, etc) that do not support lateral joins can support WITH ORDINALITY by pushing a ROW_NUMBER() OVER ()
  • The table in-out functions that do support lateral joins need to explicitly handle WITH ORDINALITY, and compute the ordinality column during execution (ideally in PhysicalTableInOutFunction).

Ideally this is generated during the bind phase already - i.e. when binding a scanner table function that has ordinality specified, the LogicalWindow with row_number() over () is generated. The optimizer can then (potentially) remove this again or do something more clever with this.

@duckdb-draftbot duckdb-draftbot marked this pull request as draft March 11, 2025 14:57
@niykko
Copy link
Contributor Author
niykko commented Mar 13, 2025

I'd like to note that when using any version higher than 24.1.0 for the python module black, the following files get changed when running "make format-fix":

modified: scripts/regression/benchmark.py
modified: tools/pythonpkg/tests/fast/numpy/test_numpy_new_path.py
modified: tools/pythonpkg/tests/fast/udf/test_null_filtering.py
modified: tools/pythonpkg/tests/fast/udf/test_scalar.py

Just so you know, it seems like this package has changed the formatting rules. It now inserts blank lines after certain comments.

…al get is not informed anymore & correct the tests (typos)
@niykko niykko marked this pull request as ready for review May 23, 2025 13:41
@duckdb-draftbot duckdb-draftbot marked this pull request as draft May 23, 2025 21:38
@niykko niykko marked this pull request as ready for review May 26, 2025 10:19
@duckdb-draftbot duckdb-draftbot marked this pull request as draft June 3, 2025 09:53
@niykko niykko marked this pull request as ready for review June 3, 2025 12:09
@niykko
Copy link
Contributor Author
niykko commented Jun 3, 2025

Hey @Mytherin ,
this PR is now ready for review. Please let me know what you think 😃
Please also note that the failing unittest ('Test deserialized plans from file') also fails on the production branch, so I assume my additions are not what causes this error.

@duckdb-draftbot duckdb-draftbot marked this pull request as draft June 4, 2025 11:55
@niykko niykko marked this pull request as ready for review June 4, 2025 11:56
@duckdb-draftbot duckdb-draftbot marked this pull request as draft June 5, 2025 08:45
@niykko niykko marked this pull request as ready for review June 5, 2025 08:46
@duckdb-draftbot duckdb-draftbot marked this pull request as draft June 5, 2025 14:20
@niykko niykko marked this pull request as ready for review June 5, 2025 14:58
@duckdb-draftbot duckdb-draftbot marked this pull request as draft June 6, 2025 09:18
@niykko niykko marked this pull request as ready for review June 6, 2025 09:22
@niykko
Copy link
Contributor Author
niykko commented Jun 6, 2025

In regards to my previous comment: now it is really ready for review 😅
The serialization unittest is not broken anymore and the thread sanitizer has been silenced since we use proper mutex now.

Please let me know your thoughts 🙏

Copy link
Collaborator
@Mytherin Mytherin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! Great that this is being picked up again. Looks good - some comments below.

It would be great to get some more tests going as well:

  • WITH ORDINALITY on larger tables, e.g. you could use the TPC-H data generator and then create tests with those
  • WITH ORDINALITY used in a view definition
  • WITH ORDINALITY on a view (should this even work?)
  • WITH ORDINALITY on a CTE
  • WITH ORDINALITY on a recursive CTE
  • WITH ORDINALITY used within a recursive CTE
  • WITH ORDINALITY where the table already contains a column called "ordinality"
  • Filters on the ORDINALITY column
  • Projecting out all columns except for the ORDINALITY column


namespace duckdb {

enum class Ordinality_request_t : uint8_t { NOT_REQUESTED = 9E88 0, REQUESTED = 1 };
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use consistent naming with the rest of the code base, i.e. call this OrdinalityType

And instead of NOT_REQUESTED - maybe use WITH_ORDINALITY and WITHOUT_ORDINALITY, i.e.:

OrdinalityType::WITH_ORDINALITY


enum class Ordinality_request_t : uint8_t { NOT_REQUESTED = 0, REQUESTED = 1 };

struct ordinality_data_t {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OrdinalityData

@@ -87,6 +96,8 @@ OperatorResultType PhysicalTableInOutFunction::Execute(ExecutionContext &context
state.input_chunk.SetCardinality(1);
state.row_index++;
state.new_row = false;
lock_guard<mutex> guard(gstate.ordinality_lock);
gstate.current_ordinality_idx = 1;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that the other scan state is all located in the local state - I suspect the global state is the wrong place for the current_ordinality_idx. Do we have a test with a larger UNNEST or similar table function that triggers multi-threading?

auto result = function.in_out_function(context, data, input, chunk);
if (function.ordinality_data.ordinality_request == Ordinality_request_t::REQUESTED) {
const idx_t ordinality = chunk.size();
function.ordinality_data.SetOrdinality(chunk, gstate.current_ordinality_idx, ordinality);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are reading gstate.current_ordinality_idx without a lock, whereas we are modifying it below with a lock - one of these is wrong. I suspect we need to move this to the thread-local state anyway however, in which case we don't need a lock at all.

@@ -120,6 +123,12 @@ SourceResultType PhysicalTableScan::GetData(ExecutionContext &context, DataChunk
break;
}

if (function.ordinality_data.ordinality_request == Ordinality_request_t::REQUESTED) {
idx_t ordinality = chunk.size();
function.ordinality_data.SetOrdinality(chunk, g_state.ordinality_idx, ordinality);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For PhysicalTableScan - I think we can instead perform this using a window function. This will currently be incorrect when multi-threading.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not need to re-generate the serialized plans here. If this is necessary something is going wrong with serialization which can impact forward and/or backward compatibility. Can you revert this change and instead fix what caused this to be necessary?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here - we should not be re-generating the storage version

PRAGMA enable_verification

query II
SELECT o,range FROM range(1) WITH ORDINALITY AS _(range,o);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we unify some of these test files? We are adding a lot of test files that contain almost no tests in them. Perhaps we can have two files (one for table in-out functions, one for "standard" table functions like read_csv, read_parquet and table scans).

query III
SELECT * FROM read_csv('test/sql/ordinality/a1.csv') WITH ORDINALITY ORDER BY Numbers,Chars,ordinality;
----
7800 values hashing to d86e8a5e8df33cec28e0ed7b6c386419
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general we prefer to avoid using the hash comparison feature - we can verify two results are the same using the labeled results, we can use that to verify results instead, e.g.:

query III nosort read_csv_result
SELECT * FROM read_csv('test/sql/ordinality/a1.csv') WITH ORDINALITY ORDER BY Numbers,Chars,ordinality;

query III nosort read_csv_result
SELECT *, row_number() OVER () AS ordinality FROM read_csv('test/sql/ordinality/a1.csv') ORDER BY Numbers,Chars,ordinality;



query III
SELECT * FROM test AS t, LATERAL range(t.a) WITH ORDINALITY AS _(range,ordinality) ORDER BY range;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we alias ordinality to something else and explicitly select it to ensure the aliasing works? e.g. SELECT my_ord FROM range(t.a) _(range, my_ordinality)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants
0