Generalize `rowid` into the concept of virtual columns, and make `filename` a virtual column in the Parquet/CSV/JSON readers #16248
Merged
… one of these virtual columns
…ng to move it into the bind
…d add union by name test
Antonov548 added a commit to Antonov548/duckdb-r that referenced this pull request (Feb 27, 2025): Generalize `rowid` into the concept of virtual columns, and make `filename` a virtual column in the Parquet/CSV/JSON readers (duckdb/duckdb#16248) Include extension_util.hpp in libduckdb (duckdb/duckdb#16255)
krlmlr pushed a commit to duckdb/duckdb-r that referenced this pull request (Mar 3, 2025): Generalize `rowid` into the concept of virtual columns, and make `filename` a virtual column in the Parquet/CSV/JSON readers (duckdb/duckdb#16248) Include extension_util.hpp in libduckdb (duckdb/duckdb#16255)
Antonov548 added a commit to Antonov548/duckdb-r that referenced this pull request (Mar 4, 2025): Generalize `rowid` into the concept of virtual columns, and make `filename` a virtual column in the Parquet/CSV/JSON readers (duckdb/duckdb#16248)
krlmlr pushed a commit to duckdb/duckdb-r that referenced this pull request (Mar 5, 2025): Generalize `rowid` into the concept of virtual columns, and make `filename` a virtual column in the Parquet/CSV/JSON readers (duckdb/duckdb#16248) Include extension_util.hpp in libduckdb (duckdb/duckdb#16255)
krlmlr pushed a commit to duckdb/duckdb-r that referenced this pull request (Mar 5, 2025): Generalize `rowid` into the concept of virtual columns, and make `filename` a virtual column in the Parquet/CSV/JSON readers (duckdb/duckdb#16248)
This was referenced Mar 17, 2025
Mytherin added a commit that referenced this pull request (Apr 4, 2025):

… columns in the MultiFileReader (#16979)

Follow-up from #16248

This PR reworks the `file_row_number` to be a virtual column in the Parquet reader, so the following query now works:

```sql
SELECT l_orderkey, file_row_number FROM lineitem.parquet;
```

This PR also implements the necessary infrastructure for allowing arbitrary virtual columns to be defined by readers, so in the future adding new virtual columns to readers will be much simpler.

This rework allows for the removal of a bunch of hacky special-case code around the `file_row_number` column - this can now all live in the Parquet reader itself. Emitting the file row number is as simple as adding the special code (`MultiFileReader::COLUMN_IDENTIFIER_FILE_ROW_NUMBER`) to the set of projected column ids.
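
To make the "special code in the set of projected column ids" idea concrete, here is a minimal, self-contained sketch. It is not DuckDB's reader code: `FileSketch`, `ScanFile`, and `SKETCH_FILE_ROW_NUMBER` (including its numeric value) are hypothetical stand-ins, and the sketch assumes row numbers are counted from zero within the file.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

using column_id_t = std::uint64_t;

// Hypothetical stand-in for MultiFileReader::COLUMN_IDENTIFIER_FILE_ROW_NUMBER.
// The value is illustrative: it only needs to lie outside the physical column range.
constexpr column_id_t SKETCH_FILE_ROW_NUMBER = (std::uint64_t(1) << 63) + 3;

// A toy "file": every row holds one string value per physical column.
struct FileSketch {
	std::vector<std::vector<std::string>> rows;
};

// Scan the file, producing one output value per projected column id.
// Physical ids index into the row; the special id yields the row's position in the file.
std::vector<std::vector<std::string>> ScanFile(const FileSketch &file,
                                               const std::vector<column_id_t> &projected_ids) {
	std::vector<std::vector<std::string>> output;
	for (std::size_t row_idx = 0; row_idx < file.rows.size(); row_idx++) {
		std::vector<std::string> out_row;
		for (column_id_t id : projected_ids) {
			if (id == SKETCH_FILE_ROW_NUMBER) {
				// Virtual column: computed by the reader, never stored in the file.
				out_row.push_back(std::to_string(row_idx));
			} else {
				assert(id < file.rows[row_idx].size());
				out_row.push_back(file.rows[row_idx][id]);
			}
		}
		output.push_back(out_row);
	}
	return output;
}
```

In this model, projecting `{0, SKETCH_FILE_ROW_NUMBER}` plays the role of `SELECT l_orderkey, file_row_number`: the caller needs no separate option, only an extra id in the projection list.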
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request (May 15, 2025): Generalize `rowid` into the concept of virtual columns, and make `filename` a virtual column in the Parquet/CSV/JSON readers (duckdb/duckdb#16248) Include extension_util.hpp in libduckdb (duckdb/duckdb#16255)
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request (May 15, 2025): Generalize `rowid` into the concept of virtual columns, and make `filename` a virtual column in the Parquet/CSV/JSON readers (duckdb/duckdb#16248) Include extension_util.hpp in libduckdb (duckdb/duckdb#16255)
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request (May 17, 2025): Generalize `rowid` into the concept of virtual columns, and make `filename` a virtual column in the Parquet/CSV/JSON readers (duckdb/duckdb#16248) Include extension_util.hpp in libduckdb (duckdb/duckdb#16255)
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request (May 18, 2025): Generalize `rowid` into the concept of virtual columns, and make `filename` a virtual column in the Parquet/CSV/JSON readers (duckdb/duckdb#16248) Include extension_util.hpp in libduckdb (duckdb/duckdb#16255)
Mytherin added a commit that referenced this pull request (May 19, 2025): This PR fixes duckdblabs/duckdb-internal#4680. This issue arose because of #16248
This PR generalizes the `rowid` column to the broader concept of virtual columns. Previously, `rowid` was a special column that existed within DuckDB tables. This column was special in the sense that it could be queried, but would not show up as an actual column of the table when inspecting the table or when running `SELECT * FROM tbl`.

The `rowid` column is also used internally for several purposes, such as handling deletes/updates of rows, or performing late materialization.

In the current implementation, only the `rowid` virtual column exists, and only tables can implement this virtual column. Some previous work was done on making the `rowid` column type more flexible (e.g. #14674), but it remained the only virtual column and was hard-coded in the system.

This PR generalizes the concept of virtual columns by adding a new callback to the `TableFunction` that allows table functions to return the set of virtual columns that they support. As an example, to support the `rowid` column as before, a table function returns it from this callback.

However, we can now support multiple virtual columns and can fully customize the virtual columns that we emit. This PR also introduces two new virtual columns:

- `COLUMN_IDENTIFIER_FILENAME`, emitted by the Parquet/CSV/JSON readers
- `COLUMN_IDENTIFIER_EMPTY`: this is a non-queryable virtual column. When this column is present, it is added when running `COUNT(*)` over the table function.

Filename

With the addition of the `filename` virtual column, we can query `filename` directly from the Parquet/CSV/JSON readers. Previously, we could emit the filename only by using the `filename` option; this is no longer required.

This PR primarily adds support for virtual columns; we're planning to extend this support further in the future.
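
The behavior described above can be modeled in a small, self-contained sketch. This is not DuckDB's API: `TableFunctionSketch`, `VirtualColumnMap`, `BindColumn`, `ExpandStar`, `ColumnForCountStar`, and the `SKETCH_*` identifiers (with their numeric values) are all hypothetical names chosen for illustration. The sketch also assumes that a regular column takes precedence over a virtual column of the same name, which the description does not state.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-ins for illustration; these are NOT DuckDB's real definitions.
using column_id_t = std::uint64_t;

// Assumption: virtual columns get identifiers far above any physical column index.
constexpr column_id_t SKETCH_VIRTUAL_START = std::uint64_t(1) << 63;
constexpr column_id_t SKETCH_ROW_ID = SKETCH_VIRTUAL_START;
constexpr column_id_t SKETCH_FILENAME = SKETCH_VIRTUAL_START + 1;
constexpr column_id_t SKETCH_EMPTY = SKETCH_VIRTUAL_START + 2;

struct VirtualColumn {
	std::string name; // an empty name marks the column as non-queryable
	std::string type;
};
using VirtualColumnMap = std::unordered_map<column_id_t, VirtualColumn>;

struct TableFunctionSketch {
	std::vector<std::string> columns;                       // regular columns
	std::function<VirtualColumnMap()> get_virtual_columns;  // optional callback
};

// A table scan that exposes only `rowid`, mirroring the pre-existing behavior.
TableFunctionSketch MakeTableScan() {
	TableFunctionSketch fn;
	fn.columns = {"i"};
	fn.get_virtual_columns = []() {
		VirtualColumnMap result;
		result[SKETCH_ROW_ID] = VirtualColumn{"rowid", "BIGINT"};
		return result;
	};
	return fn;
}

// A file reader that exposes `filename` plus the non-queryable empty column.
TableFunctionSketch MakeFileReader() {
	TableFunctionSketch fn;
	fn.columns = {"a", "b"};
	fn.get_virtual_columns = []() {
		VirtualColumnMap result;
		result[SKETCH_FILENAME] = VirtualColumn{"filename", "VARCHAR"};
		result[SKETCH_EMPTY] = VirtualColumn{"", "BOOLEAN"};
		return result;
	};
	return fn;
}

// Resolve a name: regular columns first, then queryable virtual columns.
bool BindColumn(const TableFunctionSketch &fn, const std::string &name, column_id_t &id) {
	for (std::size_t i = 0; i < fn.columns.size(); i++) {
		if (fn.columns[i] == name) {
			id = i;
			return true;
		}
	}
	if (!fn.get_virtual_columns) {
		return false;
	}
	for (const auto &entry : fn.get_virtual_columns()) {
		if (!entry.second.name.empty() && entry.second.name == name) {
			id = entry.first;
			return true;
		}
	}
	return false;
}

// SELECT * expands to the regular columns only; virtual columns stay hidden.
std::vector<std::string> ExpandStar(const TableFunctionSketch &fn) {
	return fn.columns;
}

// COUNT(*) needs no real data: project the empty virtual column when it is offered,
// otherwise fall back to the first regular column.
column_id_t ColumnForCountStar(const TableFunctionSketch &fn) {
	if (fn.get_virtual_columns && fn.get_virtual_columns().count(SKETCH_EMPTY) > 0) {
		return SKETCH_EMPTY;
	}
	assert(!fn.columns.empty());
	return 0;
}
```

With these definitions, binding `filename` against `MakeFileReader()` resolves to `SKETCH_FILENAME` while `ExpandStar` still returns only `a` and `b`, and binding `filename` against `MakeTableScan()` fails because that function only reports `rowid`. Keeping the set of virtual columns behind a callback is what lets each table function decide which ones it offers.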