Generalize `rowid` into the concept of virtual columns, and make `filename` a virtual column in the Parquet/CSV/JSON readers by Mytherin · Pull Request #16248 · duckdb/duckdb · GitHub

Generalize rowid into the concept of virtual columns, and make filename a virtual column in the Parquet/CSV/JSON readers #16248


Merged: 10 commits merged into duckdb:main on Feb 17, 2025


Mytherin
Collaborator

This PR generalizes the rowid column to the broader concept of virtual columns. Previously, rowid was a special column that existed within DuckDB tables. This column was special in the sense that it could be queried, but would not show up as an actual column of the table when inspecting the table or when running SELECT * FROM tbl:

D create table tbl(i int);
D insert into tbl values (42);
D select * from tbl;
┌───────┐
│   i   │
│ int32 │
├───────┤
│  42   │
└───────┘
D describe tbl;
┌─────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │ column_type │  null   │   key   │ default │  extra  │
│   varchar   │   varchar   │ varchar │ varchar │ varchar │ varchar │
├─────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ i           │ INTEGER     │ YES     │ NULL    │ NULL    │ NULL    │
└─────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┘
-- but we can query the column if explicitly mentioned
D select rowid, i from tbl;
┌───────┬───────┐
│ rowid │   i   │
│ int64 │ int32 │
├───────┼───────┤
│   0   │  42   │
└───────┴───────┘

The rowid column is also used internally for several purposes, such as handling deletes/updates of rows, or performing late materialization.

Before this PR, rowid was the only virtual column, and only tables could provide it. Some earlier work (e.g. #14674) made the type of the rowid column more flexible, but rowid remained the only virtual column and was hard-coded in the system.

This PR generalizes the concept of virtual columns by adding a new callback to TableFunction that allows table functions to return the set of virtual columns they support:

struct TableColumn {
	string name;
	LogicalType type;
};

using virtual_column_map_t = unordered_map<column_t, TableColumn>;

typedef virtual_column_map_t (*table_function_get_virtual_columns_t)(ClientContext &context,
                                                                     optional_ptr<FunctionData> bind_data);

As an example, if we wanted to support the rowid column as before we would set it up like so:

virtual_column_map_t virtual_columns;
virtual_columns.insert(make_pair(COLUMN_IDENTIFIER_ROW_ID, TableColumn("rowid", LogicalType::ROW_TYPE)));
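To make the callback shape more concrete, here is a minimal, self-contained sketch of a function that advertises more than one virtual column. It does not compile against DuckDB: `column_t`, `LogicalType`, `TableColumn` and the identifier constants are simplified stand-ins defined inside the snippet, the constant values are placeholders, and `ExampleGetVirtualColumns` is a hypothetical name. The `ClientContext` and bind data parameters of the real signature are left out, and the combination of `rowid` and `filename` in one function is chosen purely for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>

// Simplified stand-ins for DuckDB types, defined here so the sketch compiles on its own.
using column_t = std::uint64_t;
enum class LogicalType { BIGINT, VARCHAR };

struct TableColumn {
	TableColumn(std::string name_p, LogicalType type_p) : name(std::move(name_p)), type(type_p) {
	}
	std::string name;
	LogicalType type;
};

using virtual_column_map_t = std::unordered_map<column_t, TableColumn>;

// Placeholder identifier values (assumption): picked only so that they cannot
// collide with regular column indexes, which count up from 0.
constexpr column_t COLUMN_IDENTIFIER_ROW_ID = UINT64_MAX;
constexpr column_t COLUMN_IDENTIFIER_FILENAME = UINT64_MAX - 1;

// Hypothetical callback body: advertise every virtual column this function can emit.
virtual_column_map_t ExampleGetVirtualColumns() {
	virtual_column_map_t virtual_columns;
	virtual_columns.insert(std::make_pair(COLUMN_IDENTIFIER_ROW_ID, TableColumn("rowid", LogicalType::BIGINT)));
	virtual_columns.insert(std::make_pair(COLUMN_IDENTIFIER_FILENAME, TableColumn("filename", LogicalType::VARCHAR)));
	// Each identifier is a map key, so it can be registered only once.
	assert(virtual_columns.size() == 2);
	return virtual_columns;
}
```

Keying the map by identifier rather than by name mirrors the `virtual_column_map_t` alias above: the identifier is what ends up in the projected column ids, while the name is only needed when the column is referenced in a query.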

However, we can now support multiple virtual columns and can fully customize the virtual columns that we emit. This PR also introduces two new virtual columns:

  • COLUMN_IDENTIFIER_FILENAME, emitted by the Parquet/CSV/JSON readers
  • COLUMN_IDENTIFIER_EMPTY, a non-queryable virtual column. When a table function provides this column, it is added to the projected columns when running COUNT(*) over that table function.
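The behaviour described so far (virtual columns are hidden from SELECT *, can be named explicitly, and may be non-queryable) can be sketched with a toy model. This is not DuckDB's binder: `ToyTable`, `VirtualColumn`, `ExpandStar` and `ResolveColumn` are hypothetical names, the `queryable` flag is an assumption used to model a column such as COLUMN_IDENTIFIER_EMPTY, and giving regular columns priority over virtual ones is likewise an assumption of the sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

using column_t = std::uint64_t;

// Toy description of a virtual column. `queryable` is an assumption of this sketch.
struct VirtualColumn {
	std::string name;
	bool queryable;
};

struct ToyTable {
	// Regular columns, indexed from 0.
	std::vector<std::string> columns;
	// Virtual columns, keyed by a special identifier outside the regular index range.
	std::unordered_map<column_t, VirtualColumn> virtual_columns;
};

// SELECT * expands to the regular columns only; virtual columns never appear here.
std::vector<column_t> ExpandStar(const ToyTable &table) {
	std::vector<column_t> ids;
	for (column_t i = 0; i < table.columns.size(); i++) {
		ids.push_back(i);
	}
	return ids;
}

// Resolve an explicitly named column. Regular columns are checked first,
// then queryable virtual columns. Returns false if nothing matches.
bool ResolveColumn(const ToyTable &table, const std::string &name, column_t &result) {
	assert(!name.empty());
	for (column_t i = 0; i < table.columns.size(); i++) {
		if (table.columns[i] == name) {
			result = i;
			return true;
		}
	}
	for (const auto &entry : table.virtual_columns) {
		if (entry.second.queryable && entry.second.name == name) {
			result = entry.first;
			return true;
		}
	}
	return false;
}
```

With a table that has one regular column `i` and a queryable virtual column `rowid`, `ExpandStar` yields only the id of `i`, while `ResolveColumn(table, "rowid", id)` succeeds and returns the special identifier.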

Filename

With the addition of the filename virtual column, the following snippet now works:

D copy (select 42 i) to tbl.parquet;
D select * from tbl.parquet;
┌───────┐
│   i   │
│ int32 │
├───────┤
│  42   │
└───────┘
D select *, filename from tbl.parquet;
┌───────┬─────────────┐
│   i   │  filename   │
│ int32 │   varchar   │
├───────┼─────────────┤
│  42   │ tbl.parquet │
└───────┴─────────────┘

Previously, we could emit the filename only by using the filename option. This is now no longer required - we can directly query the filename if desired.

This PR is primarily adding support for virtual columns - we're planning to extend this support further in the future.

@Mytherin Mytherin added the Needs Documentation Use for issues or PRs that require changes in the documentation label Feb 14, 2025
@Mytherin Mytherin merged commit 7ab1893 into duckdb:main Feb 17, 2025
49 checks passed
Antonov548 added a commit to Antonov548/duckdb-r that referenced this pull request Feb 27, 2025
Generalize `rowid` into the concept of virtual columns, and make `filename` a virtual column in the Parquet/CSV/JSON readers (duckdb/duckdb#16248)
Include extension_util.hpp in libduckdb (duckdb/duckdb#16255)
krlmlr pushed a commit to duckdb/duckdb-r that referenced this pull request Mar 3, 2025
Generalize `rowid` into the concept of virtual columns, and make `filename` a virtual column in the Parquet/CSV/JSON readers (duckdb/duckdb#16248)
Include extension_util.hpp in libduckdb (duckdb/duckdb#16255)
Antonov548 added a commit to Antonov548/duckdb-r that referenced this pull request Mar 4, 2025
Generalize `rowid` into the concept of virtual columns, and make `filename` a virtual column in the Parquet/CSV/JSON readers (duckdb/duckdb#16248)
krlmlr pushed a commit to duckdb/duckdb-r that referenced this pull request Mar 5, 2025
Generalize `rowid` into the concept of virtual columns, and make `filename` a virtual column in the Parquet/CSV/JSON readers (duckdb/duckdb#16248)
Include extension_util.hpp in libduckdb (duckdb/duckdb#16255)
krlmlr pushed a commit to duckdb/duckdb-r that referenced this pull request Mar 5, 2025
Generalize `rowid` into the concept of virtual columns, and make `filename` a virtual column in the Parquet/CSV/JSON readers (duckdb/duckdb#16248)
@Mytherin Mytherin deleted the virtualcolumns branch April 2, 2025 09:24
Mytherin added a commit that referenced this pull request Apr 4, 2025
… columns in the MultiFileReader (#16979)

Follow-up from #16248

This PR reworks `file_row_number` to be a virtual column in the
Parquet reader, so the following query now works:

```sql
SELECT l_orderkey, file_row_number FROM lineitem.parquet;
```

This PR also implements the necessary infrastructure for allowing
arbitrary virtual columns to be defined by readers, so in the future
adding new virtual columns to readers will be much simpler.

This rework allows for the removal of a bunch of hacky special-case code
around the `file_row_number` column - this can now all live in the
Parquet reader itself. Emitting the file row number is as simple as
adding the special code
(`MultiFileReader::COLUMN_IDENTIFIER_FILE_ROW_NUMBER`) to the set of
projected column ids.
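
The idea of emitting a virtual column by adding its identifier to the projected column ids can be illustrated with a toy scan. This is a sketch under stated assumptions, not the Parquet reader's code: `ToyFile` and `ToyScan` are hypothetical, the identifier value is a placeholder, and the row number is zero-based here by choice of the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

using column_t = std::uint64_t;

// Placeholder identifier (assumption): any value that cannot be a regular column index.
constexpr column_t COLUMN_IDENTIFIER_FILE_ROW_NUMBER = UINT64_MAX - 2;

// A toy "file": each inner vector holds the regular column values of one row.
using ToyFile = std::vector<std::vector<std::string>>;

// Produce the projected columns for every row. Regular ids index into the stored row;
// the virtual id is computed by the reader from the row's position in the file.
std::vector<std::vector<std::string>> ToyScan(const ToyFile &file, const std::vector<column_t> &projected_ids) {
	std::vector<std::vector<std::string>> result;
	for (std::size_t row_idx = 0; row_idx < file.size(); row_idx++) {
		std::vector<std::string> out_row;
		for (column_t id : projected_ids) {
			if (id == COLUMN_IDENTIFIER_FILE_ROW_NUMBER) {
				// Virtual column: nothing is stored for it, the value is derived on the fly.
				out_row.push_back(std::to_string(row_idx));
			} else {
				assert(id < file[row_idx].size());
				out_row.push_back(file[row_idx][id]);
			}
		}
		result.push_back(out_row);
	}
	return result;
}
```

Calling `ToyScan(file, {0})` returns only the stored column, while `ToyScan(file, {0, COLUMN_IDENTIFIER_FILE_ROW_NUMBER})` appends the row position to each output row without any change to the stored data.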
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 15, 2025
Generalize `rowid` into the concept of virtual columns, and make `filename` a virtual column in the Parquet/CSV/JSON readers (duckdb/duckdb#16248)
Include extension_util.hpp in libduckdb (duckdb/duckdb#16255)
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 15, 2025
Generalize `rowid` into the concept of virtual columns, and make `filename` a virtual column in the Parquet/CSV/JSON readers (duckdb/duckdb#16248)
Include extension_util.hpp in libduckdb (duckdb/duckdb#16255)
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 17, 2025
Generalize `rowid` into the concept of virtual columns, and make `filename` a virtual column in the Parquet/CSV/JSON readers (duckdb/duckdb#16248)
Include extension_util.hpp in libduckdb (duckdb/duckdb#16255)
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 18, 2025
Generalize `rowid` into the concept of virtual columns, and make `filename` a virtual column in the Parquet/CSV/JSON readers (duckdb/duckdb#16248)
Include extension_util.hpp in libduckdb (duckdb/duckdb#16255)
Mytherin added a commit that referenced this pull request May 19, 2025