This extension, nanoarrow, allows you to read Arrow IPC streams and files. It serves a similar purpose to the now-deprecated Arrow DuckDB core extension, but it adds the ability to query Arrow IPC files and is much better tested. This extension is released as a DuckDB Community Extension.
For compatibility with the previous Arrow core extension, this extension is also aliased as arrow.
You can install and load it as:
-- arrow would also be a suitable name
INSTALL nanoarrow FROM community;
LOAD nanoarrow;
Below is a complete example of how to use our extension to read an Arrow IPC file. In addition to our extension, you will also need the httpfs extension installed and loaded to fetch the data directly from GitHub.
LOAD httpfs;
LOAD nanoarrow;
SELECT
commit, message
FROM
'https://github.com/apache/arrow-experiments/raw/refs/heads/main/data/arrow-commits/arrow-commits.arrows'
LIMIT 10;
┌───────────────────────────┬───────────────────────────────────────────────────────────────────────────────────────────┐
│ commit │ message │
│ varchar │ varchar │
├───────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────┤
│ 49cdb0fe4e98fda19031c86… │ GH-40370: [C++] Define ARROW_FORCE_INLINE for non-MSVC builds (#40372) │
│ 1d966e98e41ce817d1f8c51… │ GH-40386: [Python] Fix except clauses (#40387) │
│ 96f26a89bd73997f7532643… │ GH-40227: [R] ensure executable files in `create_package_with_all_dependencies` (#40232) │
│ ee1a8c39a55f3543a82fed9… │ GH-40366: [C++] Remove const qualifier from Buffer::mutable_span_as (#40367) │
│ 3d467ac7bfae03cf2db0980… │ GH-20127: [Python][CI] Remove legacy hdfs tests from hdfs and hypothesis setup (#40363) │
│ ef6ea6beed071ed070daf03… │ GH-40345: [FlightRPC][C++][Java][Go] Add URI scheme to reuse connection (#40084) │
│ 53e0c745ad491af98a5bf18… │ GH-40153: [C++][Python] Fix test_gdb failures on 32-bit (#40293) │
│ 3ba6d286caad328b8572a3b… │ GH-40059: [C++][Python] Basic conversion of RecordBatch to Arrow Tensor (#40064) │
│ 4ce9a5edd2710fb8bf0c642… │ GH-40153: [Python] Make `Tensor.__getbuffer__` work on 32-bit platforms (#40294) │
│ 2445975162905bd8d9a42ff… │ GH-40334: [C++][Gandiva] Add missing OpenSSL dependency to encrypt_utils_test.cc (#40338) │
├───────────────────────────┴───────────────────────────────────────────────────────────────────────────────────────────┤
│ 10 rows 2 columns │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
In the remainder of this section, we cover the supported parameters and usages for our IPC readers/writers.
Writing an Arrow IPC file is done using the COPY statement. Below is a simple example of how you can use DuckDB to create such a file.
COPY (SELECT 42 as foofy, 'string' as stringy) TO "test.arrows";
Both .arrows and .arrow will be automatically recognized by DuckDB as Arrow IPC streams. However, if you wish to use a different file extension, you can manually specify the format using:
COPY (SELECT 42 as foofy, 'string' as stringy) TO "test.ipc" (FORMAT ARROWS);
The COPY TO function for Arrow files accepts the following parameters (an example follows below):
row_group_size: The size of a row group. The default value is 122,880. A lower value may reduce performance but can be beneficial for streaming. Note that this value is not exact; a slightly higher value divisible by 2,048 (DuckDB's standard vector size) may be used as the actual row group size.
chunk_size: An alias for the row_group_size parameter.
row_group_size_bytes: The size of row groups in bytes.
row_groups_per_file: The maximum number of row groups per file. If this option is set, multiple files can be generated in a single COPY call. In that case, the specified path will create a directory, and the row_group_size parameter will also be used to determine the partition sizes.
kv_metadata: Key-value metadata to be added to the file schema.
If row_group_size_bytes and either chunk_size or row_group_size are used, the row groups will be defined by the smallest of these parameters.
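For example, here is a hedged sketch of passing one of these options (the option syntax mirrors the other COPY parameters shown above, and the chosen row group size of 10,000 is arbitrary):
COPY (SELECT i AS id, i % 10 AS bucket FROM range(100000) t(i))
TO 'test_row_groups.arrows'
(FORMAT ARROWS, ROW_GROUP_SIZE 10000);
Because 10,000 is not divisible by 2,048, the actual row groups may end up slightly larger, as described above.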
You can consume the file using the read_arrow scanner. For example, to read the file we just created, you could run:
FROM read_arrow('test.arrows');
Similar to the COPY function, the extension also registers .arrows and .arrow as valid extensions for the Arrow IPC format. This means that a replacement scan can be applied if that is the file extension, so the following would also be a valid query:
FROM 'test.arrows';
Besides single-file reading, our extension also fully supports multi-file reading, including all valid multi-file options.
If we were to create a second test file using:
COPY (SELECT 42 as foofy, 'string' as stringy) TO "test_2.arrows" (FORMAT ARROWS);
We can then run a query that reads both files using a glob pattern or a list of file paths:
-- Glob
FROM read_arrow('*.arrows');
-- List
FROM read_arrow(['test.arrows', 'test_2.arrows']);
When reading multiple files, the following parameters are also supported (an example follows the list):
union_by_name: If the schemas of the files differ, setting union_by_name allows DuckDB to construct the schema by aligning columns with the same name.
filename: If set to true, this will add a column with the name of the file that generated each row.
hive_partitioning: Enables reading data from a Hive-partitioned dataset and applies partition filtering.
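For instance, assuming these options are passed as named parameters to read_arrow in the same way as DuckDB's other multi-file readers (a sketch, not a verified listing of the extension's signature), the two test files above can be read while keeping track of which file each row came from:
SELECT *
FROM read_arrow('*.arrows', filename = true, union_by_name = true);
Here union_by_name is not strictly needed, since both test files share the same schema; it only matters when the schemas differ.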
Note
Arrow IPC files (.arrow) and Arrow IPC streams (.arrows) are distinct but related formats. This extension can read both but only writes Arrow IPC Streams.
Similar to the old core Arrow extension, this extension also allows direct production and consumption of the Arrow IPC streaming format from in-memory buffers in both Python and Node.js. In this section, we will demonstrate how to use the Python API, but you can find many tests that serve as examples for both Node.js and Python.
Our extension can create Arrow IPC buffers using the to_arrow_ipc function. This function returns two columns: one containing the serialized data as a BLOB, and another BOOL column indicating which tuples contain the header information of the messages. For example, consider the following table in our DuckDB database:
import pyarrow as pa
import duckdb
import pyarrow.ipc as ipc
connection = duckdb.connect()
connection.execute("CREATE TABLE T (f0 integer, f1 varchar, f2 bool )")
connection.execute("INSERT INTO T values (1, 'foo', true),(2, 'bar', NULL), (3, 'baz', false), (4, NULL, true) ")
We can then obtain our buffers by simply issuing a to_arrow_ipc call, like this:
buffers = connection.execute("FROM to_arrow_ipc((FROM T))").fetchall()
In this case, our buffers will contain two tuples: the first is the header of our message, and the second is the data. To convert this into an Arrow table, we simply concatenate the tuples and use the ipc.RecordBatchStreamReader. For example, you can read them as follows:
batches = []
with pa.BufferReader(pa.py_buffer(buffers[0][0] + buffers[1][0])) as reader:
    stream_reader = ipc.RecordBatchStreamReader(reader)
    schema = stream_reader.schema
    batches.extend(stream_reader)
arrow_table = pa.Table.from_batches(batches, schema=schema)
To read buffers with DuckDB, you must use the Python function from_arrow. Continuing from our example, we would first need to convert our Arrow table into the Arrow IPC format.
batch = arrow_table.to_batches()[0]
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, batch.schema) as writer:
    writer.write_batch(batch)
buffer = sink.getvalue()
buf_reader = pa.BufferReader(buffer)
msg_reader = ipc.MessageReader.open_stream(buf_reader)
After this, the following query will return a DuckDB relation with the deserialized Arrow IPC:
connection.from_arrow(msg_reader)
To build the extension, clone the repository with submodules:
git clone --recurse-submodules https://github.com/paleolimbot/duckdb-nanoarrow.git
...or, if you forgot to clone the submodules or you're using VSCode to do your checkout, you can run:
git submodule init
git submodule update --checkout
A quick-and-dirty way to get your build up and running is to run make:
make
The main binaries that will be built are:
./build/release/duckdb
./build/release/test/unittest
./build/release/extension/nanoarrow/nanoarrow.duckdb_extension
duckdb is the binary for the DuckDB shell with the extension code automatically loaded.
unittest is the test runner of DuckDB. Again, the extension is already linked into the binary.
nanoarrow.duckdb_extension is the loadable binary as it would be distributed (an example of loading it follows the list).
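As a sketch (the path matches the make build above, and loading a locally built, unsigned extension typically requires starting DuckDB with unsigned extensions allowed, e.g. duckdb -unsigned), the distributable binary can be loaded into a separately installed duckdb shell with:
LOAD './build/release/extension/nanoarrow/nanoarrow.duckdb_extension';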
If you'd like to use VSCode with the integration provided by the CMake/clangd extension, you can run:
cp CMakeUserPresets.json duckdb/
...and ensure that .vscode/settings.json contains:
{
"cmake.sourceDirectory": "${workspaceFolder}/duckdb"
}
Then choose Developer: Reload window from the command palette and choose the Extension (Debug build) preset.
To run the extension code, simply start the shell with ./build/release/duckdb (if you're using make to build) or ./build/duckdb (if you're using CMake via VSCode).
Now we can use the features from the extension directly in DuckDB.
Different tests can be created for DuckDB extensions. Tests are written in SQL and live in ./test/sql. These SQL tests can be run using make test (if using make) or ./test_local.sh (if using CMake via VSCode).
You can debug an interactive SQL session by launching it with gdb or lldb:
lldb build/duckdb
...or you can use the CodeLLDB extension (Command Palette: LLDB: Attach to process) to attach VSCode's interactive debugger to a shell launched in a terminal.