Description
We recently added a `dataframe.dtype_backend` config option for specifying whether classic numpy-backed dtypes (e.g. `int64`, `float64`, etc.) or pyarrow-backed dtypes (e.g. `int64[pyarrow]`, `float64[pyarrow]`, etc.) should be used in a dask DataFrame. However, today `dataframe.dtype_backend` is only used in `dd.read_parquet`. To extend where pyarrow dtypes can be used, and arguably for a more intuitive UX (xref #9840), I think we want `dataframe.dtype_backend` to work with all methods for creating a dask DataFrame. That is, things like the following
```python
import dask
import dask.dataframe as dd

# Tell dask to use `pyarrow`-backed dtypes
dask.config.set({"dataframe.dtype_backend": "pyarrow"})

df = dd.read_csv(...)
df = dd.from_pandas(...)
df = dd.from_delayed(...)
...
```
should all return dask DataFrames backed by `pyarrow` dtypes.
Some methods, like `read_parquet`, will want to have a specialized implementation for this. However, in cases where a specialized method isn't implemented, we should still automatically cast the dask DataFrame to use pyarrow dtypes when `dataframe.dtype_backend = "pyarrow"`, for example through an (optional) `map_partitions` call after our existing DataFrame creation logic.
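
For illustration, here's a minimal sketch of what that fallback cast could look like. It assumes pandas' `convert_dtypes(dtype_backend="pyarrow")` is available (pandas >= 2.0), and the `_maybe_cast_to_pyarrow` helper name is hypothetical, not an existing dask function:

```python
import dask

def _maybe_cast_to_pyarrow(ddf):
    # Hypothetical fallback: if the user asked for pyarrow-backed dtypes,
    # cast each pandas partition after the normal creation logic has run.
    if dask.config.get("dataframe.dtype_backend", default=None) == "pyarrow":
        return ddf.map_partitions(
            lambda part: part.convert_dtypes(dtype_backend="pyarrow")
        )
    # Otherwise, return the DataFrame unchanged (numpy-backed dtypes).
    return ddf
```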