Description
We recently added a `dataframe.dtype_backend` config option for specifying whether classic numpy-backed dtypes (e.g. `int64`, `float64`, etc.) or pyarrow-backed dtypes (e.g. `int64[pyarrow]`, `float64[pyarrow]`, etc.) should be used in a dask DataFrame. However, today `dataframe.dtype_backend` is only used in `dd.read_parquet`. To extend where pyarrow dtypes can be used, and arguably for a more intuitive UX (xref #9840), I think we want `dataframe.dtype_backend` to work with all methods for creating a dask DataFrame. That is, things like the following
```python
import dask
import dask.dataframe as dd

# Tell dask to use `pyarrow`-backed dtypes
dask.config.set({"dataframe.dtype_backend": "pyarrow"})

df = dd.read_csv(...)
df = dd.from_pandas(...)
df = dd.from_delayed(...)
...
```
should all return dask DataFrames backed by `pyarrow` dtypes.
Some methods, like `read_parquet`, will want to have a specialized implementation for this. However, in cases where a specialized method isn't implemented, we should still automatically cast the dask DataFrame to use pyarrow dtypes when `dataframe.dtype_backend = "pyarrow"`, for example through an (optional) `map_partitions` call after our existing DataFrame creation logic.
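
For illustration, here's a minimal sketch of what that fallback cast could look like. It assumes pandas' `convert_dtypes(dtype_backend="pyarrow")` is available (pandas >= 2.0), and the `_maybe_cast_to_pyarrow` helper name is hypothetical, not an existing dask function:

```python
import dask

def _maybe_cast_to_pyarrow(ddf):
    # Hypothetical fallback: if the user asked for pyarrow-backed dtypes,
    # cast each pandas partition after the normal creation logic has run.
    if dask.config.get("dataframe.dtype_backend", default=None) == "pyarrow":
        return ddf.map_partitions(
            lambda part: part.convert_dtypes(dtype_backend="pyarrow")
        )
    # Otherwise, return the DataFrame unchanged (numpy-backed dtypes).
    return ddf
```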