Apply dataframe.dtype_backend configuration option globally  #9879
Open
@jrbourbeau

Description

We recently added a dataframe.dtype_backend config option for specifying whether classic numpy-backed dtypes (e.g. int64, float64, etc.) or pyarrow-backed dtypes (e.g. int64[pyarrow], float64[pyarrow], etc.) should be used in a dask DataFrame.
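
For reference, here is a minimal pandas-level illustration of the two dtype families (a sketch, assuming pandas >= 1.5 with pyarrow installed):

import pandas as pd

# Classic numpy-backed dtype
s_numpy = pd.Series([1, 2, 3])
print(s_numpy.dtype)  # int64

# pyarrow-backed dtype, via the "int64[pyarrow]" string alias
s_arrow = pd.Series([1, 2, 3], dtype="int64[pyarrow]")
print(s_arrow.dtype)  # int64[pyarrow]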

However, today dataframe.dtype_backend is only used in dd.read_parquet. To extend where pyarrow dtypes can be used, and arguably for a more intuitive UX (xref #9840), I think we want dataframe.dtype_backend to work with all methods for creating a dask DataFrame. That is, things like the following

import dask
import dask.dataframe as dd

# Tell dask to use `pyarrow`-backed dtypes
dask.config.set({"dataframe.dtype_backend": "pyarrow"})

df = dd.read_csv(...)
df = dd.from_pandas(...)
df = dd.from_delayed(...)
...

should all return dask DataFrames backed by pyarrow dtypes.

Some methods, like read_parquet, will want a specialized implementation for this. However, in cases where a specialized implementation doesn't exist, we should still automatically cast the dask DataFrame to use pyarrow dtypes when dataframe.dtype_backend = "pyarrow". For example, through an (optional) map_partitions call after our existing DataFrame creation logic, as sketched below.
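
A rough sketch of what that post-creation cast could look like, using pandas' convert_dtypes(dtype_backend="pyarrow") (requires pandas >= 2.0); the _cast_to_pyarrow helper is hypothetical, not an existing dask function:

import dask
import dask.dataframe as dd
import pandas as pd

def _cast_to_pyarrow(part: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical helper: cast one partition to pyarrow-backed dtypes
    # (assumes pandas >= 2.0 with pyarrow installed)
    return part.convert_dtypes(dtype_backend="pyarrow")

ddf = dd.from_pandas(
    pd.DataFrame({"x": [1, 2, 3], "y": [1.0, 2.0, 3.0]}), npartitions=1
)

# Only cast when the config option asks for pyarrow dtypes
if dask.config.get("dataframe.dtype_backend", "numpy") == "pyarrow":
    ddf = ddf.map_partitions(_cast_to_pyarrow)

print(ddf.dtypes)  # e.g. x: int64[pyarrow], y: double[pyarrow]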

cc @rjzamora @phofl for visibility

Metadata

Assignees: No one assigned
Labels: dataframe · needs attention (It's been a while since this was pushed on. Needs attention from the owner or a maintainer.)
