8000 Make `pyarrow` strings easy to use · Issue #9946 · dask/dask · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content
Make pyarrow strings easy to use #9946
Open
@jrbourbeau

Description

@jrbourbeau

This is similar to #9879, but smaller in scope.

Motivation

We've seen several cases where using pyarrow strings for text data have significant memory usage / computation performance improvements (xref #9631, dask/community#301). We should make it easy for users to use utilize this performant data type.

Proposal

I'll propose we add a config option users can set to automatically convert object and string[python] data that's encountered to string[pyarrow]. We'll want this to work with all methods for creating dask DataFrames. That is, things like the following

import dask
import dask.dataframe as dd

# Tell dask to use `pyarrow`-strings for object dtypes
dask.config.set({"dataframe.object_as_pyarrow_string": True})  # Suggestions for a better name are welcome! 

df = dd.read_parquet(...)
df = dd.read_csv(...)
df = dd.from_pandas(...)
df = dd.from_delayed(...)
...

should all return dask DataFrames that use string[pyarrow] appropriately.

For some methods, like read_parquet, we'll want to have a specialized implementation as they'll be able to efficiently read data directly into string[pyarrow]. However, in cases where a specialized method isn't implemented, we should still automatically cast the dask DataFrame to use string[pyarrow] when the config option is set. For example, through an map_partitions call after our existing DataFrame creation logic.

Steps

Steps that I think make sense here are:

Notes

See #9926 where I'm taking an initial pass at adding the config option.

cc @rjzamora @quasiben @j-bennet @phofl for visibility

Metadata

Metadata

Assignees

No one assigned

    Labels

    dataframefeatureSomething is missingioneeds attentionIt's been a while since this was pushed on. Needs attention from the owner or a maintainer.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions

      0