Description
This is similar to #9879, but smaller in scope.
Motivation
We've seen several cases where using pyarrow
strings for text data have significant memory usage / computation performance improvements (xref #9631, dask/community#301). We should make it easy for users to use utilize this performant data type.
Proposal
I'll propose we add a config option users can set to automatically convert object
and string[python]
data that's encountered to string[pyarrow]
. We'll want this to work with all methods for creating dask DataFrame
s. That is, things like the following
import dask
import dask.dataframe as dd
# Tell dask to use `pyarrow`-strings for object dtypes
dask.config.set({"dataframe.object_as_pyarrow_string": True}) # Suggestions for a better name are welcome!
df = dd.read_parquet(...)
df = dd.read_csv(...)
df = dd.from_pandas(...)
df = dd.from_delayed(...)
...
should all return dask DataFrame
s that use string[pyarrow]
appropriately.
For some methods, like read_parquet
, we'll want to have a specialized implementation as they'll be able to efficiently read data directly into string[pyarrow]
. However, in cases where a specialized method isn't implemented, we should still automatically cast the dask DataFrame
to use string[pyarrow]
when the config option is set. For example, through an map_partitions
call after our existing DataFrame
creation logic.
Steps
Steps that I think make sense here are:
- Add a config option that automatically converts to
string[pyarrow]
dtype where appropriate (see Add option for converting string data to usepyarrow
strings #9926) - Add specialized implementation for
dd.read_parquet
(see Efficientdataframe.convert_string
support forread_parquet
#9979) - Add a CI job with the new config option turned on (see Add CI job with
pyarrow
strings turned on #10017) - Fix all test failures in the new CI job
- Add documentation
- (Optional) Make it easy to always use performant
string[pyarrow]
(e.g. emit a performance warning when using text data withoutstring[pyarrow]
, turn the config option on by default, etc).
Notes
See #9926 where I'm taking an initial pass at adding the config option.