Description
A common operation in big data processing frameworks is to "explode" struct columns into multiple columns in a single command. I'd like to be able to do this with Dask-cuDF struct columns with a single command, rather than the code shown below. This would be analogous to something like a LATERAL VIEW explode(col) in Hive.
I can't manipulate struct columns in Dask-cuDF yet (#8657), nor can I use a struct accessor to get the fields (#8658), so the following example uses cuDF to illustrate the behavior I'd like from Dask-cuDF. For example, given:
import cudf
s = cudf.Series([
    {"a": 5, "b": 10},
    {"a": 3, "b": 7},
    {"a": -3, "b": 11},
])
print(s)
0 {'a': 5, 'b': 10}
1 {'a': 3, 'b': 7}
2 {'a': -3, 'b': 11}
dtype: struct
I'd like to create the dataframe shown below without explicitly looping through every field. Today I can only get it with a loop:
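results = []
# Pull each struct field out as its own Series
for key in s.dtype.fields:
    results.append(s.struct.field(key))
# Combine the per-field Series column-wise, then name the columns after the fields
out = cudf.concat(results, axis=1)
out.columns = s.dtype.fields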
print(out)
a b
0 5 10
1 3 7
2 -3 11
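For reference, the semantics I have in mind can be sketched in plain Python, with no cuDF involved. This is only a model of the requested behavior, not a proposed implementation: `explode_struct` is a hypothetical name, and the handling of null structs and missing keys (filled with None) is an assumption on my part.

```python
# Plain-Python model of the requested behavior; not cuDF code.
# `explode_struct` is a hypothetical name used only for illustration.

def explode_struct(rows, fields):
    """Turn a list of dict "structs" into a dict of columns, one per field.

    Assumption: a null struct (None) or a missing key yields None in
    every affected output column.
    """
    columns = {name: [] for name in fields}
    for row in rows:
        for name in fields:
            columns[name].append(None if row is None else row.get(name))
    return columns


rows = [{"a": 5, "b": 10}, {"a": 3, "b": 7}, {"a": -3, "b": 11}]
out = explode_struct(rows, ["a", "b"])
print(out)
```

The point is that the number of rows stays the same while each struct field becomes its own column.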
We currently have an explode operator, but for now it appears to be a pass-through on struct columns. The explode docstring indicates it's designed for list-like columns. Perhaps extending explode to handle struct columns is one way to implement this.
s.explode()
0 {'a': 5, 'b': 10}
1 {'a': 3, 'b': 7}
2 {'a': -3, 'b': 11}
dtype: struct
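To make the contrast concrete, here is a plain-Python model of what a list-style explode does, again with no cuDF involved. `explode_list` is a hypothetical name, and keeping a row with a None element for null or empty lists is an assumption about the list semantics, not something taken from the explode docstring.

```python
# Plain-Python model of list-style explode; not cuDF code.
# `explode_list` is a hypothetical name used only for illustration.

def explode_list(rows):
    """Return (original_index, element) pairs, one output row per element.

    Assumption: a null or empty list keeps its row, with a None element.
    """
    out = []
    for idx, row in enumerate(rows):
        if not row:  # None or empty list
            out.append((idx, None))
        else:
            out.extend((idx, item) for item in row)
    return out


exploded = explode_list([[1, 2], [3], []])
print(exploded)
```

A list explode adds rows and repeats the index, whereas the struct behavior requested here would keep the rows and add one column per field, which may be why struct input currently passes through unchanged.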