[FEA] Explode struct column into multiple columns with Dask-cuDF

A common operation in big data processing frameworks is to "explode" struct columns into multiple columns in a single command. I'd like to be able to do this with Dask-cuDF struct columns with a single command, rather than the code shown below. This would be analogous to something like a LATERAL VIEW explode(col) in Hive.

Can't manipulate struct columns in Dask cuDF yet (#8657 ), nor can I use a struct accessor to get the fields (#8658), so the following example uses cuDF to illustrate the desired behavior with Dask. For example, given:

import cudf

s = cudf.Series([
    {"a":5, "b":10},
    {"a":3, "b":7},
    {"a":-3, "b":11}
    ]
)
print(s)
0     {'a': 5, 'b': 10}
1      {'a': 3, 'b': 7}
2    {'a': -3, 'b': 11}
dtype: struct

I'd like to create the following dataframe without explicitly looping through every field, which I can do today with:

results = []
for key in s.dtype.fields:
    results.append(s.struct.field(key))

out = cudf.concat(results, axis=1)
out.columns = s.dtype.fields
print(out)
   a   b
0  5  10
1  3   7
2 -3  11

We currently have an explode operator, but for now it appears to be a pass-through on struct columns. The explode docstring indicates it's designed for list-like columns. Perhaps this might be an area to explore for this.

s.explode()
0     {'a': 5, 'b': 10}
1      {'a': 3, 'b': 7}
2    {'a': -3, 'b': 11}
dtype: struct

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions