8000 [FEA] Explode struct column into multiple columns with Dask-cuDF · Issue #8660 · rapidsai/cudf · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content
[FEA] Explode struct column into multiple columns with Dask-cuDF #8660
Closed
@beckernick

Description

@beckernick

A common operation in big data processing frameworks is to "explode" struct columns into multiple columns in a single command. I'd like to be able to do this with Dask-cuDF struct columns with a single command, rather than the code shown below. This would be analogous to something like a LATERAL VIEW explode(col) in Hive.

Can't manipulate struct columns in Dask cuDF yet (#8657 ), nor can I use a struct accessor to get the fields (#8658), so the following example uses cuDF to illustrate the desired behavior with Dask. For example, given:

import cudfs = cudf.Series([
    {"a":5, "b":10},
    {"a":3, "b":7},
    {"a":-3, "b":11}
    ]
)
print(s)
0     {'a': 5, 'b': 10}
1      {'a': 3, 'b': 7}
2    {'a': -3, 'b': 11}
dtype: struct

I'd like to create the following dataframe without explicitly looping through every field, which I can do today with:

results = []
for key in s.dtype.fields:
    results.append(s.struct.field(key))
​
out = cudf.concat(results, axis=1)
out.columns = s.dtype.fields
print(out)
   a   b
0  5  10
1  3   7
2 -3  11

We currently have an explode operator, but for now it appears to be a pass-through on struct columns. The explode docstring indicates it's designed for list-like columns. Perhaps this might be an area to explore for this.

s.explode()
0     {'a': 5, 'b': 10}
1      {'a': 3, 'b': 7}
2    {'a': -3, 'b': 11}
dtype: struct

Metadata

Metadata

Assignees

Labels

PythonAffects Python cuDF API.daskDask issuefeature requestNew feature or request

Type

No type

Projects

No projects

Relationships

None yet

Development

No branches or pull requests

Issue actions

    0