8000 [GPU] Cross lane reduction rather than serial by pashu123 · Pull Request #20680 · iree-org/iree · GitHub

[GPU] Cross lane reduction rather than serial #20680


Merged 1 commit into iree-org:main from enabshuffle on May 1, 2025

Conversation

pashu123 (Contributor)

The reduction across subgroups was happening serially, i.e., each thread was doing the entire reduction. Now we distribute the values from shared memory among threads and perform a subgroup reduction.

pashu123 marked this pull request as draft April 30, 2025 01:31
pashu123 force-pushed the enabshuffle branch 2 times, most recently from dea2cd4 to 602d7bb, May 1, 2025 10:18
pashu123 marked this pull request as ready for review May 1, 2025 10:18
Groverkss (Contributor) left a comment


Nice! LGTM. I have a comment for @qedawkins to respond to; please wait for that, otherwise good to land.

-  SmallVector<bool> inBounds(unDistributedType.getRank(), true);
+  SmallVector<bool> inBounds(unDistributedType.getRank(), false);
Contributor


Hmm, this is okay for now. I would prefer generating masks, but I'm not sure what's best to codegen here. @qedawkins Do you have any recommendations here?

For context, this is making 64 (subgroup_size) threads access an array of at most 16 elements, so some of them will go out of bounds.

Contributor


I'd go pull an architecture manual to make sure that LDS isn't silently padded either so that we can skip the conditional on the read?

Contributor


The lowering is architecture-independent, so that should really be checked in the transfer_read lowering, not here. The same implementation should work for CUDA as well.

pashu123 (Contributor, Author)


I'll merge this for now. I'll create an issue for mask vs in_bounds, and we can discuss further.

pashu123 merged commit 4cfbda8 into iree-org:main May 1, 2025
42 checks passed
nirvedhmeshram pushed a commit that referenced this pull request May 6, 2025
KyleHerndon pushed a commit to KyleHerndon/iree that referenced this pull request May 7, 2025
3 participants