[GPU] Cross lane reduction rather than serial #20680
Conversation
Force-pushed from dea2cd4 to 602d7bb
The reduction across subgroups was happening serially, i.e., each thread was doing the entire reduction. Now we distribute the values from shared memory among threads and perform a subgroup reduction.
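For anyone reading along, here is a minimal CUDA-style sketch of the before/after (hypothetical names, one warp per block, warp size 32 rather than the 64-lane subgroup discussed below; not the actual IREE lowering):

```cpp
#include <cuda_runtime.h>

// Hypothetical sketch of the change; NUM_PARTIALS and the kernel are made up.
#define NUM_PARTIALS 16

__global__ void reducePartials(const float *in, float *out) {
  __shared__ float partials[NUM_PARTIALS];
  int lane = threadIdx.x % warpSize;

  // Stage the per-subgroup partial sums (assumed precomputed) into shared memory.
  if (threadIdx.x < NUM_PARTIALS)
    partials[threadIdx.x] = in[threadIdx.x];
  __syncthreads();

  // Before: every thread walked the whole array serially.
  //   float acc = 0.f;
  //   for (int i = 0; i < NUM_PARTIALS; ++i) acc += partials[i];

  // After: each lane reads at most one partial, then a shuffle-based
  // subgroup reduction combines them in log2(warpSize) steps.
  float val = (lane < NUM_PARTIALS) ? partials[lane] : 0.f;
  for (int offset = warpSize / 2; offset > 0; offset /= 2)
    val += __shfl_down_sync(0xffffffffu, val, offset);

  if (threadIdx.x == 0)
    *out = val;  // lane 0 holds the final sum
}
```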
Nice! LGTM. I have a comment for @qedawkins to respond to; please wait for that, otherwise good to land.
-  SmallVector<bool> inBounds(unDistributedType.getRank(), true);
+  SmallVector<bool> inBounds(unDistributedType.getRank(), false);
Hmm, this is okay for now. I would prefer generating masks, but I'm not sure what's best to codegen here. @qedawkins, do you have any recommendations here?
For context, this makes 64 (subgroup_size) threads access an array of at most 16 elements, so some of them will go out of bounds.
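For illustration only (CUDA-flavored pseudocode with made-up names, not the actual vector dialect lowering), the two options amount to:

```cpp
// Option discussed as in_bounds = false: the backend guards the access itself
// and substitutes the reduction identity for out-of-bounds lanes.
__device__ float guardedRead(const float *lds, int lane, int numElems) {
  return (lane < numElems) ? lds[lane] : 0.f;
}

// Option discussed as explicit masking: the predicate is materialized first
// and carried alongside the read, so later passes can see it; the generated
// code ends up equivalent.
__device__ float maskedRead(const float *lds, bool laneActive, int lane) {
  return laneActive ? lds[lane] : 0.f;
}
```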
I'd pull up an architecture manual to check whether LDS is silently padded, in which case we could skip the conditional on the read.
The lowering is architecture independent, so that should really be checked in the transfer_read lowering, not here. The same implementation should work for CUDA as well.
I'll merge this for now. I'll create an issue for mask vs in_bounds, and we can discuss further.