[GPU] Cross lane reduction rather than serial #20680
Conversation
Force-pushed from dea2cd4 to 602d7bb
The reduction across subgroups was happening serially, i.e., each thread was doing the entire reduction. Now we distribute the values from shared memory among threads and perform a subgroup reduction.
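For anyone reading along, here is a minimal CUDA-style sketch of the before/after (hypothetical names, one warp per block, warp size 32 rather than the 64-lane subgroup discussed below; not the actual IREE lowering):

```cpp
#include <cuda_runtime.h>

// Hypothetical sketch of the change; NUM_PARTIALS and the kernel are made up.
#define NUM_PARTIALS 16

__global__ void reducePartials(const float *in, float *out) {
  __shared__ float partials[NUM_PARTIALS];
  int lane = threadIdx.x % warpSize;

  // Stage the per-subgroup partial sums (assumed precomputed) into shared memory.
  if (threadIdx.x < NUM_PARTIALS)
    partials[threadIdx.x] = in[threadIdx.x];
  __syncthreads();

  // Before: every thread walked the whole array serially.
  //   float acc = 0.f;
  //   for (int i = 0; i < NUM_PARTIALS; ++i) acc += partials[i];

  // After: each lane reads at most one partial, then a shuffle-based
  // subgroup reduction combines them in log2(warpSize) steps.
  float val = (lane < NUM_PARTIALS) ? partials[lane] : 0.f;
  for (int offset = warpSize / 2; offset > 0; offset /= 2)
    val += __shfl_down_sync(0xffffffffu, val, offset);

  if (threadIdx.x == 0)
    *out = val;  // lane 0 holds the final sum
}
```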
Nice! LGTM. I have a comment for @qedawkins to respond to; please wait for that, otherwise good to land.
-  SmallVector<bool> inBounds(unDistributedType.getRank(), true);
+  SmallVector<bool> inBounds(unDistributedType.getRank(), false);
Hmm, this is okay for now. I would prefer generating masks, but I'm not sure what's best to codegen here. @qedawkins, do you have any recommendations here?
For context, this makes 64 (subgroup_size) threads access an array of at most 16 elements, so some of them will go out of bounds.
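For illustration only (CUDA-flavored pseudocode with made-up names, not the actual vector dialect lowering), the two options amount to:

```cpp
// Option discussed as in_bounds = false: the backend guards the access itself
// and substitutes the reduction identity for out-of-bounds lanes.
__device__ float guardedRead(const float *lds, int lane, int numElems) {
  return (lane < numElems) ? lds[lane] : 0.f;
}

// Option discussed as explicit masking: the predicate is materialized first
// and carried alongside the read, so later passes can see it; the generated
// code ends up equivalent.
__device__ float maskedRead(const float *lds, bool laneActive, int lane) {
  return laneActive ? lds[lane] : 0.f;
}
```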
I'd pull up an architecture manual to check whether LDS is silently padded, in which case we could skip the conditional on the read.
The lowering is architecture independent, so that should really be checked in the transfer_read lowering, not here. The same implementation should work for CUDA as well.
I'll merge this for now. I'll create an issue for mask vs in_bounds, and we can discuss further.