[macOS GPU Support] Optimize `findBlocksWithInteractions` for Apple silicon GPUs by philipturner · Pull Request #3959 · openmm/openmm · GitHub

[macOS GPU Support] Optimize findBlocksWithInteractions for Apple silicon GPUs #3959


Draft · wants to merge 1 commit into master

Conversation

philipturner (Contributor)

The Apple Metal/AIR backend compiler does not automatically unroll loops, even when you attempt to force it. This should speed up the kernel itself by just under 19%, which translates to a speedup on the order of 1% in some benchmarks.

Resolves #3924
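
For illustration, the manual-unrolling pattern looks roughly like the sketch below. This is CUDA-style syntax with a hypothetical accumulate body standing in for the kernel's real per-iteration work; it is not the PR's actual code.

```cuda
// Sketch only: manual unrolling of a 32-iteration inner loop when the compiler
// ignores unroll hints. The body is stamped out with a macro so every iteration
// is emitted at compile time with a constant index.
__device__ float accumulate(float acc, const float* tile, int j) {
    return acc + tile[j]; // hypothetical stand-in for the real per-iteration work
}

__device__ float processTile(const float* tile) {
    float acc = 0.0f;
#ifdef MANUAL_UNROLL
    #define ITER(j) acc = accumulate(acc, tile, j);
    ITER(0)  ITER(1)  ITER(2)  ITER(3)  ITER(4)  ITER(5)  ITER(6)  ITER(7)
    ITER(8)  ITER(9)  ITER(10) ITER(11) ITER(12) ITER(13) ITER(14) ITER(15)
    ITER(16) ITER(17) ITER(18) ITER(19) ITER(20) ITER(21) ITER(22) ITER(23)
    ITER(24) ITER(25) ITER(26) ITER(27) ITER(28) ITER(29) ITER(30) ITER(31)
    #undef ITER
#else
    // Rolled form: the Apple AIR compiler keeps this as a real loop, even with
    // an unroll hint.
    #pragma unroll 32
    for (int j = 0; j < 32; j++)
        acc = accumulate(acc, tile, j);
#endif
    return acc;
}
```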

@philipturner philipturner changed the title Optimize findBlocksWithInteractions for Apple silicon GPUs [macOS GPU Support] Optimize findBlocksWithInteractions for Apple silicon GPUs Feb 12, 2023
@peastman (Member)

Here are the speed differences I see on the benchmarks with this change.

Benchmark Before After Ratio
gbsa 408.387 406.496 0.995
rf 255.078 258.069 1.012
pme 147.489 148.896 1.010
apoa1rf 59.8386 61.0192 1.020
apoa1pme 35.2012 35.4637 1.007
apoa1ljpme 26.1992 26.4046 1.008
amoebagk 0.535711 0.538338 1.005
amoebapme 2.16664 2.14889 0.992
amber20-dhfr 155.635 156.501 1.006
amber20-cellulose 6.96229 7.02266 1.009
amber20-stmv 1.82035 1.84319 1.013

All differences are within the noise level. I'm not sure it's worth adding complexity to the code for a negligible speedup.

@philipturner (Contributor, Author) commented Feb 14, 2023

Could you rerun gbsa and amoebapme for about 3 additional trials? There seems to be a statistically significant, universal +1% speedup, as I predicted. The only way this would harm performance is by thrashing the instruction cache, if the inner 32-iteration loop had 375 bytes of instructions (47 instructions).

We could also spot other loops in the code base that would benefit from the optimization. Then the speedup might reach 2% or 3%.

@peastman (Member)

We could also spot other loops in the code base that would benefit from the optimization. Then the speedup might reach 2% or 3%.

Let's see if we can do that. I'm just hesitant to add complexity to an already complex kernel for a 1% speedup on a subset of benchmarks. Especially since it only happens on macOS, and will likely become unnecessary in the future. Loop unrolling is a standard optimization most compilers do. It's surprising that Apple's doesn't, and it's likely they'll add it in the future.

@philipturner (Contributor, Author)

It's surprising that Apple's doesn't, and it's likely they'll add it in the future.

I've seen instances where they do; it just unrolls a small amount, and you can't control it. It might unroll 2-4 iterations of a small loop, but not 32. That's probably a heuristic to account for the M1's comparatively small L1I.

@peastman (Member)

And I'm guessing they don't support #pragma unroll 32?

@philipturner (Contributor, Author) commented Feb 15, 2023

That doesn't do anything. I remember trying it here and seeing no change. The repo provides a very simple way to test how something affects this kernel's performance.

@bdenhollander (Contributor)

We could also spot other loops in the code base that would benefit from the optimization. Then the speedup might reach 2% or 3%.

Try unrolling some of the < TILE_SIZE loops, which are typically 32 iterations. computeGKForces and computeElectrostatics would be targets for amoebagk and amoebapme respectively.

@philipturner (Contributor, Author)

computeGKForces and computeElectrostatics would be targets for amoebagk and amoebapme respectively.

That would probably harm performance, as those loops are too large. Force-unrolling up to four iterations might allow more ILP, but we can't afford to overflow the instruction cache.
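
As a hedged sketch (CUDA-style syntax, with a hypothetical sumTile body rather than the real AMOEBA code), a partial unroll by four with independent accumulators would look like this; the point is to overlap iteration latencies without multiplying the code size by 32.

```cuda
// Sketch only: unroll by 4 with separate accumulators to expose ILP.
__device__ float sumTile(const float* tile) {
    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
    // 32 iterations processed 4 at a time; the four dependency chains are
    // independent, so their latencies can overlap.
    for (int j = 0; j < 32; j += 4) {
        a0 += tile[j + 0];
        a1 += tile[j + 1];
        a2 += tile[j + 2];
        a3 += tile[j + 3];
    }
    return (a0 + a1) + (a2 + a3);
}
```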

@philipturner (Contributor, Author)

Ultimately, the purpose is to implement the optimizations from #3924. They can only be fully realized with Metal, creating a 3% speedup (no half). Restricting to OpenCL features limits this to 1%.

@philipturner philipturner marked this pull request as draft February 25, 2023 03:34
@philipturner (Contributor, Author) commented Feb 25, 2023

Converted to draft because of #3979.

@philipturner (Contributor, Author) commented Mar 9, 2023

I just reached 129 -> 138 ns/day on apoa1rf - see philipturner/openmm-metal@830c2e8. The matrix multiply optimization harmed performance, likely because there was already so much threadgroup traffic. In my benchmarking repo, I had simplified the shader and removed a lot of traffic. I think we can see ever greater speedups by incorporating the full CUDA implementation for M1 and maybe RDNA or GCN 3, now that we can use SIMD ballots. The CUDA implementation is modestly different and uses a saveSinglePairs kernel. It also proactively skips certain internal loop iterations, which could boost ns/day beyond the theoretical maximum I predicted.

Before re-opening this PR, I'll need to split the Metal header incorporation into another PR. I need to figure out a way to minimize compiler overhead and license the external dependency properly. Can we build a .pch during the CMake build and embed it into a string literal, similar to the .cl files?

And for the subsequent PR, how about translating the CUDA findBlocksWithInteractions implementation into the common compute language? That would reduce the headache of maintaining two different OpenCL versions.

@peastman (Member)

Don't worry about the saveSinglePairs() optimization. The effect is tiny, maybe a 1-2% speedup at best. It also is fairly brittle and needs to be re-tuned with every generation. Even on CUDA it's probably more trouble than it's worth.

The biggest optimization is that we can use __ballot_sync() combined with __popc() to do a prefix sum of the binary flags over a whole warp in just a few instructions.
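
For reference, the idiom looks roughly like this (a sketch in CUDA, not the exact OpenMM kernel code):

```cuda
// Sketch of the ballot + popcount idiom for a per-warp prefix sum of binary flags.
__device__ void warpFlagPrefixSum(int flag, int* exclusivePrefix, int* total) {
    const unsigned fullMask = 0xffffffffu;
    int lane = threadIdx.x & 31;

    // One bit per lane: bit i is set if lane i's flag is nonzero.
    unsigned bits = __ballot_sync(fullMask, flag);

    // Set bits below this lane -> exclusive prefix sum of the flags.
    *exclusivePrefix = __popc(bits & ((1u << lane) - 1u));

    // All set bits -> total count for the warp.
    *total = __popc(bits);
}
```

A lane whose flag is set can then write its entry at outputBase + exclusivePrefix (hypothetical name), compacting the interacting blocks without a shared-memory scan.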

@philipturner (Contributor, Author)

I tried that and now half of the tests are failing. The apoa1rf benchmark is silently crashing. Do you have any idea what's happening?

philipturner/openmm-metal@a0ebbd3

@philipturner (Contributor, Author)

Even without the ballot optimization, there are some impressive improvements. I still need to tune the force thread block sizes for AMOEBA, as @bdenhollander suggested on the Mac AMD thread. Extrapolating from the RTX 4080 for comparison, we can reach 141 ns/day on M1 Max apoa1rf; extrapolating from the RTX 2080 at 2x its own TFLOPS, 150 ns/day.

  • Optimization 1: fixing SIMD width, thread blocks, and loop unrolling for findBlocksWithInteractions
  • Optimization 2: fixing command queue flushing
  • Optimization 3: switching to simd_prefix_inclusive_sum in findBlocksWithInteractions
ns/day OpenMM 8.0.0 Opt. 1 Opt. 2 Opt. 3 Extrapolating 2080
gbsa 264 378.0 632.8 631.8 974.9
rf 153 304.5 421.5 452.8 568.5
pme 120 199.6 246.8 258.6 463.5
apoa1rf 42 114.9 129.0 138.5 149.7
apoa1pme 33 67.2 72.3 75.3 103.7
apoa1ljpme 28 50.2 53.5 55.1 83.6
amoebagk 0.8 1.184 1.187 1.185 8.512
amoebapme 2.72 3.97 4.00 4.05 7.72
amber20-dhfr TBD 212.0 267.0 275.7 TBD
amber20-cellulose TBD 14.36 14.78 15.69 17.72
amber20-stmv TBD 3.83 3.88 4.29 5.23
Speed vs 8.0.0 Opt. 1 Opt. 2 Opt. 3 Extrapolating 2080
gbsa +43% +140% +139% +269%
rf +99% +175% +196% +271%
pme +66% +106% +116% +286%
apoa1rf +174% +207% +230% +256%
apoa1pme +104% +119% +128% +214%
apoa1ljpme +79% +91% +97% +199%
amoebagk +48% +48% +48% +964%
amoebapme +46% +47% +49% +184%
Peak Watts Opt. 1 Opt. 2 Opt. 3
gbsa 13.2 22.1 21.6
rf 20.0 27.7 28.0
pme 16.8 21.2 21.0
apoa1rf 30.8 35.3 35.8
apoa1pme 23.6 25.6 25.3
apoa1ljpme 23.5 25.0 24.8
amoebagk 15.4 15.3 15.2
amoebapme 17.3 17.3 17.5
amber20-dhfr 16.9 21.1 21.5
amber20-cellulose 27.5 28.1 28.6
amber20-stmv 27.1 27.4 28.5

Optimization 3 does not change power draw, and it improves everything except AMOEBA, GBSA, and GK.

@bdenhollander (Contributor)

@philipturner There are loops in gridInterpolateForce and gridSpreadCharge that AMD unrolls. Search for for (int ix = 0; ix < PME_ORDER; ix++) and try unrolling all three levels or the innermost two depending on how many registers are available. This may improve performance on pme benchmarks.

PME_ORDER is set to PmeOrder, which is defined as static const int PmeOrder = 5;
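
For illustration, that loop nest has roughly the following shape (a hypothetical sketch in CUDA-style syntax, not the actual gridSpreadCharge code), with the inner two levels unrolled via pragmas:

```cuda
// Sketch only: PME_ORDER^3 loop nest with the inner two levels unrolled.
#define PME_ORDER 5

__device__ float spreadSketch(const float* bx, const float* by, const float* bz) {
    float result = 0.0f;
    for (int ix = 0; ix < PME_ORDER; ix++) {
        #pragma unroll
        for (int iy = 0; iy < PME_ORDER; iy++) {
            #pragma unroll
            for (int iz = 0; iz < PME_ORDER; iz++) {
                // Stand-in for accumulating the separable B-spline weights.
                result += bx[ix] * by[iy] * bz[iz];
            }
        }
    }
    return result;
}
```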

@philipturner (Contributor, Author)

If I'm going to maintain a separate Metal backend after all, I think there's less need to integrate the entire header into OpenMM. @peastman could we just paste the declaration for simd_prefix_inclusive_sum(float) and simd_broadcast(float, ushort), and not worry about the header?

@peastman (Member)

That seems reasonable.

@philipturner (Contributor, Author) commented Mar 23, 2023

I got GPT-4's help on the ballot optimization bug. It isn't a silver bullet, but it outlined an incremental approach to test the CUDA optimization bit by bit.

https://gist.github.com/philipturner/dfe2d99a28bf5771c6bcfa589264b5b0

@bdenhollander can you verify this claim? Some of GPT-4's statements are completely false/hallucinated, but the good ones are very insightful.

“As for the CMPSELECT instruction, I have not found any evidence that the AMD GPU has a similar instruction that can perform two integer operations in one instruction. However, the AMD GPU has other instructions that can perform multiple operations in one instruction, such as V_MAD_I32_I24 or V_MAD_U64_U32. These instructions can perform two integer multiplications and one integer addition in one instruction. These instructions may have different performance characteristics than the CMPSELECT instruction, depending on the data types, the operands, and the latency.”

@bdenhollander (Contributor)

V_MAD_I32_I24 and V_MAD_U64_U32 are mentioned in all variants of the RDNA ISAs: RDNA, RDNA2, and RDNA3.

VOP3 instructions might be what you're looking for.

@philipturner (Contributor, Author) commented Mar 24, 2023

VOP3 instructions might be what you're looking for.

Apple's CMPSEL instruction takes 4 input operands, requiring a massive 80 bits to encode. For example, it is used for the max and select operations. The hardware may also use it for if (x == y) { } else { }, fusing the comparison and branch selection.

max(x, y): CMPSEL x > y ? x : y
select(x, y, a < b): CMPSEL a < b ? x : y

AMD only supports 3 inputs. Can it fuse a comparison and selection into a single cycle?


@philipturner (Contributor, Author)

GPT-4 was right. The bug stems from CUDA __ffs having very different behavior than OpenCL ctz.

The count trailing zeros and count leading zeros functions may not work as expected on OpenCL devices. These functions are also vendor-specific functions that may have different implementations or behaviors on different devices. For example, some devices may return undefined values for zero inputs or have different rounding modes for floating-point inputs. You may want to check the documentation of your device and the OpenCL specification to see if there are any differences or limitations for these functions. You may also want to test these functions with some simple inputs and outputs to see if they match your expectations.
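
Concretely, the difference is a one-based versus zero-based convention plus different zero-input behavior. A sketch of the mapping (for illustration, not the PR's code):

```cuda
// CUDA __ffs() is 1-based and returns 0 for a zero input; OpenCL/Metal ctz()
// counts trailing zeros and returns the bit width (32) for a zero input.
//
//   mask (binary)     __ffs(mask)   ctz(mask)
//   ...00001000            4            3
//   ...00000001            1            0
//   0                      0           32
//
// Emulating ctz()-style behavior in CUDA, for illustration:
__device__ int ctzLike(unsigned mask) {
    return (mask == 0u) ? 32 : (__ffs(mask) - 1);
}
// A CUDA line such as  int first = __ffs(mask) - 1;  only translates to
// int first = ctz(mask);  when mask is guaranteed to be nonzero.
```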

@philipturner (Contributor, Author) commented Mar 24, 2023

Optimization 4: ballot optimization. Look at apoa1rf, amber20-cellulose, and amber20-stmv.

ns/day OpenMM 8.0.0 Opt. 1 Opt. 2 Opt. 3 Opt. 4 Extrapolating 2080
gbsa 264 378.0 632.8 631.8 635.0 974.9
rf 153 304.5 421.5 452.8 476.3 568.5
pme 120 199.6 246.8 258.6 267.2 463.5
apoa1rf 42 114.9 129.0 138.5 151.4 149.7
apoa1pme 33 67.2 72.3 75.3 79.0 103.7
apoa1ljpme 28 50.2 53.5 55.1 56.9 83.6
amoebagk 0.8 1.184 1.187 1.185 1.189 8.512
amoebapme 2.72 3.97 4.00 4.05 4.07 7.72
amber20-dhfr 131.7 212.0 267.0 275.7 279.2 TBD
amber20-cellulose 5.92 14.36 14.78 15.69 19.93 17.72
amber20-stmv 1.40 3.83 3.88 4.29 6.89 5.23
Speed vs 8.0.0 Opt. 1 Opt. 2 Opt. 3 Opt. 4 Extrapolating 2080
gbsa +43% +140% +139% +141% +269%
rf +99% +175% +196% +211% +271%
pme +66% +106% +116% +123% +286%
apoa1rf +174% +207% +230% +260% +256%
apoa1pme +104% +119% +128% +139% +214%
apoa1ljpme +79% +91% +97% +103% +199%
amoebagk +48% +48% +48% +49% +964%
amoebapme +46% +47% +49% +50% +184%
amber20-dhfr +61% +103% +109% +112% TBD
amber20-cellulose +143% +150% +165% +237% TBD
amber20-stmv +174% +177% +206% +392% TBD

@bdenhollander (Contributor)

AMD only supports 3 inputs. Can it fuse a comparison and selection into a single cycle?

I can't find anything like that for RDNA. Pre-GCN, TeraScale/VLIW5/VLIW4 might have been able to do something similar, but only when comparing to 0.

@philipturner (Contributor, Author)

That would be a 3-input instruction, so it seems AMD doesn't have anything similar.

@philipturner (Contributor, Author)

@philipturner There are loops in gridInterpolateForce and gridSpreadCharge that AMD unrolls. Search for for (int ix = 0; ix < PME_ORDER; ix++) and try unrolling all three levels or the innermost two depending on how many registers are available. This may improve performance on pme benchmarks.

I just unrolled the innermost loop in each of these functions, but both optimizations failed.

Linked issue: Optimizing findBlocksWithInteractions for M1 (#3924)