[macOS GPU Support] Optimize `findBlocksWithInteractions` for Apple silicon GPUs by philipturner · Pull Request #3959 · openmm/openmm · GitHub

[macOS GPU Support] Optimize findBlocksWithInteractions for Apple silicon GPUs #3959


Draft · wants to merge 1 commit into master

Conversation

philipturner (Contributor)

The Apple Metal/AIR backend compiler does not automatically unroll loops, even when you attempt to force it. This should speed up the kernel itself by just under 19%, which translates to a speedup on the order of 1% in some benchmarks.

Resolves #3924
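
For illustration, the manual-unrolling pattern looks roughly like the sketch below. This is CUDA-style syntax with a hypothetical accumulate body standing in for the kernel's real per-iteration work; it is not the PR's actual code.

```cuda
// Sketch only: manual unrolling of a 32-iteration inner loop when the compiler
// ignores unroll hints. The body is stamped out with a macro so every iteration
// is emitted at compile time with a constant index.
__device__ float accumulate(float acc, const float* tile, int j) {
    return acc + tile[j]; // hypothetical stand-in for the real per-iteration work
}

__device__ float processTile(const float* tile) {
    float acc = 0.0f;
#ifdef MANUAL_UNROLL
    #define ITER(j) acc = accumulate(acc, tile, j);
    ITER(0)  ITER(1)  ITER(2)  ITER(3)  ITER(4)  ITER(5)  ITER(6)  ITER(7)
    ITER(8)  ITER(9)  ITER(10) ITER(11) ITER(12) ITER(13) ITER(14) ITER(15)
    ITER(16) ITER(17) ITER(18) ITER(19) ITER(20) ITER(21) ITER(22) ITER(23)
    ITER(24) ITER(25) ITER(26) ITER(27) ITER(28) ITER(29) ITER(30) ITER(31)
    #undef ITER
#else
    // Rolled form: the Apple AIR compiler keeps this as a real loop, even with
    // an unroll hint.
    #pragma unroll 32
    for (int j = 0; j < 32; j++)
        acc = accumulate(acc, tile, j);
#endif
    return acc;
}
```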

@philipturner philipturner changed the title Optimize findBlocksWithInteractions for Apple silicon GPUs [macOS GPU Support] Optimize findBlocksWithInteractions for Apple silicon GPUs Feb 12, 2023
@peastman (Member)

Here are the speed differences I see on the benchmarks with this change.

Benchmark Before After Ratio
gbsa 408.387 406.496 0.995
rf 255.078 258.069 1.012
pme 147.489 148.896 1.010
apoa1rf 59.8386 61.0192 1.020
apoa1pme 35.2012 35.4637 1.007
apoa1ljpme 26.1992 26.4046 1.008
amoebagk 0.535711 0.538338 1.005
amoebapme 2.16664 2.14889 0.992
amber20-dhfr 155.635 156.501 1.006
amber20-cellulose 6.96229 7.02266 1.009
amber20-stmv 1.82035 1.84319 1.013

All differences are within the noise level. I'm not sure it's worth adding complexity to the code for a negligible speedup.

@philipturner (Contributor, Author) commented Feb 14, 2023

Could you rerun gbsa and amoebapme for about 3 additional trials? There seems to be a statistically significant, universal +1% speedup, as I predicted. The only way this would harm performance is by thrashing the instruction cache, if the inner 32-iteration loop had 375 bytes of instructions (47 instructions).

We could also spot other loops in the code base that would benefit from the optimization. Then the speedup might reach 2% or 3%.

@peastman (Member)

We could also spot other loops in the code base that would benefit from the optimization. Then the speedup might reach 2% or 3%.

Let's see if we can do that. I'm just hesitant to add complexity to an already complex kernel for a 1% speedup on a subset of benchmarks. Especially since it only happens on macOS, and will likely become unnecessary in the future. Loop unrolling is a standard optimization most compilers do. It's surprising that Apple's doesn't, and it's likely they'll add it in the future.

@philipturner (Contributor, Author)

It's surprising that Apple's doesn't, and it's likely they'll add it in the future.

I've seen instances where they do; it just unrolls a small amount, and you can't control it. It might unroll 2-4 iterations of a small loop, but not 32. That's probably a heuristic to account for the M1's comparatively small L1I.

@peastman (Member)

And I'm guessing they don't support #pragma unroll 32?

@philipturner (Contributor, Author) commented Feb 15, 2023

That doesn't do anything. I remember trying it here and seeing no change. The repo provides a very simple way to test how something affects this kernel's performance.

@bdenhollander (Contributor)

We could also spot other loops in the code base that would benefit from the optimization. Then the speedup might reach 2% or 3%.

Try unrolling some of the < TILE_SIZE loops, which are typically 32 iterations. computeGKForces and computeElectrostatics would be targets for amoebagk and amoebapme respectively.

@philipturner (Contributor, Author)

computeGKForces and computeElectrostatics would be targets for amoebagk and amoebapme respectively.

That would probably harm performance, as those loops are too large. Force-unrolling up to four iterations might allow more ILP, but we can't afford to overflow the instruction cache.
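
As a hedged sketch (CUDA-style syntax, with a hypothetical sumTile body rather than the real AMOEBA code), a partial unroll by four with independent accumulators would look like this; the point is to overlap iteration latencies without multiplying the code size by 32.

```cuda
// Sketch only: unroll by 4 with separate accumulators to expose ILP.
__device__ float sumTile(const float* tile) {
    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
    // 32 iterations processed 4 at a time; the four dependency chains are
    // independent, so their latencies can overlap.
    for (int j = 0; j < 32; j += 4) {
        a0 += tile[j + 0];
        a1 += tile[j + 1];
        a2 += tile[j + 2];
        a3 += tile[j + 3];
    }
    return (a0 + a1) + (a2 + a3);
}
```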

@philipturner (Contributor, Author)

Ultimately, the purpose is to implement the optimizations from #3924. They can only be fully realized with Metal, creating a 3% speedup (no half). Restricting to OpenCL features limits this to 1%.

@philipturner philipturner marked this pull request as draft February 25, 2023 03:34
@philipturner (Contributor, Author) commented Feb 25, 2023

Converted to draft because of #3979.

@philipturner (Contributor, Author) commented Mar 9, 2023

I just reached 129 -> 138 ns/day on apoa1rf - see philipturner/openmm-metal@830c2e8. The matrix multiply optimization harmed performance, likely because there was already so much threadgroup traffic. In my benchmarking repo, I had simplified the shader and removed a lot of traffic. I think we can see ever greater speedups by incorporating the full CUDA implementation for M1 and maybe RDNA or GCN 3, now that we can use SIMD ballots. The CUDA implementation is modestly different and uses a saveSinglePairs kernel. It also proactively skips certain internal loop iterations, which could boost ns/day beyond the theoretical maximum I predicted.

Before re-opening this PR, I'll need to split the Metal header incorporation into another PR. I need to figure out a way to minimize compiler overhead and license the external dependency properly. Can we build a .pch during the CMake build and embed it into a string literal, similar to the .cl files?

And for the subsequent PR, how about translating the CUDA findBlocksWithInteractions implementation into the common compute language? That would reduce the headache of maintaining two different OpenCL versions.

@peastman (Member)

Don't worry about the saveSinglePairs() optimization. The effect is tiny, maybe a 1-2% speedup at best. It also is fairly brittle and needs to be re-tuned with every generation. Even on CUDA it's probably more trouble than it's worth.

The biggest optimization is that we can use __ballot_sync() combined with __popc() to do a prefix sum of the binary flags over a whole warp in just a few instructions.
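
For reference, the idiom looks roughly like this (a sketch in CUDA, not the exact OpenMM kernel code):

```cuda
// Sketch of the ballot + popcount idiom for a per-warp prefix sum of binary flags.
__device__ void warpFlagPrefixSum(int flag, int* exclusivePrefix, int* total) {
    const unsigned fullMask = 0xffffffffu;
    int lane = threadIdx.x & 31;

    // One bit per lane: bit i is set if lane i's flag is nonzero.
    unsigned bits = __ballot_sync(fullMask, flag);

    // Set bits below this lane -> exclusive prefix sum of the flags.
    *exclusivePrefix = __popc(bits & ((1u << lane) - 1u));

    // All set bits -> total count for the warp.
    *total = __popc(bits);
}
```

A lane whose flag is set can then write its entry at outputBase + exclusivePrefix (hypothetical name), compacting the interacting blocks without a shared-memory scan.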

@philipturner (Contributor, Author)

I tried that and now half of the tests are failing. The apoa1rf benchmark is silently crashing. Do you have any idea what's happening?

philipturner/openmm-metal@a0ebbd3

@philipturner (Contributor, Author)

Even without the ballot optimization, there are some impressive improvements. I still need to tune the force thread block sizes for AMOEBA, as @bdenhollander suggested on the Mac AMD thread. Extrapolating from the RTX 4080 for comparison, we can reach 141 ns/day on M1 Max apoa1rf; extrapolating from the RTX 2080 at 2x its own TFLOPS, 150 ns/day.

  • Optimization 1: fixing SIMD width, thread blocks, and loop unrolling for findBlocksWithInteractions
  • Optimization 2: fixing command queue flushing
  • Optimization 3: switching to simd_prefix_inclusive_sum in findBlocksWithInteractions
ns/day OpenMM 8.0.0 Opt. 1 Opt. 2 Opt. 3 Extrapolating 2080
gbsa 264 378.0 632.8 631.8 974.9
rf 153 304.5 421.5 452.8 568.5
pme 120 199.6 246.8 258.6 463.5
apoa1rf 42 114.9 129.0 138.5 149.7
apoa1pme 33 67.2 72.3 75.3 103.7
apoa1ljpme 28 50.2 53.5 55.1 83.6
amoebagk 0.8 1.184 1.187 1.185 8.512
amoebapme 2.72 3.97 4.00 4.05 7.72
amber20-dhfr TBD 212.0 267.0 275.7 TBD
amber20-cellulose TBD 14.36 14.78 15.69 17.72
amber20-stmv TBD 3.83 3.88 4.29 5.23
Speed vs 8.0.0 Opt. 1 Opt. 2 Opt. 3 Extrapolating 2080
gbsa +43% +140% +139% +269%
rf +99% +175% +196% +271%
pme +66% +106% +116% +286%
apoa1rf +174% +207% +230% +256%
apoa1pme +104% +119% +128% +214%
apoa1ljpme +79% +91% +97% +199%
amoebagk +48% +48% +48% +964%
amoebapme +46% +47% +49% +184%
Peak Watts Opt. 1 Opt. 2 Opt. 3
gbsa 13.2 22.1 21.6
rf 20.0 27.7 28.0
pme 16.8 21.2 21.0
apoa1rf 30.8 35.3 35.8
apoa1pme 23.6 25.6 25.3
apoa1ljpme 23.5 25.0 24.8
amoebagk 15.4 15.3 15.2
amoebapme 17.3 17.3 17.5
amber20-dhfr 16.9 21.1 21.5
amber20-cellulose 27.5 28.1 28.6
amber20-stmv 27.1 27.4 28.5

Optimization 3 does not change power draw, and it improves everything except AMOEBA, GBSA, and GK.

@bdenhollander (Contributor)

@philipturner There are loops in gridInterpolateForce and gridSpreadCharge that AMD unrolls. Search for for (int ix = 0; ix < PME_ORDER; ix++) and try unrolling all three levels or the innermost two depending on how many registers are available. This may improve performance on pme benchmarks.

PME_ORDER is set to PmeOrder, which is defined as static const int PmeOrder = 5;
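
For illustration, that loop nest has roughly the following shape (a hypothetical sketch in CUDA-style syntax, not the actual gridSpreadCharge code), with the inner two levels unrolled via pragmas:

```cuda
// Sketch only: PME_ORDER^3 loop nest with the inner two levels unrolled.
#define PME_ORDER 5

__device__ float spreadSketch(const float* bx, const float* by, const float* bz) {
    float result = 0.0f;
    for (int ix = 0; ix < PME_ORDER; ix++) {
        #pragma unroll
        for (int iy = 0; iy < PME_ORDER; iy++) {
            #pragma unroll
            for (int iz = 0; iz < PME_ORDER; iz++) {
                // Stand-in for accumulating the separable B-spline weights.
                result += bx[ix] * by[iy] * bz[iz];
            }
        }
    }
    return result;
}
```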

@philipturner (Contributor, Author)

If I'm going to maintain a separate Metal backend after all, I think there's less need to integrate the entire header into OpenMM. @peastman could we just paste the declaration for simd_prefix_inclusive_sum(float) and simd_broadcast(float, ushort), and not worry about the header?

@peastman (Member)

That seems reasonable.

@philipturner (Contributor, Author) commented Mar 23, 2023

I got GPT-4's help on the ballot optimization bug. It isn't a silver bullet, but it outlined an incremental approach to test the CUDA optimization bit by bit.

https://gist.github.com/philipturner/dfe2d99a28bf5771c6bcfa589264b5b0

@bdenhollander can you verify this claim? Some of GPT-4's statements are completely false/hallucinated, but the good ones are very insightful.

“As for the CMPSELECT instruction, I have not found any evidence that the AMD GPU has a similar instruction that can perform two integer operations in one instruction. However, the AMD GPU has other instructions that can perform multiple operations in one instruction, such as V_MAD_I32_I24 or V_MAD_U64_U32. These instructions can perform two integer multiplications and one integer addition in one instruction. These instructions may have different performance characteristics than the CMPSELECT instruction, depending on the data types, the operands, and the latency.”

@bdenhollander (Contributor)

V_MAD_I32_I24 and V_MAD_U64_U32 are mentioned in all variants of the RDNA ISAs: RDNA, RDNA2, and RDNA3.

VOP3 instructions might be what you're looking for.

@philipturner (Contributor, Author) commented Mar 24, 2023

VOP3 instructions might be what you're looking for.

Apple's CMPSEL instruction takes 4 input operands, requiring a massive 80 bits to encode. For example, it is used for the max and select operations. The hardware may also use it for if (x == y) { } else { }, fusing the comparison and branch selection.

max(x, y): CMPSEL x > y ? x : y
select(x, y, a < b): CMPSEL a < b ? x : y

AMD only supports 3 inputs. Can it fuse a comparison and selection into a single cycle?


@philipturner (Contributor, Author)

GPT-4 was right. The bug stems from CUDA __ffs having very different behavior than OpenCL ctz.

The count trailing zeros and count leading zeros functions may not work as expected on OpenCL devices. These functions are also vendor-specific functions that may have different implementations or behaviors on different devices. For example, some devices may return undefined values for zero inputs or have different rounding modes for floating-point inputs. You may want to check the documentation of your device and the OpenCL specification to see if there are any differences or limitations for these functions. You may also want to test these functions with some simple inputs and outputs to see if they match your expectations.
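
Concretely, the difference is a one-based versus zero-based convention plus different zero-input behavior. A sketch of the mapping (for illustration, not the PR's code):

```cuda
// CUDA __ffs() is 1-based and returns 0 for a zero input; OpenCL/Metal ctz()
// counts trailing zeros and returns the bit width (32) for a zero input.
//
//   mask (binary)     __ffs(mask)   ctz(mask)
//   ...00001000            4            3
//   ...00000001            1            0
//   0                      0           32
//
// Emulating ctz()-style behavior in CUDA, for illustration:
__device__ int ctzLike(unsigned mask) {
    return (mask == 0u) ? 32 : (__ffs(mask) - 1);
}
// A CUDA line such as  int first = __ffs(mask) - 1;  only translates to
// int first = ctz(mask);  when mask is guaranteed to be nonzero.
```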

@philipturner (Contributor, Author) commented Mar 24, 2023

Optimization 4: ballot optimization. Look at apoa1rf, amber20-cellulose, and amber20-stmv.

ns/day OpenMM 8.0.0 Opt. 1 Opt. 2 Opt. 3 Opt. 4 Extrapolating 2080
gbsa 264 378.0 632.8 631.8 635.0 974.9
rf 153 304.5 421.5 452.8 476.3 568.5
pme 120 199.6 246.8 258.6 267.2 463.5
apoa1rf 42 114.9 129.0 138.5 151.4 149.7
apoa1pme 33 67.2 72.3 75.3 79.0 103.7
apoa1ljpme 28 50.2 53.5 55.1 56.9 83.6
amoebagk 0.8 1.184 1.187 1.185 1.189 8.512
amoebapme 2.72 3.97 4.00 4.05 4.07 7.72
amber20-dhfr 131.7 212.0 267.0 275.7 279.2 TBD
amber20-cellulose 5.92 14.36 14.78 15.69 19.93 17.72
amber20-stmv 1.40 3.83 3.88 4.29 6.89 5.23
Speed vs 8.0.0 Opt. 1 Opt. 2 Opt. 3 Opt. 4 Extrapolating 2080
gbsa +43% +140% +139% +141% +269%
rf +99% +175% +196% +211% +271%
pme +66% +106% +116% +123% +286%
apoa1rf +174% +207% +230% +260% +256%
apoa1pme +104% +119% +128% +139% +214%
apoa1ljpme +79% +91% +97% +103% +199%
amoebagk +48% +48% +48% +49% +964%
amoebapme +46% +47% +49% +50% +184%
amber20-dhfr +61% +103% +109% +112% TBD
amber20-cellulose +143% +150% +165% +237% TBD
amber20-stmv +174% +177% +206% +392% TBD

@bdenhollander (Contributor)

AMD only supports 3 inputs. Can it fuse a comparison and selection into a single cycle?

I can't find anything like that for RDNA. Pre-GCN, TeraScale/VLIW5/VLIW4 might have been able to do something similar, but only when comparing to 0.

@philipturner (Contributor, Author)

That would be a 3-input instruction, so it seems AMD doesn't have anything similar.

@philipturner (Contributor, Author)

@philipturner There are loops in gridInterpolateForce and gridSpreadCharge that AMD unrolls. Search for for (int ix = 0; ix < PME_ORDER; ix++) and try unrolling all three levels or the innermost two depending on how many registers are available. This may improve performance on pme benchmarks.

I just unrolled the innermost loop in each of these functions, but both optimizations failed.

Linked issue: Optimizing findBlocksWithInteractions for M1 (#3924)