[macOS GPU Support] Optimize findBlocksWithInteractions for Apple silicon GPUs #3959
Conversation
Here are the speed differences I see on the benchmarks with this change.

All differences are within the noise level. I'm not sure it's worth adding complexity to the code for a negligible speedup.
Could you rerun …? We could also spot other loops in the code base that would benefit from the optimization. Then the speedup might reach 2% or 3%.
Let's see if we can do that. I'm just hesitant to add complexity to an already complex kernel for a 1% speedup on a subset of benchmarks. Especially since it only happens on macOS, and will likely become unnecessary in the future. Loop unrolling is a standard optimization most compilers do. It's surprising that Apple's doesn't, and it's likely they'll add it in the future.
I've seen instances where they do; it just unrolls a small amount, and you can't control it. It might unroll 2-4 small iterations, but not 32. Probably a heuristic to account for the M1's comparatively small L1I.
And I'm guessing they don't support …
That doesn't do anything. I remember trying it here and saw no change. The repo provides a very simple way to test how something affects this kernel's performance.
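For context, this is the kind of hint being discussed; a minimal CUDA sketch (the kernel and its body are hypothetical, purely for illustration):

```cuda
// On CUDA, #pragma unroll fully unrolls a loop with a known trip
// count; the thread above reports that the equivalent hint has no
// effect under Apple's Metal/AIR compiler.
__global__ void scaleRow(float* data) {
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * 32;
    #pragma unroll
    for (int i = 0; i < 32; i++)
        data[base + i] *= 2.0f;  // becomes 32 straight-line multiplies
}
```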
Try unrolling some of the … loops.
That would probably harm performance, as the loops are too large. Perhaps force-unrolling up to four iterations might allow more ILP, but we can't overflow the instruction cache.
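A hedged sketch of what force-unrolling up to four iterations could look like; the reduction below is hypothetical, but it shows why partial unrolling buys ILP: the four accumulators carry no dependency on each other, so their additions can issue in parallel.

```cuda
// Partial 4x unroll with independent accumulators: more
// instruction-level parallelism than one serial accumulator,
// while staying far smaller than a full 32x unroll.
__device__ float sum32(const float* vals) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    for (int i = 0; i < 32; i += 4) {
        s0 += vals[i];
        s1 += vals[i + 1];
        s2 += vals[i + 2];
        s3 += vals[i + 3];
    }
    return (s0 + s1) + (s2 + s3);  // combine the partial sums at the end
}
```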
Ultimately the purpose is to try to implement the optimizations from #3924. They can only be fully realized with Metal, creating a 3% speedup (no …).
Converted to draft because of #3979.
I just reached 129 -> 138 ns/day on apoa1rf; see philipturner/openmm-metal@830c2e8. The matrix multiply optimization harmed performance, likely because there was already so much …

Before re-opening this PR, I'll need to split the Metal header incorporation into another PR. I need to figure out a way to minimize compiler overhead and license the external dependency properly. Can we build a …?

And for the subsequent PR, how about translating the CUDA findBlocksWithInteractions implementation into the common compute language? That would reduce the headache of maintaining two different OpenCL versions.
Don't worry about the …

The biggest optimization is that we can use a …
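Judging from the later comments, the optimization being referred to presumably centers on a warp ballot. A minimal CUDA sketch of the general pattern (hypothetical, not OpenMM's actual kernel code):

```cuda
// Each lane evaluates its own predicate; __ballot_sync packs all 32
// predicates into one 32-bit mask in a single instruction, avoiding
// a round trip through shared memory for per-thread flags.
__device__ unsigned int interactionMask(float r2, float cutoff2) {
    bool interacts = (r2 < cutoff2);
    return __ballot_sync(0xffffffff, interacts);  // bit i = lane i's result
}
```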
I tried that, and now half of the tests are failing. The …
Even without the ballot optimization, there are some impressive improvements. I still need to tune the force thread blocks for AMOEBA, as @bdenhollander suggested on the Mac AMD thread. Going by the RTX 4080 for comparison, we can reach 141 ns/day on M1 Max.
Optimization 3 does not change power draw, and it improves everything except AMOEBA, GBSA, and GK.
@philipturner There are loops in …
If I'm going to maintain a separate Metal backend after all, I think there's less need to integrate the entire header into OpenMM. @peastman, could we just paste in the declaration for …?
That seems reasonable. |
I got GPT-4's help on the ballot optimization bug. It isn't a silver bullet, but it outlined an incremental approach to test the CUDA optimization bit by bit: https://gist.github.com/philipturner/dfe2d99a28bf5771c6bcfa589264b5b0

@bdenhollander, can you verify this statement? Some of its statements are completely false/hallucinated, but the good ones are very insightful.
Apple's CMPSEL instruction takes 4 input operands, a massive 80 bits to encode. For example, it is used for the …
AMD only supports 3 inputs. Can it fuse a comparison and selection into a single cycle?
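For reference, the four-operand pattern under discussion is a compare feeding a select, `(a < b) ? c : d`. A trivial CUDA sketch (the function is illustrative only):

```cuda
// On hardware with a 4-input CMPSEL this lowers to one fused
// instruction; on a 3-input ISA it needs a separate compare
// followed by a select.
__device__ float cmpsel(float a, float b, float c, float d) {
    return (a < b) ? c : d;
}
```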
GPT-4 was right. The bug stems from CUDA …
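Since the exact bug isn't quoted above, here is a hedged illustration of the best-known CUDA ballot pitfall, which may or may not be the one meant: every lane named in the mask must actually reach the call.

```cuda
// Calling __ballot_sync(0xffffffff, ...) inside a divergent branch
// is undefined, because masked-off lanes never arrive at the sync
// point. Querying the active lanes first avoids that.
__device__ unsigned int safeBallot(bool pred) {
    unsigned int active = __activemask();   // lanes currently executing
    return __ballot_sync(active, pred);     // ballot only over those lanes
}
```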
Optimization 4: the ballot optimization. Look at …
I can't find anything like that for RDNA. Pre-GCN, TeraScale/VLIW5/VLIW4 might have been able to do something similar, but only when comparing to 0.
That would be a 3-input instruction, so it seems AMD doesn't have anything similar.
I just unrolled the innermost loop of each of these functions, but both optimizations failed.
The Apple Metal/AIR backend compiler does not automatically unroll loops, even when attempting to force-unroll them. This should speed up the kernel itself by just under 19%, which translates to a speedup on the order of 1% in some end-to-end benchmarks.
Resolves #3924