[Inductor] Permuted memory access pattern for tiled pointwise kernels #148718
Looking back at this PR, it seems like the strides were flipped back then as well. Maybe this shouldn't matter? It still feels like it would be more intuitive to make the trailing stride 1, though.
cc @jansel: Do you know if the data is being transposed by the triton kernel? Or am I just reading this wrong?
This looks like a bug in loop ordering to me; it is correct, but this is not a good loop order. cc @shunting314 What happens if you set this config to False: pytorch/torch/_inductor/config.py (line 156 at 17dbeb1)?
Could this be an issue specific to the block pointer implementation? If block pointers are disabled, the generated Triton kernel has the correct loop order:
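For reference, one way to run that comparison is sketched below. The flag name `triton.use_block_ptr` and the transposed operand (used to force a tiled 2D kernel) are assumptions; check `torch/_inductor/config.py` for the exact spelling in your version.

```python
import torch
import torch._inductor.config as inductor_config

def f(x, y):
    # Adding a transposed operand is one way to get a tiled (x, y) pointwise
    # kernel rather than a flattened 1D one.
    return x + y.t()

x = torch.randn(1024, 32, device="cuda")
y = torch.randn(32, 1024, device="cuda")

# Compile once with block pointers disabled; rerun without the patch to compare.
# TORCH_LOGS="output_code" prints the generated Triton source.
with inductor_config.patch({"triton.use_block_ptr": False}):
    torch.compile(f)(x, y)
```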
@shunting314 That was my thought as well, but after looking at it a little more I'm now thinking the same loop order is used for both block pointers and non-block pointers, and they're both getting flipped. There are cases where we can't emit a block pointer, and the kernel ends up mixing block and non-block pointers. Those cases seem to have compatible tensor dims, strides, etc. There's an example in https://github.com/pytorch/pytorch/issues/148725. In your example, will the pointer
@jansel I tried this out and it seems like we get the same loop order with either setting of that config. I created a branch to experiment with reversing the order returned by
Yes. I think the reason it's [XBLOCK, YBLOCK] rather than [YBLOCK, XBLOCK] is due to how we construct the index:
If we constructed it the other way, like
the tile shape would be [YBLOCK, XBLOCK]. But it does not matter for non-block pointers.
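To make the two constructions concrete, here is a tiny sketch (the block sizes and the row stride of 32 are assumed from the example in this issue) of the flat offsets each tile shape produces:

```python
import torch

XBLOCK, YBLOCK = 4, 4
row_stride = 32  # stride of the first dim of a contiguous (M, 32) tensor

# Construction used today: x is the leading tile axis, y the trailing one.
x = torch.arange(XBLOCK)[:, None]   # stride-1 dim of the tensor
y = torch.arange(YBLOCK)[None, :]   # stride-32 dim of the tensor
offsets_xy = x * 1 + y * row_stride
print(offsets_xy.flatten())  # trailing axis walks memory in steps of 32

# Swapped construction: the tile becomes [YBLOCK, XBLOCK].
y = torch.arange(YBLOCK)[:, None]
x = torch.arange(XBLOCK)[None, :]
offsets_yx = y * row_stride + x * 1
print(offsets_yx.flatten())  # trailing axis walks memory in steps of 1
```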
@shunting314 @jansel I was doubting whether the existing order is really flipped, so I decided to run some tests on handwritten triton kernels. I've attached the scripts and their output here:
Kernel 1: X first (what Inductor does today)
The first script
When I ran this script, the printed indices were discontiguous. This is essentially accessing the matrix in column-major order, although the underlying data is row-major.
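The attached script isn't reproduced here; a handwritten stand-in along the following lines (the (8, 32) shape, 4x4 tiles, and the offset-dump buffer are all assumptions for illustration) shows the access order described above:

```python
import torch
import triton
import triton.language as tl


@triton.jit
def dump_offsets_x_first(idx_ptr, N, XBLOCK: tl.constexpr, YBLOCK: tl.constexpr):
    # x indexes the stride-1 (last) dim of a contiguous (M, N) tensor,
    # y indexes the stride-N (first) dim. x is the leading tile axis.
    xindex = tl.program_id(0) * XBLOCK + tl.arange(0, XBLOCK)[:, None]
    yindex = tl.program_id(1) * YBLOCK + tl.arange(0, YBLOCK)[None, :]
    offsets = xindex + yindex * N  # [XBLOCK, YBLOCK] flat memory offsets
    # Write this program's offsets out in tile (row-major) order so the host
    # can see the order in which the lanes walk memory.
    pid = tl.program_id(1) * tl.num_programs(0) + tl.program_id(0)
    slot = tl.arange(0, XBLOCK)[:, None] * YBLOCK + tl.arange(0, YBLOCK)[None, :]
    tl.store(idx_ptr + pid * XBLOCK * YBLOCK + slot, offsets)


M, N, XBLOCK, YBLOCK = 8, 32, 4, 4
grid = (triton.cdiv(N, XBLOCK), triton.cdiv(M, YBLOCK))
idx = torch.empty(M * N, device="cuda", dtype=torch.int32)
dump_offsets_x_first[grid](idx, N, XBLOCK=XBLOCK, YBLOCK=YBLOCK)
print(idx[: XBLOCK * YBLOCK].cpu())
# tensor([ 0, 32, 64, 96,  1, 33, 65, 97,  2, 34, 66, 98,  3, 35, 67, 99], ...)
# Along the trailing tile axis the offsets jump by N=32: column-major access
# of a row-major matrix.
```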
Kernel 2: Y first
In the second file,
After swapping the slices, the indices became contiguous.
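A matching stand-in for the second file only needs the broadcast slices swapped, reusing the assumed driver from the kernel 1 sketch above:

```python
@triton.jit
def dump_offsets_y_first(idx_ptr, N, XBLOCK: tl.constexpr, YBLOCK: tl.constexpr):
    # Swapped broadcasts: the tile is [YBLOCK, XBLOCK], so x (stride 1) is now
    # the trailing tile axis.
    yindex = tl.program_id(1) * YBLOCK + tl.arange(0, YBLOCK)[:, None]
    xindex = tl.program_id(0) * XBLOCK + tl.arange(0, XBLOCK)[None, :]
    offsets = xindex + yindex * N
    pid = tl.program_id(1) * tl.num_programs(0) + tl.program_id(0)
    slot = tl.arange(0, YBLOCK)[:, None] * XBLOCK + tl.arange(0, XBLOCK)[None, :]
    tl.store(idx_ptr + pid * XBLOCK * YBLOCK + slot, offsets)

# With the same driver as above, the first program now records
# 0, 1, 2, 3, 32, 33, 34, 35, ...: contiguous runs of XBLOCK elements
# (fully contiguous rows once XBLOCK reaches the row length N).
```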
Kernel 3: X first, permuted loop order
The bad news is that changing Inductor's loop order doesn't generate the code from file 2. The third script,
In this case, the strides are permuted compared to kernel #1, but X is still the leading dimension. This gives us contiguous indices, the same as in kernel #2. But IMO the code is less intuitive to reason about than #2. Permuting the strides also seems to affect some of the other loop optimizations that Inductor does, like loop fusion. So it may be more desirable to generate #2 by changing the Triton backend rather than by swapping the loop order higher in the stack.
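A stand-in for the third script keeps x as the leading tile axis but permutes which tensor dimension each range indexes (again, names and sizes are assumptions):

```python
@triton.jit
def dump_offsets_x_first_permuted(idx_ptr, N, XBLOCK: tl.constexpr, YBLOCK: tl.constexpr):
    # x stays the leading tile axis, but the strides are permuted: x now
    # indexes the stride-N (row) dim and y the stride-1 (column) dim.
    xindex = tl.program_id(0) * XBLOCK + tl.arange(0, XBLOCK)[:, None]
    yindex = tl.program_id(1) * YBLOCK + tl.arange(0, YBLOCK)[None, :]
    offsets = xindex * N + yindex
    pid = tl.program_id(1) * tl.num_programs(0) + tl.program_id(0)
    slot = tl.arange(0, XBLOCK)[:, None] * YBLOCK + tl.arange(0, YBLOCK)[None, :]
    tl.store(idx_ptr + pid * XBLOCK * YBLOCK + slot, offsets)

# The grid is swapped relative to kernel 1, since x now walks the M rows:
# grid = (triton.cdiv(M, XBLOCK), triton.cdiv(N, YBLOCK))
# The recorded offsets for the first program are again 0, 1, 2, 3, 32, 33, ...
```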
What do you think? Do these tests seem reasonable? I also tried some performance tests with larger sizes, but the results were inconclusive. I'll keep looking to see if I can find a test case that shows a clear difference.
For kernel 1, I think the memory access pattern looks bad mainly because the block size is too small. For reasonably large block sizes, our memory access pattern should still be good.
That's a good point. Is the idea that we eventually get
As long as
Anyway, would there be any objection to swapping the memory access pattern to look like kernel 2, if it's at least perf neutral on GPU? That would work better on systems that rely on DMAs for memory accesses, where you're hoping that the memory will be fetched in large contiguous spans.
I was thinking that Triton would be able to find a more reasonable layout for non-trivial block sizes.
…re x (#149339)

# Feature

Fixes #148718 by reordering the tensor dims to `(z, y, x)`. As a bonus refactor, block pointers no longer needed the `reorder=True` argument to `self.active_range_trees()`. Since this argument is no longer used anywhere, this PR simply deletes it as opposed to updating the logic for the new iteration order.

# Perf impact

It looks like there's a decent perf bump on A100, with cudagraphs enabled. Granted, perf runs seem to have some noise between commits. ([Workflow run](https://github.com/pytorch/pytorch/actions/runs/13914815576).)

Training (all neutral or positive):


Inference (one positive, one very small negative):


As reported in #148718, this PR makes consecutive threads access consecutive memory addresses. This should theoretically give the GPU more opportunities to coalesce loads and stores. From Nvidia's [kernel profiling guide](https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html):

> Local memory is private storage for an executing thread and is not visible outside of that thread. It is intended for thread-local data like thread stacks and register spills. Local memory addresses are translated to global virtual addresses by the AGU unit. Local memory has the same latency as global memory. One difference between global and local memory is that local memory is arranged such that consecutive 32-bit words are accessed by consecutive thread IDs. Accesses are therefore fully coalesced as long as all threads in a warp access the same relative address (e.g., same index in an array variable, same member in a structure variable, etc.).

I couldn't find any information on how coalescing works for other kinds of memory, but the guide mentions it is also supported for accesses to the L2 cache.

> The L2 Request Coalescer (LRC) processes incoming requests for L2 and tries to coalesce read requests before forwarding them to the L2 cache. It also serves programmatic multicast requests from the SM and supports compression for writes.

The [answer to this Stack Overflow post](https://stackoverflow.com/a/5044424) also explains coalescing in a straightforward way. Inductor's current iteration order corresponds to the first (uncoalesced) example in that answer, while the order after this PR corresponds to the second (coalesced) example.

Besides GPUs, this order of accessing data is highly advantageous for systems relying on DMAs, as those are designed to access contiguous spans of memory. This change improves the performance of an elementwise add kernel on an internal model, using internal hardware, by 1.76x. I will share the details with reviewers who are Meta employees via a private channel.

# Test plan

- Updated expected code on CI tests.
- Added a new test checking the {x,y,z}indices and block pointers on a 3D pointwise kernel.

Pull Request resolved: #149339
Approved by: https://github.com/jansel
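For anyone who wants to eyeball the new iteration order locally, a small repro sketch follows; the transposed operand and the sizes are assumptions chosen to force a 2D tiled kernel, not taken from the PR itself.

```python
import torch

def f(a, b):
    # A transposed operand is one way to get a tiled (y, x) pointwise kernel
    # instead of a flattened 1D one.
    return a + b.t()

a = torch.randn(1024, 32, device="cuda")
b = torch.randn(32, 1024, device="cuda")

# Run with TORCH_LOGS="output_code" to print the generated Triton kernel and
# check which tile axis carries the stride-1 term after this change.
torch.compile(f)(a, b)
```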
🐛 Describe the bug
On this example program:
Inductor generates the Triton kernel:
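The generated code isn't reproduced here, but a handwritten kernel in the same spirit (shapes, block sizes, and names are assumptions, not the exact Inductor output) shows the pattern in question: a block pointer whose trailing stride is the row stride rather than 1.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def permuted_block_copy(in_ptr, out_ptr, M, N,
                        XBLOCK: tl.constexpr, YBLOCK: tl.constexpr):
    xoffset = tl.program_id(0) * XBLOCK
    yoffset = tl.program_id(1) * YBLOCK
    # The block is [XBLOCK, YBLOCK] with strides [1, N]: the trailing stride
    # is the row stride N, not 1, even though the tensor is row-major.
    src = tl.make_block_ptr(in_ptr, shape=[N, M], strides=[1, N],
                            offsets=[xoffset, yoffset],
                            block_shape=[XBLOCK, YBLOCK], order=[0, 1])
    dst = tl.make_block_ptr(out_ptr, shape=[N, M], strides=[1, N],
                            offsets=[xoffset, yoffset],
                            block_shape=[XBLOCK, YBLOCK], order=[0, 1])
    tl.store(dst, tl.load(src, boundary_check=[0, 1]), boundary_check=[0, 1])


M, N, XBLOCK, YBLOCK = 1024, 32, 32, 32
x = torch.randn(M, N, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(N, XBLOCK), triton.cdiv(M, YBLOCK))
permuted_block_copy[grid](x, out, M, N, XBLOCK=XBLOCK, YBLOCK=YBLOCK)
assert torch.equal(out, x)
```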
This is technically correct, but the shapes and strides seem permuted for the load and store. Normally, I would expect the trailing stride to be `1` for both. On certain systems, it's more expensive to load/store permuted strides because you end up doing a DMA in the normal order, then permuting on chip. I want to see if we can flip the dimensions back so `strides=[32, 1]`. Or does it even matter? Is this just a matter of convention?

Versions
Collecting environment information...
PyTorch version: 2.7.0a0+gitbd78b54
Is debug build: True
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: CentOS Stream 9 (x86_64)
GCC version: (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5)
Clang version: Could not collect
CMake version: version 3.30.4
Libc version: glibc-2.34
Python version: 3.10.14 (main, May 6 2024, 19:42:50) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-6.4.3-0_fbk14_hardened_2601_gcd42476b84e9-x86_64-with-glibc2.34
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA PG509-210
Nvidia driver version: 550.90.07
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: False
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 22
On-line CPU(s) list: 0-21
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Platinum 8339HC CPU @ 1.80GHz
CPU family: 6
Model: 85
Thread(s) per core: 1
Core(s) per socket: 22
Socket(s): 1
Stepping: 11
BogoMIPS: 3591.57
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_bf16 arat vnmi umip pku ospke avx512_vnni md_clear flush_l1d arch_capabilities
Virtualization: VT-x
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 704 KiB (22 instances)
L1i cache: 704 KiB (22 instances)
L2 cache: 88 MiB (22 instances)
L3 cache: 16 MiB (1 instance)
NUMA node(s): 1
NUMA node0 CPU(s): 0-21
Vulnerability Gather data sampling: Unknown: Dependent on hypervisor status
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Mitigation; Clear CPU buffers; SMT Host state unknown
Vulnerability Retbleed: Mitigation; Enhanced IBRS
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Vulnerable: eIBRS with unprivileged eBPF
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Mitigation; TSX disabled
Versions of relevant libraries:
[pip3] flake8==6.1.0
[pip3] flake8-bugbear==23.3.23
[pip3] flake8-comprehensions==3.15.0
[pip3] flake8-executable==2.1.3
[pip3] flake8-logging-format==0.9.0
[pip3] flake8-pyi==23.3.1
[pip3] flake8-simplify==0.19.3
[pip3] mypy==1.13.0
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.4.5.8
[pip3] nvidia-cuda-cupti-cu12==12.4.127
[pip3] nvidia-cuda-nvrtc-cu12==12.4.127
[pip3] nvidia-cuda-runtime-cu12==12.4.127
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.2.1.3
[pip3] nvidia-curand-cu12==10.3.5.147
[pip3] nvidia-cusolver-cu12==11.6.1.9
[pip3] nvidia-cusparse-cu12==12.3.1.170
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.4.127
[pip3] optree==0.13.0
[pip3] pytorch_sphinx_theme==0.0.24
[pip3] pytorch-triton==3.0.0+dedb7bdf33
[pip3] torch==2.7.0a0+gitbd78b54
[pip3] torchaudio==2.5.0a0+79047bf
[pip3] torchbench==0.1
[pip3] torchdata==0.10.0a0+c64801f
[pip3] torchtext==0.17.0a0+1d4ce73
[pip3] triton==3.1.0
[conda] blas 1.0 mkl
[conda] magma-cuda116 2.6.1 1 pytorch
[conda] magma-cuda121 2.6.1 1 pytorch
[conda] mkl 2023.1.0 h213fc3f_46344
[conda] mkl-include 2023.1.0 h06a4308_46344
[conda] mkl-service 2.4.0 py310h5eee18b_1
[conda] mkl_fft 1.3.8 py310h5eee18b_0
[conda] mkl_random 1.2.4 py310hdb19cb5_0
[conda] numpy 2.1.2 pypi_0 pypi
[conda] numpy-base 1.26.4 py310hb5e798b_0
[conda] nvidia-cublas-cu12 12.4.5.8 pypi_0 pypi
[conda] nvidia-cuda-cupti-cu12 12.4.127 pypi_0 pypi
[conda] nvidia-cuda-nvrtc-cu12 12.4.127 pypi_0 pypi
[conda] nvidia-cuda-runtime-cu12 12.4.127 pypi_0 pypi
[conda] nvidia-cudnn-cu12 9.1.0.70 pypi_0 pypi
[conda] nvidia-cufft-cu12 11.2.1.3 pypi_0 pypi
[conda] nvidia-curand-cu12 10.3.5.147 pypi_0 pypi
[conda] nvidia-cusolver-cu12 11.6.1.9 pypi_0 pypi
[conda] nvidia-cusparse-cu12 12.3.1.170 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.21.5 pypi_0 pypi
[conda] nvidia-nvjitlink-cu12 12.4.127 pypi_0 pypi
[conda] nvidia-nvtx-cu12 12.4.127 pypi_0 pypi
[conda] optree 0.13.0 pypi_0 pypi
[conda] pytorch-sphinx-theme 0.0.24 pypi_0 pypi
[conda] pytorch-triton 3.0.0+dedb7bdf33 pypi_0 pypi
[conda] torch 2.7.0a0+gitbd78b54 dev_0
[conda] torchaudio 2.5.0a0+79047bf dev_0
[conda] torchbench 0.1 pypi_0 pypi
[conda] torchdata 0.7.0a0+11bb5b8 pypi_0 pypi
[conda] torchfix 0.4.0 pypi_0 pypi
[conda] torchtext 0.17.0a0+1d4ce73 dev_0
[conda] triton 3.1.0 pypi_0 pypi
cc @chauhang @penguinwu @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @amjames @aakhundov