RFC for vector length agnostic SVE Vectorized class #153471

Open
Ryo-not-rio opened this issue May 13, 2025 · 5 comments
Labels
enhancement (Not as big of a feature, but technically not a bug. Should be easy to fix)
module: arm (Related to ARM architecture builds of PyTorch. Includes Apple M1)
module: vectorization (Related to SIMD vectorization, e.g., Vec256)
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

Ryo-not-rio (Collaborator) commented May 13, 2025

🚀 The feature, motivation and pitch

This is a proposal for a vector length agnostic Vectorized class to replace the current SVE Vectorized class. It would allow us to support all SVE vector lengths without duplicating code and without regressions relative to the current implementation.
The main idea is to replace the vector member of the Vectorized class with an array, which each op loads from and stores to. By relying on compiler optimizations to elide these intermediate loads and stores, the Vectorized class becomes vector length agnostic without any regressions.
The drawback of this approach is that size() can no longer be constexpr, since SVE vector lengths are not known at compile time. This requires rewriting a number of existing functions, but should not introduce any performance regressions.
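
A minimal sketch of the idea, with illustrative names rather than the actual at::vec API. Since SVE register types are sizeless and cannot be class data members, one workable layout (an assumption here, sized for the architectural maximum of 2048 bits) backs each value with a plain array and round-trips every op through it:

```cpp
// Sketch only: illustrative VLA Vectorized<float> using ACLE SVE intrinsics.
#include <arm_sve.h>

struct VectorizedFloatSVE {
  // Backing store sized for the largest SVE implementation (2048 bits);
  // the active vector length is discovered at runtime via svcntw().
  alignas(16) float values[2048 / 8 / sizeof(float)];

  // Number of active float lanes: a runtime value, so it cannot be constexpr.
  static int size() { return static_cast<int>(svcntw()); }

  static VectorizedFloatSVE loadu(const float* ptr) {
    VectorizedFloatSVE out;
    svbool_t pg = svptrue_b32();
    svst1_f32(pg, out.values, svld1_f32(pg, ptr));
    return out;
  }

  void store(float* ptr) const {
    svbool_t pg = svptrue_b32();
    svst1_f32(pg, ptr, svld1_f32(pg, values));
  }

  VectorizedFloatSVE operator+(const VectorizedFloatSVE& other) const {
    // Each op loads from the arrays, computes, and stores back; with
    // optimizations enabled the intermediate stores/loads should be elided.
    VectorizedFloatSVE out;
    svbool_t pg = svptrue_b32();
    svst1_f32(pg, out.values,
              svadd_f32_x(pg, svld1_f32(pg, values),
                              svld1_f32(pg, other.values)));
    return out;
  }
};
```

With optimizations enabled, the compiler can typically forward the stored register to the following load, so chained ops compile down to register arithmetic rather than real memory traffic.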

Additional context

An RFC has been created at pytorch/rfcs#73, which goes into more detail.

cc @malfet @snadampal @milpuz01 @aditew01 @nikhil-arm @fadara01

malfet added the triaged, enhancement, module: vectorization, and module: arm labels May 14, 2025
malfet (Contributor) commented May 14, 2025

The size() function, which returns the number of elements in the Vectorized class, cannot be constexpr in our implementation because SVE vector lengths are unknown at compile time. We propose changing it to be const instead of constexpr.

constexpr size() allows for numerous optimizations on the compiler side, and making it dynamic will likely result in a significant slowdown. To accept a change like that, which will affect all supported CPU architectures, a rigorous test plan is necessary, as well as a list of hardware it must be run against.
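
To illustrate the concern, here is a simplified kernel (not the actual ATen code) of the shape that benefits from a compile-time size(): with a constexpr trip count the compiler can fully unroll the loop and size stack buffers statically, neither of which works once size() becomes a runtime value.

```cpp
// Simplified illustration, not the actual ATen code.
#include <cstdint>

template <typename Vec>
void add_arrays(float* out, const float* a, const float* b, int64_t n) {
  constexpr int64_t kWidth = Vec::size();  // compile-time today
  // Main loop: with kWidth known at compile time, the compiler can unroll
  // it and keep everything in registers. Tail handling omitted for brevity.
  for (int64_t i = 0; i + kWidth <= n; i += kWidth) {
    Vec va = Vec::loadu(a + i);
    Vec vb = Vec::loadu(b + i);
    (va + vb).store(out + i);
  }
  // Patterns like `float buf[Vec::size()]` also stop compiling once size()
  // is merely const, which is part of what forces the rewrite being discussed.
}
```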

Ryo-not-rio (Collaborator, Author) commented May 15, 2025

@malfet What do you think of the following alternatives?

  1. Keep size() constexpr for the non-SVE Vectorized classes and change the affected code only for SVE. In this case we would only have to benchmark aarch64 machines.
  2. Use a hybrid approach where we decide size() at compile time. This keeps size() constexpr but cannot take advantage of SVE's runtime vector length detection (see the sketch after this list).
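
A hypothetical sketch of alternative 2. It leans on the standard __ARM_FEATURE_SVE_BITS macro, which compilers define when building with -msve-vector-bits=<N>; all other names here are illustrative:

```cpp
// Sketch only: fix the SVE vector width at compile time so size() stays
// constexpr, at the cost of SVE's runtime vector length detection.
#if defined(__ARM_FEATURE_SVE_BITS) && __ARM_FEATURE_SVE_BITS > 0
  #define VEC_SVE_BITS __ARM_FEATURE_SVE_BITS
#else
  #define VEC_SVE_BITS 256  // fall back to the current SVE256 assumption
#endif

struct VectorizedFloatSVEFixed {
  static constexpr int size() { return VEC_SVE_BITS / 32; }
  // Ops as today; the resulting binary assumes this one vector length, so
  // hardware with a different length needs a rebuild (or a separately
  // compiled kernel selected at dispatch time).
};
```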

malfet (Contributor) commented May 15, 2025
  1. Keep size() constexpr for the non-SVE Vectorized classes and change the affected code only for SVE. In this case we would only have to benchmark aarch64 machines.

If the changes can be constrained to just veclib, that sounds fine, but if the entire codebase needs to be sprinkled with #ifndef __aarch64__ then this does not sound like a good idea.

But one still needs to define benchmarks to make sure this will not cause regressions on the aarch64 platform.

  2. Use a hybrid approach where we decide size() at compile time. This keeps size() constexpr but cannot take advantage of SVE's runtime vector length detection.

Sure, we can do that, but for practical purposes it feels like it would leave us roughly where we are right now, with NEON and SVE256. But it would be great if this could be auto-detected when torch.compile is invoked.

Ryo-not-rio (Collaborator, Author) commented May 15, 2025

Sure, we can do that, but for practical purposes it feels like it would leave us roughly where we are right now, with NEON and SVE256.

The code size will stay basically the same as now, apart from adding vector length detection for 128- and 512-bit. It will offer significant performance boosts to any functions using the exponential function, as exp is accelerated on SVE but not on NEON.
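
For context on the exp speedup, here is a sketch of how an SVE path can dispatch to SLEEF's vector-length-agnostic kernels. This assumes SLEEF's Sleef_expfx_u10sve (its VLA single-precision exp); the loop structure is illustrative, not PyTorch's actual kernel:

```cpp
// Sketch only: VLA exp over a buffer via SLEEF's SVE kernel.
#include <arm_sve.h>
#include <sleef.h>
#include <cstdint>

void exp_inplace(float* data, int64_t n) {
  const int64_t step = svcntw();  // float lanes per iteration, at runtime
  svbool_t pg = svptrue_b32();
  int64_t i = 0;
  for (; i + step <= n; i += step) {
    svst1_f32(pg, data + i, Sleef_expfx_u10sve(svld1_f32(pg, data + i)));
  }
  // Predicated tail: process the remaining n - i elements in one pass.
  svbool_t tail = svwhilelt_b32(static_cast<int32_t>(i), static_cast<int32_t>(n));
  svst1_f32(tail, data + i, Sleef_expfx_u10sve(svld1_f32(tail, data + i)));
}
```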

aditew01 (Collaborator) commented

cc: @maajidkhann
