[RFC] CUDA support matrix for Release 2.4 · Issue #123456 · pytorch/pytorch · GitHub

[RFC] CUDA support matrix for Release 2.4 #123456


Closed
atalman opened this issue Apr 5, 2024 · 15 comments
Labels
module: binaries - Anything related to official binaries that we release to users
module: cuda - Related to torch.cuda, and CUDA support in general
triaged - This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@atalman
Contributor
atalman commented Apr 5, 2024

🚀 [RFC] CUDA support matrix for Release 2.4

Opening this RFC to discuss CUDA version support for future PyTorch releases:

Option 1 - CUDA 11 and CUDA 12:
CUDA 11.8, cuDNN 8.9.7.29
CUDA 12.4, cuDNN 8.9.7.29 - version hosted on PyPI

Option 2 - CUDA 12 only:
CUDA 12.1, cuDNN 8.9.7.29
CUDA 12.4, cuDNN 8.9.7.29 - version hosted on PyPI

Option 3:
CUDA 11.8, cuDNN 8.9.7.29
CUDA 12.1, cuDNN 8.9.7.29 - version hosted on PyPI, as stable
CUDA 12.4, cuDNN 8.9.7.29 - experimental version

(Please note the cuDNN version listed here, 8.9.7.29, is not final; we may upgrade it for the 2.4 release.)

One advantage of Option 1 is that older CUDA drivers are not compatible with CUDA 12, so users with older drivers can still benefit from the latest PyTorch.

Please refer to:
https://docs.nvidia.com/deploy/cuda-compatibility/index.html#minor-version-compatibility

CUDA Toolkit    Linux x86_64 Minimum Required Driver    Windows Minimum Required Driver
CUDA 12.x       >= 525.60.13                            >= 527.41
CUDA 11.x       >= 450.80.02*                           >= 452.39*
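The minimum-driver table above can be turned into a quick compatibility check. A minimal sketch (the version constants come from the Linux x86_64 column of the table; the helper functions are hypothetical, not part of PyTorch):

```python
# Minimal sketch: check whether an installed NVIDIA driver meets the
# minimum required for a given CUDA toolkit line (Linux x86_64 values
# from the compatibility table above).
MIN_LINUX_DRIVER = {
    "12": (525, 60, 13),
    "11": (450, 80, 2),
}

def parse_version(v: str) -> tuple:
    """Turn a driver string like '535.104.05' into a comparable tuple."""
    return tuple(int(part) for part in v.split("."))

def driver_supports(cuda_major: str, driver_version: str) -> bool:
    """True if the driver meets the minimum for this CUDA major version."""
    return parse_version(driver_version) >= MIN_LINUX_DRIVER[cuda_major]

# Example: a 470-series driver is fine for CUDA 11 but not for CUDA 12,
# which is the population Option 1 keeps covered.
assert driver_supports("11", "470.82.01")
assert not driver_supports("12", "470.82.01")
```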

cc @seemethere @malfet @osalpekar @ptrblck @ezyang

@bghira
bghira commented Apr 5, 2024

12.x has inexplicably higher memory use than 11.8 when training 2D conditional UNet models.

@ptrblck
Collaborator
ptrblck commented Apr 5, 2024

@bghira Could you add a link to the corresponding issue, please?

@ptrblck
Collaborator
ptrblck commented Apr 5, 2024

One advantage of Option 1 is that older CUDA drivers are not compatible with CUDA 12, so users with older drivers can still benefit from the latest PyTorch.

@atalman Agreed, but CUDA 12.x has now been out for over a year, and we have also been providing PyTorch binaries with CUDA 12 for over a year.
It would be interesting to see some stats on how many CUDA 11 vs. 12 downloads we have, to be able to deprecate older CUDA versions.
Additionally, we should discuss the compute capability requirements, since e.g. sm_37 is dropped from the CUDA 12.x builds.
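The compute-capability concern can be made concrete with a small guard. A minimal sketch (the thread above confirms that sm_37 is dropped from the CUDA 12.x builds; the exact per-build cutoffs below are illustrative assumptions, not PyTorch's actual build matrix):

```python
# Minimal sketch: flag devices whose compute capability falls below the
# oldest architecture a given binary build targets. The cutoff for the
# 12.1 build is an assumption for illustration; the thread only states
# that sm_37 (Kepler, e.g. Tesla K80) is dropped from CUDA 12.x builds.
OLDEST_SUPPORTED = {
    "11.8": (3, 7),   # sm_37 still built into the CUDA 11.8 binaries
    "12.1": (5, 0),   # assumed cutoff once Kepler support is dropped
}

def capability_supported(cuda_version: str, capability: tuple) -> bool:
    """True if a device of this compute capability is covered by the build."""
    return capability >= OLDEST_SUPPORTED[cuda_version]

assert capability_supported("11.8", (3, 7))      # K80 works on 11.8 builds
assert not capability_supported("12.1", (3, 7))  # dropped in 12.x builds
```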

@bghira
bghira commented Apr 5, 2024

@ptrblck this is internal research that my group worked on, and we never filed an issue as it was unclear which level the issue was introduced in, and we didn't have the resources to dig into that. i can say that the vast (lol, pun) majority of cloud instances/containers/kernels the users will have access to presently will be limited to CUDA 11.8 - it's convenient enough that making 12.1 the minimum feels premature, despite how long that's been available.

making ROCm 6 the minimum made sense, because everything about ROCm 5.x was awful, other than the fact that it supported a few more GPUs than 6 does. but CUDA 11.8 was very mature and isn't showing its age yet.

@ptrblck
Collaborator
ptrblck commented Apr 5, 2024

this is internal research that my group worked on, and we never filed an issue as it was unclear which level the issue was introduced in, and we didn't have the resources to dig into that.

In this case the claim is not actionable, and since we have already been using CUDA 12.1 in the default PyTorch binary (installable via pip install torch) for some time, I highly doubt the increase in memory is a valid observation.

i can say that the vast (lol, pun) majority of cloud instances/containers/kernels the users will have access to presently will be limited to CUDA 11.8

Could you share any information here too?

making ROCm 6 the minimum made sense,...

This RFC focuses on CUDA and we should not discuss rocm here.

@bghira
bghira commented Apr 5, 2024

In this case the claim is not actionable, and since we have already been using CUDA 12.1 in the default PyTorch binary (installable via pip install torch) for some time, I highly doubt the increase in memory is a valid observation.

i don't think anyone even installs torch that way due to the high probability of issues.

the download page for torch "builds" a command for people to use, which ends up adding the index-url option to point me to 11.8, which is the only way to install it successfully on most container hosts i've used.

i guess these claims aren't enough to go on, and the newer version will just remain inaccessible for a while.
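For context, the pytorch.org selector mentioned above generates commands of roughly this shape (the index URLs below are the ones used for the 2.x releases under discussion; the exact wheel index for a given build may differ):

```shell
# Default install pulls the wheel hosted on PyPI
# (built with CUDA 12.1 at the time of this thread):
pip install torch

# Selector-generated command pointing at the CUDA 11.8 wheel index:
pip install torch --index-url https://download.pytorch.org/whl/cu118
```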

@ptrblck
Collaborator
ptrblck commented Apr 5, 2024

i don't think anyone even installs torch that way due to the high probability of issues.

I think ~25 million downloads/month as of now don't confirm your claim: https://pypistats.org/packages/torch

In any case, if you have any valid issues, please let us know and we are happy to follow up!

@malfet malfet added the module: binaries, module: cuda, and triaged labels Apr 5, 2024
@bghira
bghira commented Apr 5, 2024

I think ~25 million downloads/month as of now don't confirm your claim: https://pypistats.org/packages/torch

i don't think those stats confirm yours.

  • how many are CI systems that just download it automatically to build?
  • how many are users that are downloading, redownloading, hoping to find a working version combination?
  • how many are using CUDA at all, vs just CPU?

@ptrblck
Collaborator
ptrblck commented Apr 6, 2024

@bghira Again, if you have concrete issues, please create separate issues for them and we are happy to help.
So far you haven't shared anything beyond unverified claims about functionality issues, users' behavior, and cloud setups, which I see as noise in this thread.

@bghira
bghira commented Apr 6, 2024

just because you don't like them doesn't make them invalid. i have trouble understanding why an nvidia representative is being so difficult about keeping support for CUDA 11.8 in a future pytorch release, which is entirely what i am here advocating for.

your approach essentially comes across as if CUDA 12.1 is going to be the default unless someone provides really good reasons why it shouldn't be. i thought "we can't use 12.1" would work. this isn't where issues with CUDA 12.1 get reported.

maybe someone else should be handling this ticket, since you are too personally involved. can @atalman be the one to respond from now on? thank you

@ptrblck
Collaborator
ptrblck commented Apr 6, 2024

just because you don't like them doesn't make them invalid. i have trouble understanding why an nvidia representative is being so difficult about keeping support for CUDA 11.8 in a future pytorch release, which is entirely what i am here advocating for.

You are misunderstanding my posts, since I asked about concrete issues to follow up on in my very first response. Speculation just derails this tracking issue and is not helpful. I also have no trouble keeping the PyTorch + CUDA 11.8 binaries alive longer; I even raised the concern about dropping compute capabilities.

your approach is essentially coming across as if CUDA 12.1 is going to be the default unless ...

It is already the default installable via pip install torch, so I'm not concerned about it.

This will be my last response to you, @bghira, since you are still derailing this topic without any actionable items.

@ptrblck
Collaborator
ptrblck commented Apr 6, 2024

@atalman For option 3:

  • Would it be possible to add a UserWarning explaining the future deprecation of CUDA 11 builds and the need to update to a newer driver?
  • We should check if the experimental CUDA 12.4 binaries should ship with cuDNN 9.x.
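The first bullet could look something like the following in the binary's CUDA initialization path. A minimal sketch (the version check, function name, and message wording are assumptions for illustration, not actual PyTorch code):

```python
import warnings

# Minimal sketch of the proposed deprecation notice for CUDA 11 builds.
# The helper and message wording are hypothetical, not PyTorch's code.
def warn_if_cuda11(compiled_cuda_version: str) -> None:
    """Emit a UserWarning when running a binary built against CUDA 11.x."""
    major = int(compiled_cuda_version.split(".")[0])
    if major < 12:
        warnings.warn(
            "PyTorch binaries built with CUDA 11 are deprecated and will be "
            "removed in a future release; please update your NVIDIA driver "
            "and move to a CUDA 12 build.",
            UserWarning,
        )

# Only the CUDA 11.x build triggers the warning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warn_if_cuda11("11.8")
    warn_if_cuda11("12.4")  # no warning for CUDA 12 builds
```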

@Godricly

Will NCCL be upgraded to 2.21.5 if the CUDA version is 12.4?

@atalman atalman moved this to Cold Storage in PyTorch OSS Release Engineering May 9, 2024
@bhack
Contributor
bhack commented May 15, 2024

@atalman
Contributor Author
atalman commented Aug 20, 2024

Closing this one since the 2.4 release is complete.
