[RFC] CUDA support matrix for Release 2.4 · Issue #123456 · pytorch/pytorch · GitHub

[RFC] CUDA support matrix for Release 2.4 #123456


Closed
atalman opened this issue Apr 5, 2024 · 15 comments
Labels
module: binaries - Anything related to official binaries that we release to users
module: cuda - Related to torch.cuda, and CUDA support in general
triaged - This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@atalman
Contributor
atalman commented Apr 5, 2024

🚀 [RFC] CUDA support matrix for Release 2.4

Opening this RFC to discuss CUDA version support for future PyTorch releases:

Option 1 - CUDA 11 and CUDA 12:
CUDA 11.8, cuDNN 8.9.7.29
CUDA 12.4, cuDNN 8.9.7.29 - version hosted on PyPI

Option 2 - CUDA 12 only:
CUDA 12.1, cuDNN 8.9.7.29
CUDA 12.4, cuDNN 8.9.7.29 - version hosted on PyPI

Option 3:
CUDA 11.8, cuDNN 8.9.7.29
CUDA 12.1, cuDNN 8.9.7.29 - version hosted on PyPI, as stable
CUDA 12.4, cuDNN 8.9.7.29 - experimental version

(Please note the cuDNN version listed here, 8.9.7.29, is not final; we may upgrade it for the 2.4 release.)

One advantage of Option 1 is that older CUDA drivers are not compatible with CUDA 12, so users with older drivers can still benefit from the latest PyTorch.

Please refer to:
https://docs.nvidia.com/deploy/cuda-compatibility/index.html#minor-version-compatibility

CUDA Toolkit    Linux x86_64 Minimum Required Driver    Windows Minimum Required Driver
CUDA 12.x       >= 525.60.13                            >= 527.41
CUDA 11.x       >= 450.80.02*                           >= 452.39*
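The minimum-driver table above can be turned into a quick compatibility check. A minimal sketch (the version constants come from the Linux x86_64 column of the table; the helper functions are hypothetical, not part of PyTorch):

```python
# Minimal sketch: check whether an installed NVIDIA driver meets the
# minimum required for a given CUDA toolkit line (Linux x86_64 values
# from the compatibility table above).
MIN_LINUX_DRIVER = {
    "12": (525, 60, 13),
    "11": (450, 80, 2),
}

def parse_version(v: str) -> tuple:
    """Turn a driver string like '535.104.05' into a comparable tuple."""
    return tuple(int(part) for part in v.split("."))

def driver_supports(cuda_major: str, driver_version: str) -> bool:
    """True if the driver meets the minimum for this CUDA major version."""
    return parse_version(driver_version) >= MIN_LINUX_DRIVER[cuda_major]

# Example: a 470-series driver is fine for CUDA 11 but not for CUDA 12,
# which is the population Option 1 keeps covered.
assert driver_supports("11", "470.82.01")
assert not driver_supports("12", "470.82.01")
```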

cc @seemethere @malfet @osalpekar @ptrblck @ezyang

@bghira
bghira commented Apr 5, 2024

12.x has inexplicably higher memory use than 11.8 when training 2D conditional UNet models.

@ptrblck
Collaborator
ptrblck commented Apr 5, 2024

@bghira Could you add a link to the corresponding issue, please?

@ptrblck
Collaborator
ptrblck commented Apr 5, 2024

One advantage of Option 1 is that older CUDA drivers are not compatible with CUDA 12, so users with older drivers can still benefit from the latest PyTorch.

@atalman Agreed, but CUDA 12.x has now been out for over a year, and we have also been providing PyTorch binaries with CUDA 12 for over a year.
It would be interesting to see some stats on how many CUDA 11 vs. 12 downloads we have, to be able to deprecate older CUDA versions.
Additionally, we should discuss the compute capability requirements, since e.g. sm_37 is dropped from the CUDA 12.x builds.
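The compute-capability concern can be made concrete with a small guard. A minimal sketch (the thread above confirms that sm_37 is dropped from the CUDA 12.x builds; the exact per-build cutoffs below are illustrative assumptions, not PyTorch's actual build matrix):

```python
# Minimal sketch: flag devices whose compute capability falls below the
# oldest architecture a given binary build targets. The cutoff for the
# 12.1 build is an assumption for illustration; the thread only states
# that sm_37 (Kepler, e.g. Tesla K80) is dropped from CUDA 12.x builds.
OLDEST_SUPPORTED = {
    "11.8": (3, 7),   # sm_37 still built into the CUDA 11.8 binaries
    "12.1": (5, 0),   # assumed cutoff once Kepler support is dropped
}

def capability_supported(cuda_version: str, capability: tuple) -> bool:
    """True if a device of this compute capability is covered by the build."""
    return capability >= OLDEST_SUPPORTED[cuda_version]

assert capability_supported("11.8", (3, 7))      # K80 works on 11.8 builds
assert not capability_supported("12.1", (3, 7))  # dropped in 12.x builds
```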

@bghira
bghira commented Apr 5, 2024

@ptrblck this is internal research that my group worked on, and we never filed an issue as it was unclear which level the issue was introduced in, and we didn't have the resources to dig into that. i can say that the vast (lol, pun) majority of cloud instances/containers/kernels the users will have access to presently will be limited to CUDA 11.8 - it's convenient enough that making 12.1 the minimum feels premature, despite how long that's been available.

making ROCm 6 the minimum made sense, because everything about ROCm 5.x was awful, other than the fact that it supported a few more GPUs than 6 does. but CUDA 11.8 was very mature and isn't showing its age yet.

@ptrblck
Collaborator
ptrblck commented Apr 5, 2024

this is internal research that my group worked on, and we never filed an issue as it was unclear which level the issue was introduced in, and we didn't have the resources to dig into that.

In this case the claim is not actionable, and since we have already been using CUDA 12.1 in the default PyTorch binary (installable via pip install torch) for some time, I highly doubt the increase in memory is a valid observation.

i can say that the vast (lol, pun) majority of cloud instances/containers/kernels the users will have access to presently will be limited to CUDA 11.8

Could you share any information here too?

making ROCm 6 the minimum made sense,...

This RFC focuses on CUDA and we should not discuss rocm here.

@bghira
bghira commented Apr 5, 2024

In this case the claim is not actionable, and since we have already been using CUDA 12.1 in the default PyTorch binary (installable via pip install torch) for some time, I highly doubt the increase in memory is a valid observation.

i don't think anyone even installs torch that way due to the high probability of issues.

the download page for torch "builds" a command for people to use, which ends up adding the index-url option to point me to 11.8, which is the only way to install it successfully on most container hosts i've used.

i guess these claims aren't enough to go on, and the newer version will just remain inaccessible for a while.
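For context, the pytorch.org selector mentioned above generates commands of roughly this shape (the index URLs below are the ones used for the 2.x releases under discussion; the exact wheel index for a given build may differ):

```shell
# Default install pulls the wheel hosted on PyPI
# (built with CUDA 12.1 at the time of this thread):
pip install torch

# Selector-generated command pointing at the CUDA 11.8 wheel index:
pip install torch --index-url https://download.pytorch.org/whl/cu118
```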

@ptrblck
Collaborator
ptrblck commented Apr 5, 2024

i don't think anyone even installs torch that way due to the high probability of issues.

I think ~25 million downloads/month as of now don't confirm your claim: https://pypistats.org/packages/torch

In any case, if you have any valid issues, please let us know and we are happy to follow up!

@malfet malfet added the module: binaries, module: cuda, and triaged labels Apr 5, 2024
@bghira
bghira commented Apr 5, 2024

I think ~25 million downloads/month as of now don't confirm your claim: https://pypistats.org/packages/torch

i don't think those stats confirm yours.

  • how many are CI systems that just download it automatically to build?
  • how many are users that are downloading, redownloading, hoping to find a working version combination?
  • how many are using CUDA at all, vs just CPU?

@ptrblck
Collaborator
ptrblck commented Apr 6, 2024

@bghira Again, if you have concrete issues, please create separate issues for them and we are happy to help.
So far you haven't shared anything beyond unverified claims about functionality issues, users' behavior, and cloud setups, which I see as noise in this thread.

@bghira
bghira commented Apr 6, 2024

just because you don't like them doesn't make them invalid. i have trouble understanding why an nvidia representative is being so difficult about keeping support for CUDA 11.8 in a future pytorch release, which is entirely what i am here advocating for.

your approach essentially comes across as if CUDA 12.1 is going to be the default unless someone provides really good reasons why it shouldn't be. i thought "we can't use 12.1" would work. this isn't where issues with CUDA 12.1 get reported.

maybe someone else should be handling this ticket, since you are too personally involved. can @atalman be the one to respond from now on? thank you

@ptrblck
Collaborator
ptrblck commented Apr 6, 2024

just because you don't like them doesn't make them invalid. i have trouble understanding why an nvidia representative is being so difficult about keeping support for CUDA 11.8 in a future pytorch release, which is entirely what i am here advocating for.

You are misunderstanding my posts, since I asked about concrete issues to follow up on in my very first response. Speculation just derails this tracking issue and is not helpful. I also have no trouble keeping the PyTorch + CUDA 11.8 binaries alive longer; I even raised the concern about dropping compute capabilities.

your approach is essentially coming across as if CUDA 12.1 is going to be the default unless ...

It is already the default installable via pip install torch, so I'm not concerned about it.

This will be my last response to you, @bghira, since you are still derailing this topic without any actionable items.

@ptrblck
Collaborator
ptrblck commented Apr 6, 2024

@atalman For option 3:

  • Would it be possible to add a UserWarning explaining the future deprecation of CUDA 11 builds and the need to update to a newer driver?
  • We should check if the experimental CUDA 12.4 binaries should ship with cuDNN 9.x.
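The first bullet could look something like the following in the binary's CUDA initialization path. A minimal sketch (the version check, function name, and message wording are assumptions for illustration, not actual PyTorch code):

```python
import warnings

# Minimal sketch of the proposed deprecation notice for CUDA 11 builds.
# The helper and message wording are hypothetical, not PyTorch's code.
def warn_if_cuda11(compiled_cuda_version: str) -> None:
    """Emit a UserWarning when running a binary built against CUDA 11.x."""
    major = int(compiled_cuda_version.split(".")[0])
    if major < 12:
        warnings.warn(
            "PyTorch binaries built with CUDA 11 are deprecated and will be "
            "removed in a future release; please update your NVIDIA driver "
            "and move to a CUDA 12 build.",
            UserWarning,
        )

# Only the CUDA 11.x build triggers the warning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warn_if_cuda11("11.8")
    warn_if_cuda11("12.4")  # no warning for CUDA 12 builds
```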

@Godricly

Will NCCL be upgraded to 2.21.5 if the CUDA version is 12.4?

@atalman atalman moved this to Cold Storage in PyTorch OSS Release Engineering May 9, 2024
@bhack
Contributor
bhack commented May 15, 2024

@atalman
Contributor Author
atalman commented Aug 20, 2024

Closing this one since the 2.4 release is complete.
