FITS/FILM/GP-VAE fail when running on multiple CUDA devices · Issue #632 · WenjieDu/PyPOTS · GitHub

FITS/FILM/GP-VAE fail when running on multiple CUDA devices #632


Open
1 of 2 tasks
WenjieDu opened this issue Mar 13, 2025 · 2 comments
Assignees: WenjieDu
Labels: bug, help wanted

Comments

@WenjieDu
Owner
WenjieDu commented Mar 13, 2025

1. System Info

v0.11

2. Information

  • The official example scripts
  • My own created scripts

3. Reproduction

  • pypots.clustering.crli
  • pypots.imputation.usgan
  • pypots.imputation.koopa
  • pypots.imputation.film
  • pypots.imputation.gpvae
  • pypots.imputation.fits
  • pypots.forecasting.fits

4. Expected behavior

For pypots.forecasting.fits and pypots.imputation.fits we have

E       RuntimeError: Caught RuntimeError in replica 0 on device 1.
E       Original Traceback (most recent call last):
E         File "/home/wdudu/.conda/envs/ml/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in _worker
E           output = module(*input, **kwargs)
E         File "/home/wdudu/.conda/envs/ml/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
E           return self._call_impl(*args, **kwargs)
E         File "/home/wdudu/.conda/envs/ml/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
E           return forward_call(*args, **kwargs)
E         File "/home/wdudu/PyPOTS_dev/pypots/forecasting/fits/core.py", line 68, in forward
E           enc_out = self.backbone(enc_out)
E         File "/home/wdudu/.conda/envs/ml/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
E           return self._call_impl(*args, **kwargs)
E         File "/home/wdudu/.conda/envs/ml/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
E           return forward_call(*args, **kwargs)
E         File "/home/wdudu/PyPOTS_dev/pypots/nn/modules/fits/backbone.py", line 63, in forward
E           low_specxy_ = self.freq_upsampler(low_specx.permute(0, 2, 1))
E         File "/home/wdudu/.conda/envs/ml/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
E           return self._call_impl(*args, **kwargs)
E         File "/home/wdudu/.conda/envs/ml/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
E           return forward_call(*args, **kwargs)
E         File "/home/wdudu/.conda/envs/ml/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 116, in forward
E           return F.linear(input, self.weight, self.bias)
E       RuntimeError: t() expects a tensor with <= 2 dimensions, but self is 3D

For pypots.imputation.film we have

E       RuntimeError: Caught RuntimeError in replica 0 on device 1.
E       Original Traceback (most recent call last):
E         File "/home/wdudu/.conda/envs/ml/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in _worker
E           output = module(*input, **kwargs)
E         File "/home/wdudu/.conda/envs/ml/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
E           return self._call_impl(*args, **kwargs)
E         File "/home/wdudu/.conda/envs/ml/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
E           return forward_call(*args, **kwargs)
E         File "/home/wdudu/PyPOTS_dev/pypots/imputation/film/core.py", line 65, in forward
E           backbone_output = self.backbone(X_embedding)
E         File "/home/wdudu/.conda/envs/ml/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
E           return self._call_impl(*args, **kwargs)
E         File "/home/wdudu/.conda/envs/ml/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
E           return forward_call(*args, **kwargs)
E         File "/home/wdudu/PyPOTS_dev/pypots/nn/modules/film/backbone.py", line 65, in forward
E           out1 = self.spec_conv_1[i](x_in_c)
E         File "/home/wdudu/.conda/envs/ml/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
E           return self._call_impl(*args, **kwargs)
E         File "/home/wdudu/.conda/envs/ml/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
E           return forward_call(*args, **kwargs)
E         File "/home/wdudu/PyPOTS_dev/pypots/nn/modules/film/layers.py", line 128, in forward
E           out_ft[:, :, :, : self.modes2] = torch.einsum("bjix,iox->bjox", a, self.weights1)
E         File "/home/wdudu/.conda/envs/ml/lib/python3.10/site-packages/torch/functional.py", line 380, in einsum
E           return _VF.einsum(equation, operands)  # type: ignore[attr-defined]
E       RuntimeError: einsum(): the number of subscripts in the equation (3) does not match the number of dimensions (4) for operand 1 and no ellipsis was given
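The FITS and FILM tracebacks above each report a weight with exactly one more dimension than the layer expects. One plausible explanation, not confirmed in this issue, is that the weights involved are complex-valued and reach the replicas as real views, which carry a trailing dimension of size 2. The sketch below shows that shape change on CPU, together with a hypothetical layer (ComplexLinear is an illustrative name, not PyPOTS code) that stores its parameters as real tensors and rebuilds the complex view inside forward.

```python
import torch
from torch import nn

# A complex weight viewed as real gains a trailing dimension of size 2.
w = torch.randn(4, 3, dtype=torch.cfloat)
w_real = torch.view_as_real(w)
print(tuple(w.shape), tuple(w_real.shape))


class ComplexLinear(nn.Module):
    """Hypothetical complex linear layer with real-valued parameters."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        # The last dimension holds the (real, imaginary) parts.
        self.weight = nn.Parameter(torch.randn(out_features, in_features, 2) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_features, 2))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Rebuild the complex tensors at call time, so only real
        # parameters ever need to be copied between devices.
        weight = torch.view_as_complex(self.weight)
        bias = torch.view_as_complex(self.bias)
        return nn.functional.linear(x, weight, bias)


layer = ComplexLinear(in_features=3, out_features=4)
x = torch.randn(2, 5, 3, dtype=torch.cfloat)
out = layer(x)
```

Whether this pattern fits the FITS and FILM backbones would need to be checked against their actual parameter definitions and on a multi-GPU machine.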

For pypots.imputation.gpvae we have

E       RuntimeError: Caught RuntimeError in replica 1 on device 2.
E       Original Traceback (most recent call last):
E         File "/home/wdudu/.conda/envs/ml/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in _worker
E           output = module(*input, **kwargs)
E         File "/home/wdudu/.conda/envs/ml/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
E           return self._call_impl(*args, **kwargs)
E         File "/home/wdudu/.conda/envs/ml/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
E           return forward_call(*args, **kwargs)
E         File "/home/wdudu/PyPOTS_dev/pypots/imputation/gpvae/core.py", line 97, in forward
E           elbo_loss = self.backbone(X, missing_mask)
E         File "/home/wdudu/.conda/envs/ml/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
E           return self._call_impl(*args, **kwargs)
E         File "/home/wdudu/.conda/envs/ml/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
E           return forward_call(*args, **kwargs)
E         File "/home/wdudu/PyPOTS_dev/pypots/nn/modules/gpvae/backbone.py", line 157, in forward
E           self.prior = self._init_prior(device=X.device)
E         File "/home/wdudu/PyPOTS_dev/pypots/nn/modules/gpvae/backbone.py", line 137, in _init_prior
E           prior = torch.distributions.MultivariateNormal(
E         File "/home/wdudu/.conda/envs/ml/lib/python3.10/site-packages/torch/distributions/multivariate_normal.py", line 177, in __init__
E           super().__init__(batch_shape, event_shape, validate_args=validate_args)
E         File "/home/wdudu/.conda/envs/ml/lib/python3.10/site-packages/torch/distributions/distribution.py", line 66, in __init__
E           valid = constraint.check(value)
E         File "/home/wdudu/.conda/envs/ml/lib/python3.10/site-packages/torch/distributions/constraints.py", line 557, in check
E           return torch.linalg.cholesky_ex(value).info.eq(0)
E       RuntimeError: lazy wrapper should be called at most once
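The GP-VAE failure above is raised inside torch.linalg.cholesky_ex while the replicas run in parallel worker threads. A workaround sometimes used for this message is to call a linear algebra routine once in the main thread before any replica runs, so that the lazy initialization does not happen concurrently. The sketch below is an untested assumption for this issue, and warm_up_linalg is a hypothetical helper name; it falls back to the CPU when no CUDA device is visible.

```python
import torch


def warm_up_linalg(device: str = "cpu") -> bool:
    """Run one tiny Cholesky factorization on `device` in the calling thread.

    Returns True when the factorization succeeded (info == 0).
    """
    eye = torch.eye(2, device=device)
    return bool(torch.linalg.cholesky_ex(eye).info.eq(0).all())


# Warm up every visible CUDA device, or the CPU if there is none.
devices = [f"cuda:{i}" for i in range(torch.cuda.device_count())] or ["cpu"]
warmed = [warm_up_linalg(d) for d in devices]
```

The call would have to run before the model is wrapped and trained on multiple devices.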

The other models (CRLI, USGAN, and Koopa) fail with: 'DataParallel' object has no attribute 'backbone'
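This message appears because torch.nn.DataParallel keeps the wrapped model under its module attribute and does not forward other attribute lookups to it. A minimal sketch of one common way to handle this is shown below; Tiny and unwrap are hypothetical names for illustration and this is not necessarily how PyPOTS resolves it.

```python
import torch
from torch import nn


class Tiny(nn.Module):
    """Stand-in model with a `backbone` attribute, for illustration only."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(3, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.backbone(x)


def unwrap(model: nn.Module) -> nn.Module:
    """Return the underlying model whether or not it is wrapped in DataParallel."""
    return model.module if isinstance(model, nn.DataParallel) else model


plain = Tiny()
wrapped = nn.DataParallel(plain)

# Direct access fails on the wrapper, but works through `.module`.
has_direct_access = hasattr(wrapped, "backbone")
backbone = unwrap(wrapped).backbone
```

Code that reaches into submodules can call unwrap(model) first, so the same path works on one device and on several.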

@WenjieDu WenjieDu added the bug Something isn't working label Mar 13, 2025
@WenjieDu WenjieDu self-assigned this Mar 13, 2025
@WenjieDu
Owner Author

All of these models work correctly on a single GPU. If you encounter the errors above, run the model on a single GPU or on the CPU only.
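One way to apply this workaround without touching the model code is to limit which GPUs the process can see through the CUDA_VISIBLE_DEVICES environment variable. The sketch below assumes it runs before PyTorch initializes CUDA; restrict_visible_gpus is a hypothetical helper name.

```python
import os


def restrict_visible_gpus(devices: str = "0") -> str:
    """Expose only the listed GPU indices to this process.

    Pass "0" to use the first GPU only, or "" to hide all GPUs and run on CPU.
    Must be called before CUDA is initialized to take effect.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = devices
    return os.environ["CUDA_VISIBLE_DEVICES"]


visible = restrict_visible_gpus("0")
```

The same effect can be had from the shell by setting the variable when launching the script.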

@WenjieDu
Owner Author

Now that CRLI, Koopa, and USGAN are fixed in #633, I'm going to change this issue's title.

@WenjieDu WenjieDu changed the title Some models fail when running on multiple CUDA devices FITS/FILM/GP-VAE fail when running on multiple CUDA devices Mar 15, 2025
@WenjieDu WenjieDu added the help wanted Extra attention is needed label Mar 17, 2025