
Forward-over-reverse gradgradcheck fails on CUDA for div.floor_rounding #69913

Closed
soulitzer opened this issue Dec 14, 2021 · 5 comments
Labels
high priority · module: autograd (Related to torch.autograd, and the autograd engine in general) · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

@soulitzer
Contributor
soulitzer commented Dec 14, 2021

🐛 Describe the bug

At the time of posting this issue, you need to check out #69740 to replicate.

I also preemptively skipped the same test for floor_rounding and trunc_rounding variants because there seemed to be related skips for forward mode AD already.
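
For reference, here is a minimal sketch of the kind of check the failing test performs. This is not the exact test: the real test uses OpInfo-generated complex128 sample inputs, while this sketch uses simple double-precision tensors, assumes a CUDA device is available, and assumes the check_fwd_over_rev flag that appears in the traceback below.

import torch
from torch.autograd import gradgradcheck

def fn(a, b):
    return torch.div(a, b, rounding_mode='floor')

# Stand-in inputs; the failing test uses complex128 OpInfo samples instead.
a = torch.randn(3, 3, dtype=torch.double, device='cuda', requires_grad=True)
b = (torch.rand(3, 3, dtype=torch.double, device='cuda') + 1.0).requires_grad_()  # keep the divisor away from zero

# Forward-over-reverse: forward-mode AD is checked over the backward graph,
# which is the check_fwd_over_rev path visible in the traceback below.
gradgradcheck(fn, (a, b), check_fwd_over_rev=True, check_rev_over_rev=False, fast_mode=True)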

The stack trace is probably not pointing at the real failure, as the error message itself suggests:

 ======================================================================
ERROR [0.131s]: test_fn_fwgrad_bwgrad_div_floor_rounding_cuda_complex128 (__main__.TestGradientsCUDA)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1482, in wrapper
    method(*args, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 381, in instantiated_test
    raise rte
  File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 376, in instantiated_test
    result = test(self, **param_kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 753, in test_wrapper
    return test(*args, **kwargs)
  File "test_ops.py", line 827, in test_fn_fwgrad_bwgrad
    self._check_helper(device, dtype, op, op.get_op(), "fwgrad_bwgrad")
  File "test_ops.py", line 776, in _check_helper
    self.assertTrue(gradgradcheck(fn, gradcheck_args, **kwargs))
  File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 2893, in gradgradcheck
    return torch.autograd.gradgradcheck(fn, inputs, grad_outputs, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/autograd/gradcheck.py", line 1548, in gradgradcheck
    check_forward_ad=check_fwd_over_rev, check_backward_ad=check_rev_over_rev)
  File "/opt/conda/lib/python3.6/site-packages/torch/autograd/gradcheck.py", line 1398, in gradcheck
    return _gradcheck_helper(**args)
  File "/opt/conda/lib/python3.6/site-packages/torch/autograd/gradcheck.py", line 1415, in _gradcheck_helper
    check_undefined_grad=check_undefined_grad)
  File "/opt/conda/lib/python3.6/site-packages/torch/autograd/gradcheck.py", line 1070, in _gradcheck_real_imag
    complex_indices=complex_inp_indices, test_imag=True, use_forward_ad=True)
  File "/opt/conda/lib/python3.6/site-packages/torch/autograd/gradcheck.py", line 1290, in _fast_gradcheck
    inputs, outputs, func, all_v, all_u, rtol, atol, test_imag, is_forward_ad=use_forward_ad)
  File "/opt/conda/lib/python3.6/site-packages/torch/autograd/gradcheck.py", line 1259, in _check_analytical_numerical_equal
    updated_atol = _adjusted_atol(atol, all_u[i], all_v[j] if all_v else None)
  File "/opt/conda/lib/python3.6/site-packages/torch/autograd/gradcheck.py", line 1173, in _adjusted_atol
    return atol * float(sum_u) * float(sum_v)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
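
As the message says, the reported Python frames may not correspond to the failing kernel because CUDA errors are raised asynchronously. A sketch of forcing synchronous launches follows; exporting the variable in the shell before launching Python is the most reliable approach, and setting it from Python is assumed to work only because nothing has touched CUDA yet.

import os
# CUDA_LAUNCH_BLOCKING is read when the CUDA context is first created, so it
# must be set before any CUDA work happens in the process.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # importing torch does not by itself initialize CUDA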

Versions

At the time of posting this issue, you need to check out #69740 to replicate; otherwise, use the main branch.

cc @ezyang @gchanan @zou3519 @albanD @gqchen @pearu @nikitaved @soulitzer @lezcano @Varal7

@soulitzer soulitzer added the module: autograd and triaged labels Dec 14, 2021
@zou3519
Contributor
zou3519 commented Dec 14, 2021

We're getting this in functorch as well, FWIW. A bunch of division-related operations are failing with this assert:
https://github.com/pytorch/functorch/blob/8a60047c72c15b36ecf5a47c76b39bc82135f244/test/test_ops.py#L572-L576

Do you know why the batched-forward gradcheck didn't catch this in PyTorch?

@soulitzer
Contributor Author
soulitzer commented Dec 14, 2021

Yeah, it's already being skipped for forward-mode AD in general, though there was no comment or issue linked (so I'm not sure why it was originally skipped).

@ngimel ngimel removed the triaged label Jan 28, 2022
@ngimel
Collaborator
ngimel commented Jan 28, 2022

All issues causing IMAs (illegal memory accesses) should be high priority and should be fixed ASAP.
I know these tests start passing with the zero tensor PR, but that doesn't solve the problem of some combinations of inputs (with real, not zero, tensors) causing IMAs.
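
(For illustration, a hypothetical sketch of driving forward-mode AD through div.floor_rounding with explicitly non-zero tangents, i.e. the "real, not zero" tensors the concern above is about. It assumes a CUDA device and that a forward-AD formula for the op is available, which per the issue description requires #69740 at the time of writing.)

import torch
import torch.autograd.forward_ad as fwAD

a = torch.randn(3, 3, dtype=torch.double, device='cuda')
b = torch.rand(3, 3, dtype=torch.double, device='cuda') + 1.0  # keep the divisor away from zero

with fwAD.dual_level():
    # Attach explicitly non-zero tangents so the forward-AD path cannot be
    # short-circuited by zero tensors.
    da = fwAD.make_dual(a, torch.randn_like(a))
    db = fwAD.make_dual(b, torch.randn_like(b))
    out = torch.div(da, db, rounding_mode='floor')
    primal, tangent = fwAD.unpack_dual(out)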

@soulitzer
Contributor Author
soulitzer commented Jan 28, 2022

Good point. These tests do cover non-zero tensor inputs as well though, so as long as the tests pass we can close this issue.

@VitalyFedyunin VitalyFedyunin added the triaged label Feb 1, 2022
anjali411 added commits that referenced this issue Feb 2, 2022
facebook-github-bot pushed a commit that referenced this issue Feb 2, 2022
Summary:
Pull Request resolved: #71611

Fixes #71160 #69925 #69913

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D33897543

Pulled By: anjali411

fbshipit-source-id: f1d8608c351876b8c2619da5ef891f74bad30ab5
pytorchmergebot pushed a commit that referenced this issue Feb 2, 2022 (cherry picked from commit 643e666)
@soulitzer
Contributor Author

Fixed by #71611
