
Forward-over-reverse gradgradcheck fails on CUDA for div.floor_rounding #69913

Closed
soulitzer opened this issue Dec 14, 2021 · 5 comments
Labels
high priority · module: autograd (Related to torch.autograd, and the autograd engine in general) · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

@soulitzer
Contributor
soulitzer commented Dec 14, 2021

🐛 Describe the bug

At the time of posting this issue, you need to check out #69740 to replicate.

I also preemptively skipped the same test for floor_rounding and trunc_rounding variants because there seemed to be related skips for forward mode AD already.
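
For reference, here is a minimal sketch of the kind of check the failing test performs. This is not the exact test: the real test uses OpInfo-generated complex128 sample inputs, while this sketch uses simple double-precision tensors, assumes a CUDA device is available, and assumes the check_fwd_over_rev flag that appears in the traceback below.

import torch
from torch.autograd import gradgradcheck

def fn(a, b):
    return torch.div(a, b, rounding_mode='floor')

# Stand-in inputs; the failing test uses complex128 OpInfo samples instead.
a = torch.randn(3, 3, dtype=torch.double, device='cuda', requires_grad=True)
b = (torch.rand(3, 3, dtype=torch.double, device='cuda') + 1.0).requires_grad_()  # keep the divisor away from zero

# Forward-over-reverse: forward-mode AD is checked over the backward graph,
# which is the check_fwd_over_rev path visible in the traceback below.
gradgradcheck(fn, (a, b), check_fwd_over_rev=True, check_rev_over_rev=False, fast_mode=True)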

The stack trace is probably not pointing at the real failure, as the error message itself suggests:

 ======================================================================
ERROR [0.131s]: test_fn_fwgrad_bwgrad_div_floor_rounding_cuda_complex128 (__main__.TestGradientsCUDA)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1482, in wrapper
    method(*args, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 381, in instantiated_test
    raise rte
  File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 376, in instantiated_test
    result = test(self, **param_kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 753, in test_wrapper
    return test(*args, **kwargs)
  File "test_ops.py", line 827, in test_fn_fwgrad_bwgrad
    self._check_helper(device, dtype, op, op.get_op(), "fwgrad_bwgrad")
  File "test_ops.py", line 776, in _check_helper
    self.assertTrue(gradgradcheck(fn, gradcheck_args, **kwargs))
  File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 2893, in gradgradcheck
    return torch.autograd.gradgradcheck(fn, inputs, grad_outputs, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/autograd/gradcheck.py", line 1548, in gradgradcheck
    check_forward_ad=check_fwd_over_rev, check_backward_ad=check_rev_over_rev)
  File "/opt/conda/lib/python3.6/site-packages/torch/autograd/gradcheck.py", line 1398, in gradcheck
    return _gradcheck_helper(**args)
  File "/opt/conda/lib/python3.6/site-packages/torch/autograd/gradcheck.py", line 1415, in _gradcheck_helper
    check_undefined_grad=check_undefined_grad)
  File "/opt/conda/lib/python3.6/site-packages/torch/autograd/gradcheck.py", line 1070, in _gradcheck_real_imag
    complex_indices=complex_inp_indices, test_imag=True, use_forward_ad=True)
  File "/opt/conda/lib/python3.6/site-packages/torch/autograd/gradcheck.py", line 1290, in _fast_gradcheck
    inputs, outputs, func, all_v, all_u, rtol, atol, test_imag, is_forward_ad=use_forward_ad)
  File "/opt/conda/lib/python3.6/site-packages/torch/autograd/gradcheck.py", line 1259, in _check_analytical_numerical_equal
    updated_atol = _adjusted_atol(atol, all_u[i], all_v[j] if all_v else None)
  File "/opt/conda/lib/python3.6/site-packages/torch/autograd/gradcheck.py", line 1173, in _adjusted_atol
    return atol * float(sum_u) * float(sum_v)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
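
As the message says, the reported Python frames may not correspond to the failing kernel because CUDA errors are raised asynchronously. A sketch of forcing synchronous launches follows; exporting the variable in the shell before launching Python is the most reliable approach, and setting it from Python is assumed to work only because nothing has touched CUDA yet.

import os
# CUDA_LAUNCH_BLOCKING is read when the CUDA context is first created, so it
# must be set before any CUDA work happens in the process.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # importing torch does not by itself initialize CUDA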

Versions

At the time of posting this issue, you need to check out #69740 to replicate; otherwise, use the main branch.

cc @ezyang @gchanan @zou3519 @albanD @gqchen @pearu @nikitaved @soulitzer @lezcano @Varal7

@soulitzer soulitzer added the module: autograd and triaged labels Dec 14, 2021
@zou3519
Contributor
zou3519 commented Dec 14, 2021

We're getting this in functorch as well, FWIW. A bunch of division-related operations are failing with this assert:
https://github.com/pytorch/functorch/blob/8a60047c72c15b36ecf5a47c76b39bc82135f244/test/test_ops.py#L572-L576

Do you know why the batched-forward gradcheck didn't catch this in PyTorch?

@soulitzer
Contributor Author
soulitzer commented Dec 14, 2021

Yeah, it's already being skipped for forward-mode AD in general, though there was no comment or issue linked (so I'm not sure why it was originally skipped).

@ngimel ngimel removed the triaged label Jan 28, 2022
@ngimel
Collaborator
ngimel commented Jan 28, 2022

All issues causing IMAs (illegal memory accesses) should be high priority and should be fixed ASAP.
I know these tests start passing with the zero tensor PR, but that doesn't solve the problem of some combinations of inputs (with real, not zero, tensors) causing IMAs.
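
(For illustration, a hypothetical sketch of driving forward-mode AD through div.floor_rounding with explicitly non-zero tangents, i.e. the "real, not zero" tensors the concern above is about. It assumes a CUDA device and that a forward-AD formula for the op is available, which per the issue description requires #69740 at the time of writing.)

import torch
import torch.autograd.forward_ad as fwAD

a = torch.randn(3, 3, dtype=torch.double, device='cuda')
b = torch.rand(3, 3, dtype=torch.double, device='cuda') + 1.0  # keep the divisor away from zero

with fwAD.dual_level():
    # Attach explicitly non-zero tangents so the forward-AD path cannot be
    # short-circuited by zero tensors.
    da = fwAD.make_dual(a, torch.randn_like(a))
    db = fwAD.make_dual(b, torch.randn_like(b))
    out = torch.div(da, db, rounding_mode='floor')
    primal, tangent = fwAD.unpack_dual(out)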

@soulitzer
Contributor Author
soulitzer commented Jan 28, 2022

Good point. These tests do cover non-zero tensor inputs as well though, so as long as the tests pass we can close this issue.

@VitalyFedyunin VitalyFedyunin added the triaged label Feb 1, 2022
anjali411 added commits that referenced this issue Feb 2, 2022
facebook-github-bot pushed a commit that referenced this issue Feb 2, 2022
Summary:
Pull Request resolved: #71611

Fixes #71160 #69925 #69913

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D33897543

Pulled By: anjali411

fbshipit-source-id: f1d8608c351876b8c2619da5ef891f74bad30ab5
pytorchmergebot pushed a commit that referenced this issue Feb 2, 2022 (cherry picked from commit 643e666)
@soulitzer
Contributor Author

Fixed by #71611
