8000 CI broken due to Core issue · Issue #6166 · pytorch/vision · GitHub

Closed
NicolasHug opened this issue Jun 14, 2022 · 11 comments

@NicolasHug
Member

Our unittest jobs started failing with the latest nightly from core (from June 14th) with

Traceback (most recent call last):
  File "/root/project/test/test_backbone_utils.py", line 142, in test_build_fx_feature_extractor
    train_return_nodes, eval_return_nodes = self._get_return_nodes(model)
  File "/root/project/test/test_backbone_utils.py", line 130, in _get_return_nodes
    model, tracer_kwargs={"leaf_modules": self.leaf_modules}, suppress_diff_warning=True
  File "/root/project/torchvision/models/feature_extraction.py", line 253, in get_graph_node_names
    train_tracer.trace(model.train())
  File "/root/project/env/lib/python3.7/site-packages/torch/fx/_symbolic_trace.py", line 587, in trace
    self.create_node('output', 'output', (self.create_arg(fn(*args)),), {},
  File "/root/project/torchvision/models/swin_transformer.py", line 394, in forward
    x = self.features(x)
  File "/root/project/env/lib/python3.7/site-packages/torch/fx/_symbolic_trace.py", line 577, in module_call_wrapper
    return self.call_module(mod, forward, args, kwargs)
  File "/root/project/torchvision/models/feature_extraction.py", line 84, in call_module
    out = forward(*args, **kwargs)
  File "/root/project/env/lib/python3.7/site-packages/torch/fx/_symbolic_trace.py", line 573, in forward
    return _orig_module_call(mod, *args, **kwargs)
  File "/root/project/env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1131, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/project/env/lib/python3.7/site-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/root/project/env/lib/python3.7/site-packages/torch/fx/_symbolic_trace.py", line 577, in module_call_wrapper
    return self.call_module(mod, forward, args, kwargs)
  File "/root/project/torchvision/models/feature_extraction.py", line 84, in call_module
    out = forward(*args, **kwargs)
  File "/root/project/env/lib/python3.7/site-packages/torch/fx/_symbolic_trace.py", line 573, in forward
    return _orig_module_call(mod, *args, **kwargs)
  File "/root/project/env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1131, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/project/torchvision/models/swin_transformer.py", line 49, in forward
    x = F.pad(x, (0, 0, 0, W % 2, 0, H % 2))
TypeError: pad(): argument 'pad' (position 2) must be tuple of ints, not tuple

I'm unable to pinpoint a specific PR in torch core though. Perhaps this is something @jamesr66a could help with? Thanks!
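For context, the likely mechanism (a hedged sketch, not a confirmed root-cause analysis): under torch.fx symbolic tracing, x.shape and arithmetic on its elements produce Proxy objects rather than plain ints, which is consistent with W % 2 reaching F.pad as a non-int in the traceback above. A minimal illustration with a hypothetical module:

```python
import torch
import torch.fx

class ShapeArith(torch.nn.Module):
    # hypothetical module: does arithmetic on a shape element
    def forward(self, x):
        h = x.shape[2]      # during tracing, this is a Proxy, not an int
        return x + (h % 2)  # Proxy arithmetic is recorded as graph nodes

gm = torch.fx.symbolic_trace(ShapeArith())
```

Running the traced module with a concrete tensor works, because the recorded shape ops execute on real ints at call time.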

@datumbox

This comment was marked as off-topic.

@YosuaMichael
Contributor

@datumbox I think that problem is not related to the broken CI.
On my laptop with torch==1.13.0.dev20220531, both pytest test_backbone_utils.py -k swin_t and pytest test_models.py -k swin_t pass. However, I still get the same error TypeError: int() argument must be a string, a bytes-like object or a number, not 'Proxy' when I run the script.

@YosuaMichael
Contributor
YosuaMichael commented Jun 14, 2022

I can't really point out which pytorch core PR caused this problem, but I think this issue can also be fixed on our side by wrapping the padding function with torch.fx.wrap. Here is a simple example that triggers the error (must use the latest nightly):

import torch
import torch.fx
import torch.nn.functional as F

class CustomModule(torch.nn.Module):
    def forward(self, x):
        bs, c, h, w = x.shape
        return F.pad(x, (0, w%2, 0, h%2, 0, 0))

m = CustomModule()
x = torch.rand(1, 3, 4, 4)
m_fx = torch.fx.symbolic_trace(m)

And here is a quick fix (although it feels hacky...):

import torch
import torch.fx
import torch.nn.functional as F

fx_wrapped_pad = F.pad
torch.fx.wrap("fx_wrapped_pad")

class CustomModule(torch.nn.Module):
    def forward(self, x):
        bs, c, h, w = x.shape
        return fx_wrapped_pad(x, (0, w%2, 0, h%2, 0, 0))

m = CustomModule()
x = torch.rand(1, 3, 4, 4)
m_fx = torch.fx.symbolic_trace(m)
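A slightly tidier variant of the same workaround (a sketch, using torch.fx.wrap as a decorator on a hypothetical _pad helper): the wrapped function is recorded as a single call_function node in the graph, so F.pad never receives Proxy arguments during tracing.

```python
import torch
import torch.fx
import torch.nn.functional as F

@torch.fx.wrap  # treat this helper as a leaf call during symbolic tracing
def _pad(x, pads):
    return F.pad(x, pads)

class CustomModule(torch.nn.Module):
    def forward(self, x):
        bs, c, h, w = x.shape  # these are Proxies during tracing
        return _pad(x, (0, w % 2, 0, h % 2, 0, 0))

m_fx = torch.fx.symbolic_trace(CustomModule())
```

At call time the deferred _pad node runs on concrete ints, so odd spatial sizes get padded up to even ones as intended.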

@datumbox
Contributor

@YosuaMichael You are right, we could do that, but whatever caused the change is breaking BC. So ideally we would like to have it fixed in Core, to avoid breaking downstream users outside of TorchVision.

@YosuaMichael
Contributor
YosuaMichael commented Jun 14, 2022

@datumbox Yeah, that's right, it might break other downstream users as well.
Also, after trying to fix it in the model file itself, it turns out to be harder than expected (the simple fix doesn't work).
Let's wait and see if anyone from the core team can help us with this.

@datumbox
Contributor

Given that there were no recent updates in the FX area, the problem could be in the dispatcher, or in modifications affecting nn.functional.pad, which is called in this case.

@vfdev-5 Could you please help us with a bisection to find the suspicious commit?

@vfdev-5 vfdev-5 self-assigned this Jun 14, 2022
@vfdev-5
Collaborator
vfdev-5 commented Jun 15, 2022

Running git bisect tells me that:

d332724071704939e1c50704f6bc62bb6c990383 is the first bad commit
commit d332724071704939e1c50704f6bc62bb6c990383
Author: Nikolay Korovaiko <korovaikon@gmail.com>
Date:   Tue Jun 14 02:17:59 2022 +0000

    Python Bindings for SymInts (#78135)
    
    This PR adds support for `SymInt`s in python. Namely,
    * `THPVariable_size` now returns `sym_sizes()`
    * python arg parser is modified to parse PyObjects into ints and `SymbolicIntNode`s
    * pybind11 bindings for `SymbolicIntNode` are added, so size expressions can be traced
    * a large number of tests added to demonstrate how to implement python symints.
    Pull Request resolved: https://github.com/pytorch/pytorch/pull/78135
    Approved by: https://github.com/ezyang

pytorch/pytorch#78135

This commit breaks the snippet from @YosuaMichael, #6166 (comment):

import torch
import torch.fx
import torch.nn.functional as F

class CustomModule(torch.nn.Module):
    def forward(self, x):
        bs, c, h, w = x.shape
        return F.pad(x, (0, w%2, 0, h%2, 0, 0))

m = CustomModule()
x = torch.rand(1, 3, 4, 4)
m_fx = torch.fx.symbolic_trace(m)

EDIT:

Checked explicitly:

@datumbox datumbox changed the title CI broken due to FX issue CI broken due to Core issue Jun 15, 2022
@ezyang
Contributor
ezyang commented Jun 15, 2022

How much do you trust this repro test to capture all the breakage?

@datumbox
Contributor

@ezyang It's a very new test for a very new model, so if you want to add a smoke test on your side, you could take Yosua's snippet above. This is an issue we haven't seen before, so it's hard to tell how many other problems the test would capture.

@vfdev-5
Collaborator
vfdev-5 commented Jun 15, 2022

If we replace the offending line in the model definition, x = F.pad(x, (0, 0, 0, W % 2, 0, H % 2)), with pure integers, x = F.pad(x, (0, 0, 0, 0, 0, 0)), the FX-related tests pass with nightly pytorch (1.13.0.dev20220615+cu113):

test/test_backbone_utils.py::TestFxFeatureExtraction::test_build_fx_feature_extractor[swin_t] PASSED                                                                       [ 12%]
test/test_backbone_utils.py::TestFxFeatureExtraction::test_forward_backward[swin_t] PASSED                                                                                 [ 25%]
test/test_backbone_utils.py::TestFxFeatureExtraction::test_jit_forward_backward[swin_t] PASSED   

So I think this is the only place that breaks the CI.
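For completeness, a minimal check of that claim (a sketch with a hypothetical stand-in module, not the actual swin_transformer code): with constant pad values, symbolic tracing goes through without hitting the TypeError.

```python
import torch
import torch.fx
import torch.nn.functional as F

class PatchMergingLike(torch.nn.Module):
    # hypothetical stand-in for the swin block, pads fixed to constants
    def forward(self, x):
        return F.pad(x, (0, 0, 0, 0, 0, 0))

gm = torch.fx.symbolic_trace(PatchMergingLike())
```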

@datumbox
Contributor

The problem is now fixed, thanks a lot for your help everyone!
