functionalization fix for inplace comparison ops by bdhirsh · Pull Request #77125 · pytorch/pytorch
functionalization fix for inplace comparison ops #77125


Closed
bdhirsh wants to merge 16 commits

Conversation

@bdhirsh (Contributor) commented May 10, 2022

This is an interesting bug that surfaced from trying to integrate with LTC.

If you look at the `torch.ge` function, its dtype promotion logic diverges between the functional and inplace variants:

(1) `torch.ge(a, b)` always returns a `bool` tensor
(2) `a.ge_(b)` won't change the dtype of `a`, so the returned tensor has whatever `a`'s dtype is

This means that if a user calls `a.ge_(b)` and we want to functionalize it into an `at::ge(a, b)` call, then the metadata on the inner tensor inside of `FunctionalTensorWrapper` will be wrong! When we eventually pop out of functionalization (and return the inner tensor back to the user), it will have the wrong dtype.

That actually means that the "correct" transformation for `ge_` would be:

**Before**
```
$1 = torch._ops.aten.ge_.Scalar($0, 0)
```
**After**
Manually perform a dtype cast afterwards if the metadata from the functional op call is "wrong":
```
$1 = torch._ops.aten.ge.Scalar($0, 0)
$2 = torch._ops.aten._to_copy.default($1, dtype=6, layout=0)
```
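For reference, the dtype divergence is easy to reproduce directly (a minimal repro using stock PyTorch):

```
import torch

a = torch.ones(3)       # float32
b = torch.zeros(3)

out = torch.ge(a, b)    # functional variant: always promotes to bool
print(out.dtype)        # torch.bool

a.ge_(b)                # inplace variant: a keeps its original dtype
print(a.dtype)          # torch.float32 (a now holds 1.0 where a >= b)
```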

Stack from ghstack:

@facebook-github-bot commented May 10, 2022

✅ No Failures (0 Pending)

As of commit c05a1c0 (more details on the Dr. CI page):

💚 💚 Looks good so far! There are no failures yet. 💚 💚

This comment was automatically generated by Dr. CI. Please report bugs/suggestions to the (internal) Dr. CI Users group.

```
set_storage_offset(value_.storage_offset());
// dtype/layout are assumed immutable on the wrapper, so if the inner
// tensor's metadata diverged, cast the inner tensor back to match.
if (dtype() != value_.unsafeGetTensorImpl()->dtype() || layout() != value_.unsafeGetTensorImpl()->layout()) {
  value_ = value_.to(c10::TensorOptions().dtype(dtype()).layout(layout()));
}
```
@bdhirsh (Contributor Author) commented:

Right now I do this by making assumptions about which classes of metadata can be mutated by an operator (see the sketch below):

  • size/stride/storage offset can all be modified by an inplace op (e.g. out= ops that resize, or transpose_())
  • dtype/layout are never modified by an operator, which lets me assume that the dtype/layout on the wrapper are "correct", and that I can propagate them to the inner tensor if they diverge
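A quick illustration of the first bullet (a minimal sketch using stock PyTorch ops):

```
import torch

t = torch.ones(2, 3)
t.transpose_(0, 1)    # inplace op that mutates sizes/strides
print(t.shape)        # torch.Size([3, 2])

out = torch.empty(0)
torch.add(torch.ones(4), 1, out=out)  # out= op that resizes its destination
print(out.shape)      # torch.Size([4])
```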

@bdhirsh (Contributor Author) commented May 10, 2022:

With this change (and some more further in the stack), I was able to get all of the LTC tests to pass, which makes me reasonably convinced that I haven't missed any weird cases.

Now that I think about it though, the absolutely correct thing to do would probably be to just run the original operator with meta tensors, and compare the output meta tensor's metadata to the functional op's output.

We can't actually do that yet, though, without full meta tensor coverage.
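A rough sketch of that idea (assuming meta kernels exist for the ops involved; `device="meta"` propagates metadata without running any computation):

```
import torch

# Run the original inplace op on meta tensors to learn the metadata
# the "real" op would produce...
ref = torch.ones(3, device="meta")
ref.ge_(0)
print(ref.dtype)   # torch.float32 -- inplace preserves dtype

# ...and compare it against the functional op's output metadata.
out = torch.ge(torch.ones(3, device="meta"), 0)
print(out.dtype)   # torch.bool -- functional promotes to bool
```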

A reviewer (Contributor) commented:

Do you need to do this with memory format too?

A reviewer (Collaborator) commented:

Yes, doing this with memory format would be good.
Another thought (this applies to all functionalized inplace ops): if `a` is broadcast, the inplace op will balk, but `copy_to` will happily proceed.

@bdhirsh (Contributor Author) commented May 11, 2022:

That's a good point. Maybe I just need to run the original op with meta tensors to faithfully get all the correct error messages.

I can do that at least for inplace ops, because they all have meta tensor impls today. The inplace ops that aren't structured yet all have a generated no-op meta tensor kernel, which is... technically wrong, since we won't error properly (but still better than nothing).

@bdhirsh (Contributor Author) commented:

Is there... a good way to check if two tensors have the same memory format? (What if they're both channels last, but one is contiguous and the other isn't?)

I guess I was thinking something like this, which looks pretty awful:

```
auto other_chan_last = other.unsafeGetTensorImpl()->is_strides_like_channels_last();
auto same_channels_last = is_strides_like_channels_last() == other_chan_last;

auto other_chan_last_3d = other.unsafeGetTensorImpl()->is_strides_like_channels_last_3d();
auto same_channels_last_3d = is_strides_like_channels_last_3d() == other_chan_last_3d;

if (!same_channels_last || !same_channels_last_3d) {
  auto mem_format = other_chan_last
      ? MemoryFormat::ChannelsLast
      : other_chan_last_3d ? MemoryFormat::ChannelsLast3d : MemoryFormat::Preserve;
  value_ = value_.to(c10::TensorOptions().dtype(dtype()).layout(layout()), mem_format);
}
```

A reviewer (Contributor) commented:

Do they have the same sizes? One way is to compare strides (see the sketch below).
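A minimal sketch of the stride-comparison idea (assuming the two tensors have the same sizes):

```
import torch

a = torch.empty(2, 3, 4, 5).contiguous(memory_format=torch.channels_last)
b = torch.empty(2, 3, 4, 5)  # default contiguous layout

# With identical sizes, differing strides imply differing memory formats.
print(a.stride())                # (60, 1, 15, 3)
print(b.stride())                # (60, 20, 5, 1)
print(a.stride() == b.stride())  # False -> different memory formats
```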

@bdhirsh bdhirsh requested review from ezyang, zou3519 and albanD May 10, 2022 01:30
@ezyang (Contributor) commented May 11, 2022

cc @mruberry @ngimel, this is the kind of thing primtorch wants to handle, right?

@mruberry (Collaborator) replied:

> cc @mruberry @ngimel, this is the kind of thing primtorch wants to handle, right?

I think it will; we intend to model inplace operations as a safe copy to the out tensor after the operation is performed, which is kinda what functionalization wants to do in the first place? This is how we model out= today.
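A minimal sketch of that "functional op + safe copy back" model (the helper name is hypothetical; `copy_` casts across dtypes, which keeps the destination's metadata intact):

```
import torch

def inplace_as_functional(op, dst, other):
    # Hypothetical modeling of dst.op_(other): compute functionally,
    # then safe-copy the result back into dst (copy_ handles the cast).
    tmp = op(dst, other)   # e.g. torch.ge -> bool result
    dst.copy_(tmp)         # casts bool back to dst's dtype
    return dst

a = torch.tensor([1.0, -2.0, 3.0])
print(inplace_as_functional(torch.ge, a, 0))  # tensor([1., 0., 1.])
```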

bdhirsh added 3 commits May 11, 2022 07:31
bdhirsh added 11 commits May 17, 2022 19:58
facebook-github-bot pushed a commit that referenced this pull request May 25, 2022
Summary:
Pull Request resolved: #77125

Approved by: https://github.com/ezyang

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/7ddc1425ff1582ab72e635fa7c4ace52357cdfc3

Reviewed By: mehtanirav

Differential Revision: D36668394

Pulled By: bdhirsh

fbshipit-source-id: d24138514fad48a382557d154d24e011b58cc820
@facebook-github-bot deleted the gh/bdhirsh/223/head branch May 28, 2022 14:16