functionalization fix for inplace comparison ops by bdhirsh · Pull Request #77125 · pytorch/pytorch
functionalization fix for inplace comparison ops #77125


Closed
bdhirsh wants to merge 16 commits

Conversation

@bdhirsh (Contributor) commented May 10, 2022

This is an interesting bug that surfaced from trying to integrate with LTC.

If you look at the `torch.ge` function, its dtype promotion logic diverges between the functional and inplace variants:

(1) `torch.ge(a, b)` always returns a `bool` tensor
(2) `a.ge_(b)` won't change the dtype of `a`, so the returned tensor has whatever `a`'s dtype is

This means that if a user calls `a.ge_(b)` and we want to functionalize it into an `at::ge(a, b)` call, then the metadata on the inner tensor inside of `FunctionalTensorWrapper` will be wrong! When we eventually pop out of functionalization (and return the inner tensor back to the user), it will have the wrong dtype.

That actually means that the "correct" transformation for `ge_` would be:

**Before**
```
$1 = torch._ops.aten.ge_.Scalar($0, 0)
```
**After**
Manually perform a dtype cast afterwards if the metadata from the functional op call is "wrong":
```
$1 = torch._ops.aten.ge.Scalar($0, 0)
$2 = torch._ops.aten._to_copy.default($1, dtype=6, layout=0)
```
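For reference, the dtype divergence is easy to reproduce directly (a minimal repro using stock PyTorch):

```
import torch

a = torch.ones(3)       # float32
b = torch.zeros(3)

out = torch.ge(a, b)    # functional variant: always promotes to bool
print(out.dtype)        # torch.bool

a.ge_(b)                # inplace variant: a keeps its original dtype
print(a.dtype)          # torch.float32 (a now holds 1.0 where a >= b)
```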

Stack from ghstack:

@facebook-github-bot commented May 10, 2022

✅ No Failures (0 Pending)

As of commit c05a1c0 (more details on the Dr. CI page):

💚 💚 Looks good so far! There are no failures yet. 💚 💚

This comment was automatically generated by Dr. CI. Please report bugs/suggestions to the (internal) Dr. CI Users group.

```
set_storage_offset(value_.storage_offset());
// dtype/layout are assumed immutable on the wrapper, so if the inner
// tensor's metadata diverged, cast the inner tensor back to match.
if (dtype() != value_.unsafeGetTensorImpl()->dtype() || layout() != value_.unsafeGetTensorImpl()->layout()) {
  value_ = value_.to(c10::TensorOptions().dtype(dtype()).layout(layout()));
}
```
@bdhirsh (Contributor Author) commented:

Right now I do this by making assumptions about which classes of metadata can be mutated by an operator (see the sketch below):

  • size/stride/storage offset can all be modified by an inplace op (e.g. out= ops that resize, or transpose_())
  • dtype/layout are never modified by an operator, which lets me assume that the dtype/layout on the wrapper are "correct", and that I can propagate them to the inner tensor if they diverge
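A quick illustration of the first bullet (a minimal sketch using stock PyTorch ops):

```
import torch

t = torch.ones(2, 3)
t.transpose_(0, 1)    # inplace op that mutates sizes/strides
print(t.shape)        # torch.Size([3, 2])

out = torch.empty(0)
torch.add(torch.ones(4), 1, out=out)  # out= op that resizes its destination
print(out.shape)      # torch.Size([4])
```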

@bdhirsh (Contributor Author) commented May 10, 2022:

With this change (and some more further in the stack), I was able to get all of the LTC tests to pass, which makes me reasonably convinced that I haven't missed any weird cases.

Now that I think about it though, the absolutely correct thing to do would probably be to just run the original operator with meta tensors, and compare the output meta tensor's metadata to the functional op's output.

We can't actually do that yet, though, without full meta tensor coverage.
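A rough sketch of that idea (assuming meta kernels exist for the ops involved; `device="meta"` propagates metadata without running any computation):

```
import torch

# Run the original inplace op on meta tensors to learn the metadata
# the "real" op would produce...
ref = torch.ones(3, device="meta")
ref.ge_(0)
print(ref.dtype)   # torch.float32 -- inplace preserves dtype

# ...and compare it against the functional op's output metadata.
out = torch.ge(torch.ones(3, device="meta"), 0)
print(out.dtype)   # torch.bool -- functional promotes to bool
```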

A reviewer (Contributor) commented:

Do you need to do this with memory format too?

A reviewer (Collaborator) commented:

Yes, doing this with memory format would be good.
Another thought (this applies to all functionalized inplace ops): if `a` is broadcast, the inplace op will balk, but `copy_to` will happily proceed.

@bdhirsh (Contributor Author) commented May 11, 2022:

That's a good point. Maybe I just need to run the original op with meta tensors to faithfully get all the correct error messages.

I can do that at least for inplace ops, because they all have meta tensor impls today. The inplace ops that aren't structured yet all have a generated no-op meta tensor kernel, which is... technically wrong, since we won't error properly (but still better than nothing).

@bdhirsh (Contributor Author) commented:

Is there... a good way to check if two tensors have the same memory format? (What if they're both channels last, but one is contiguous and the other isn't?)

I guess I was thinking something like this, which looks pretty awful:

```
auto other_chan_last = other.unsafeGetTensorImpl()->is_strides_like_channels_last();
auto same_channels_last = is_strides_like_channels_last() == other_chan_last;

auto other_chan_last_3d = other.unsafeGetTensorImpl()->is_strides_like_channels_last_3d();
auto same_channels_last_3d = is_strides_like_channels_last_3d() == other_chan_last_3d;

if (!same_channels_last || !same_channels_last_3d) {
  auto mem_format = other_chan_last
      ? MemoryFormat::ChannelsLast
      : other_chan_last_3d ? MemoryFormat::ChannelsLast3d : MemoryFormat::Preserve;
  value_ = value_.to(c10::TensorOptions().dtype(dtype()).layout(layout()), mem_format);
}
```

A reviewer (Contributor) commented:

Do they have the same sizes? One way is to compare strides (see the sketch below).
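A minimal sketch of the stride-comparison idea (assuming the two tensors have the same sizes):

```
import torch

a = torch.empty(2, 3, 4, 5).contiguous(memory_format=torch.channels_last)
b = torch.empty(2, 3, 4, 5)  # default contiguous layout

# With identical sizes, differing strides imply differing memory formats.
print(a.stride())                # (60, 1, 15, 3)
print(b.stride())                # (60, 20, 5, 1)
print(a.stride() == b.stride())  # False -> different memory formats
```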

@bdhirsh bdhirsh requested review from ezyang, zou3519 and albanD May 10, 2022 01:30
@ezyang (Contributor) commented May 11, 2022

cc @mruberry @ngimel, this is the kind of thing primtorch wants to handle, right?

@mruberry (Collaborator) replied:

> cc @mruberry @ngimel, this is the kind of thing primtorch wants to handle, right?

I think it will; we intend to model inplace operations as a safe copy to the out tensor after the operation is performed, which is kinda what functionalization wants to do in the first place? This is how we model out= today.
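A minimal sketch of that "functional op + safe copy back" model (the helper name is hypothetical; `copy_` casts across dtypes, which keeps the destination's metadata intact):

```
import torch

def inplace_as_functional(op, dst, other):
    # Hypothetical modeling of dst.op_(other): compute functionally,
    # then safe-copy the result back into dst (copy_ handles the cast).
    tmp = op(dst, other)   # e.g. torch.ge -> bool result
    dst.copy_(tmp)         # casts bool back to dst's dtype
    return dst

a = torch.tensor([1.0, -2.0, 3.0])
print(inplace_as_functional(torch.ge, a, 0))  # tensor([1., 0., 1.])
```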

bdhirsh added 3 commits May 11, 2022 07:31
bdhirsh added 11 commits May 17, 2022 19:58
facebook-github-bot pushed a commit that referenced this pull request May 25, 2022
Summary:
Pull Request resolved: #77125

Approved by: https://github.com/ezyang

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/7ddc1425ff1582ab72e635fa7c4ace52357cdfc3

Reviewed By: mehtanirav

Differential Revision: D36668394

Pulled By: bdhirsh

fbshipit-source-id: d24138514fad48a382557d154d24e011b58cc820
@facebook-github-bot deleted the gh/bdhirsh/223/head branch May 28, 2022 14:16