[jit] support tracing tensor __setitem__ with dynamic shape by ppwwyyxx · Pull Request #45828 · pytorch/pytorch · GitHub

[jit] support tracing tensor __setitem__ with dynamic shape #45828


Closed
wants to merge 3 commits into from

Conversation

ppwwyyxx
Collaborator
@ppwwyyxx ppwwyyxx commented Oct 5, 2020

Summary: fix #43548

Test Plan: buck test mode/dev-nosan //caffe2/test:jit -- 'test_trace_slice' --jobs 1

Differential Revision: D24106641

fbshipit-source-id: 240a6a69f44bccb277e21fd76406e31e48c8968b
@facebook-github-bot facebook-github-bot added the oncall: jit (Add this issue/PR to JIT oncall triage queue) and fb-exported labels Oct 5, 2020
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D24106641

@gmagogsfm
Contributor

@eellison could you help take a look? I am not familiar with this part of the code.

@eellison
Contributor
eellison commented Oct 9, 2020

Requesting review from PyTorch eager folks, since I'm also unfamiliar with this part of the code: @zou3519 @gchanan @ezyang

@ppwwyyxx ppwwyyxx requested review from zou3519, gchanan and ezyang and removed request for gmagogsfm and eellison October 9, 2020 19:54
@dr-ci
dr-ci bot commented Oct 12, 2020

💊 CI failures summary and remediations

As of commit 2719ba8 (more details on the Dr. CI page):


  • 4/4 failures introduced in this PR

🕵️ 4 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_xenial_py3_clang5_asan_test2 (1/4)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Oct 13 19:22:26 SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /var/lib/jenkins/workspace/aten/src/ATen/Utils.cpp:11:3 in
Oct 13 19:22:26     #7 0x55bafc7b770b in PyEval_EvalCode /tmp/build/80754af9/python_1599604603603/work/Python/ceval.c:731 
Oct 13 19:22:26     #8 0x55bafc837573 in run_mod /tmp/build/80754af9/python_1599604603603/work/Python/pythonrun.c:1025 
Oct 13 19:22:26     #9 0x55bafc83760c in PyRun_StringFlags /tmp/build/80754af9/python_1599604603603/work/Python/pythonrun.c:949 
Oct 13 19:22:26     #10 0x55bafc83766e in PyRun_SimpleStringFlags /tmp/build/80754af9/python_1599604603603/work/Python/pythonrun.c:445 
Oct 13 19:22:26     #11 0x55bafc83b472 in run_command /tmp/build/80754af9/python_1599604603603/work/Modules/main.c:301 
Oct 13 19:22:26     #12 0x55bafc83b472 in Py_Main /tmp/build/80754af9/python_1599604603603/work/Modules/main.c:749 
Oct 13 19:22:26     #13 0x55bafc70543d in main /tmp/build/80754af9/python_1599604603603/work/Programs/python.c:69 
Oct 13 19:22:26     #14 0x7eff6c42383f in __libc_start_main /build/glibc-e6zv40/glibc-2.23/csu/../csu/libc-start.c:291 
Oct 13 19:22:26     #15 0x55bafc7e4d0a in _start /home/rdonnelly/mc/conda-bld/compilers_linux-64_1534865402226/work/.build/src/glibc-2.12.2/csu/../sysdeps/x86_64/elf/start.S:103 
Oct 13 19:22:26  
Oct 13 19:22:26 SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /var/lib/jenkins/workspace/aten/src/ATen/Utils.cpp:11:3 in  
Oct 13 19:22:26 + retcode=1 
Oct 13 19:22:26 + set -e 
Oct 13 19:22:26 + return 1 
Oct 13 19:22:26 + [[ pytorch-linux-xenial-py3-clang5-asan-test2 == *-NO_AVX-* ]] 
Oct 13 19:22:26 + [[ pytorch-linux-xenial-py3-clang5-asan-test2 == *-NO_AVX2-* ]] 
Oct 13 19:22:26 + '[' -n https://github.com/pytorch/pytorch/pull/45828 ']' 
Oct 13 19:22:26 + [[ pytorch-linux-xenial-py3-clang5-asan-test2 != *coverage* ]] 
Oct 13 19:22:26 ++ mktemp 
Oct 13 19:22:26 + DETERMINE_FROM=/tmp/tmp.yeID1JkanU 
Oct 13 19:22:26 + file_diff_from_base /tmp/tmp.yeID1JkanU 

See CircleCI build pytorch_linux_xenial_cuda9_2_cudnn7_py3_gcc5_4_build (2/4)

Step: "(Optional) Merge target branch" (full log | diagnosis details | 🔁 rerun)

Automatic merge failed; fix conflicts and then commit the result.
CONFLICT (add/add): Merge conflict in .circleci/docker/common/install_conda.sh 
Auto-merging .circleci/docker/common/install_conda.sh 
CONFLICT (add/add): Merge conflict in .circleci/config.yml 
Auto-merging .circleci/config.yml 
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/simple/ge_config_tests.py 
Auto-merging .circleci/cimodel/data/simple/ge_config_tests.py 
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/pytorch_build_definitions.py 
Auto-merging .circleci/cimodel/data/pytorch_build_definitions.py 
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/pytorch_build_data.py 
Auto-merging .circleci/cimodel/data/pytorch_build_data.py 
Automatic merge failed; fix conflicts and then commit the result. 

See CircleCI build pytorch_xla_linux_bionic_py3_6_clang9_build (3/4)

Step: "(Optional) Merge target branch" (full log | diagnosis details | 🔁 rerun)

Automatic merge failed with the same .circleci add/add conflicts as build 2/4; fix conflicts and then commit the result.

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_build (4/4)

Step: "(Optional) Merge target branch" (full log | diagnosis details | 🔁 rerun)

Automatic merge failed with the same .circleci add/add conflicts as build 2/4; fix conflicts and then commit the result.

This comment was automatically generated by Dr. CI (expand for details). Follow this link to opt-out of these comments for your Pull Requests.


This comment has been revised 8 times.

@ezyang ezyang requested a review from yf225 October 12, 2020 21:22
@ezyang
Contributor
ezyang commented Oct 12, 2020

I can believe that this "helps", but the fix is still subtly wrong. You are still baking in static sizes, by fastpathing to copy_. If at runtime someone gives you tensors whose sizes don't match anymore, this still won't work, as there needed to be a broadcast that isn't happening in the trace. So the "correct" way would be to replace the explicit size materialization with something traceable.

I don't have enough information to assess if the hack is worth it.
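As a minimal eager-mode illustration of the broadcast in question (shapes chosen for illustration, not taken from the PR): when the source is smaller than the destination, `__setitem__` implicitly expands it, and that expansion is exactly what a shape-generic trace would need to record.

```python
import torch

a = torch.zeros(4, 3)
b = torch.ones(3)     # rank-1 source: must broadcast to (4, 3)
a[...] = b            # eager __setitem__ broadcasts b across dim 0

# The PR's shortcut calls copy_ only when the sizes already match,
# so a trace taken with matching sizes never records the broadcast.
c = torch.zeros(4, 3)
c.copy_(torch.ones(4, 3))   # same-size copy: the fastpath case
```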

@ezyang
Contributor
ezyang commented Oct 12, 2020

@robieta I want to suggest that indexing is something we should consider putting under benchmarking; when @yf225 and I were working on this code originally this code was very sensitive to dynamic allocations.

@robieta
robieta commented Oct 12, 2020

@robieta I want to suggest that indexing is something we should consider putting under benchmarking; when @yf225 and I were working on this code originally this code was very sensitive to dynamic allocations.

I currently have x[1] (setup: y = torch.ones((10,))) in the ad-hoc callgrind benchmarking that I'm doing (and it will become CI). But of course that's not super comprehensive. What other cases would you recommend? (Slicing, other Tensors as indices, etc.)


@ppwwyyxx
Collaborator Author

Yeah, this is not perfect, and I think a perfect solution for tracing to support dynamic sizes everywhere is very hard.

However, the edge case that's left in this PR can be easily worked around: if one needs to do a[...] = b where b needs broadcasting, then to correctly support tracing (after this PR) they only have to do a[...] = b.reshape(xx.size(), ...), because .size() in Python can be correctly traced.
On the other hand, without this PR it is more difficult to work around such an indexing operation for tracing (at least I can't think of an easy workaround).
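A rough sketch of this kind of workaround (hedged: `expand` is used here in place of the comment's `reshape`, and `a.size()` stands in for the comment's `xx.size()`; the shapes are illustrative):

```python
import torch

def fill(a, b):
    # Expanding b via a.size() keeps the shape computation in the
    # trace instead of baking in a hard-coded constant size.
    a[...] = b.expand(a.size())
    return a

traced = torch.jit.trace(
    fill, (torch.zeros(4, 3), torch.ones(3)), check_trace=False
)
# A different leading dimension at runtime still works, because
# a.size() was traced rather than materialized as a constant.
out = traced(torch.zeros(2, 3), torch.ones(3))
```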

// A shortcut to avoid generating hard-coded constant sizes during tracing.
dst.copy_(src);
return;
}
Contributor
@yf225 yf225 Oct 12, 2020
We might need to test whether this would hurt indexing perf for the fallback path. It would be awesome to collect numbers like the following (copy-pasted from #32841 (comment)):

| Test | Old (us) | New (us) | % difference (negative: speedup) |
|---|---|---|---|
| v[0] | 1.68 | 1.69 | 0.60% |
| v[0] = 1 | 7.33 | 7.04 | -3.96% |
| v[:] | 1.67 | 1.69 | 1.20% |
| v[:] = 1 | 7.58 | 6.98 | -7.92% |
| v[...] | 1.29 | 1.29 | 0.00% |
| v[...] = 1 | 5.6 | 5.33 | -4.82% |
| v[None] | 1.67 | 1.68 | 0.60% |
| v[None] = 1 | 7.39 | 7.32 | -0.95% |
| v[False] | 9.37 | 9.03 | -3.63% |
| v[False] = 1 | 1.15 | 1 | -13.04% |
| v[True] | 12.3 | 11.7 | -4.88% |
| v[True] = 1 | 7.52 | 7.22 | -3.99% |
| v[0, 0] | 3.32 | 3.38 | 1.81% |
| v[0, 0] = 1 | 9 | 8.83 | -1.89% |
| a[b, None, ..., :, True] | 37.8 | 37.4 | -1.06% |
| a[b, None, ..., :, True] = 1 | 39.4 | 39.2 | -0.51% |

to better understand the impact on performance (note: we might need to change the test cases to make sure we are executing the fallback path).

Collaborator Author

Makes sense. Which code produces this table?

Contributor

I think this is what I used:

#!/bin/bash

# Declare an array of string with type
declare -a StmtArray=(
  "python -m timeit -s 'import torch; v = torch.randn(1,1,1);' 'v[0]'"
  "python -m timeit -s 'import torch; v = torch.randn(1,1,1);' 'v[0] = 1'"
  "python -m timeit -s 'import torch; v = torch.randn(1,1,1);' 'v[0, 0]'"
  "python -m timeit -s 'import torch; v = torch.randn(1,1,1);' 'v[0, 0] = 1'"
  "python -m timeit -s 'import torch; v = torch.randn(1,1,1);' 'v[...]'"
  "python -m timeit -s 'import torch; v = torch.randn(1,1,1);' 'v[...] = 1'"
  "python -m timeit -s 'import torch; v = torch.randn(1,1,1);' 'v[:]'"
  "python -m timeit -s 'import torch; v = torch.randn(1,1,1);' 'v[:] = 1'"
  "python -m timeit -s 'import torch; v = torch.randn(1,1,1);' 'v[None]'"
  "python -m timeit -s 'import torch; v = torch.randn(1,1,1);' 'v[None] = 1'"
  "python -m timeit -s 'import torch; v = torch.randn(1,1,1);' 'v[False]'"
  "python -m timeit -s 'import torch; v = torch.randn(1,1,1);' 'v[False] = 1'"
  "python -m timeit -s 'import torch; v = torch.randn(1,1,1);' 'v[True]'"
  "python -m timeit -s 'import torch; v = torch.randn(1,1,1);' 'v[True] = 1'"
  "python -m timeit -s 'import torch; a = torch.zeros(100, 100, 1, 1, 1); b = torch.arange(99, -1, -1).long(); ' 'a[b, None, ..., :, True]'"
  "python -m timeit -s 'import torch; a = torch.zeros(100, 100, 1, 1, 1); b = torch.arange(99, -1, -1).long(); ' 'a[b, None, ..., :, True] = 1'"
)

for (( i = 0; i < ${#StmtArray[@]} ; i++ )); do
  printf "\n**** Running: ${StmtArray[$i]} *****\n\n"
  for (( j = 0 ; j < 20 ; j++ )); do
    eval "${StmtArray[$i]}"
  done
done


The way to do it nowadays is something like:

from torch.utils.benchmark import Timer, Compare

SCALAR_SETUP = "v = torch.randn(1,1,1)"
VECTOR_SETUP = "a = torch.zeros(100, 100, 1, 1, 1); b = torch.arange(99, -1, -1).long()"
TASKS = (
    (SCALAR_SETUP, 'v[0]'),
    (SCALAR_SETUP, 'v[0] = 1'),
    (SCALAR_SETUP, 'v[0, 0]'),
    (SCALAR_SETUP, 'v[0, 0] = 1'),
    (SCALAR_SETUP, 'v[...]'),
    (SCALAR_SETUP, 'v[...] = 1'),
    (SCALAR_SETUP, 'v[:]'),
    (SCALAR_SETUP, 'v[:] = 1'),
    (SCALAR_SETUP, 'v[None]'),
    (SCALAR_SETUP, 'v[None] = 1'),
    (SCALAR_SETUP, 'v[False]'),
    (SCALAR_SETUP, 'v[False] = 1'),
    (SCALAR_SETUP, 'v[True]'),
    (SCALAR_SETUP, 'v[True] = 1'),
    (VECTOR_SETUP, 'a[b, None, ..., :, True]'),
    (VECTOR_SETUP, 'a[b, None, ..., :, True] = 1'),
)

results = []
for setup, stmt in TASKS:
    timer = Timer(
        stmt,
        setup=setup,
        description="HEAD",
    )
    results.append(timer.blocked_autorange(min_run_time=1))

cmp = Compare(results)
# cmp.trim_significant_figures()
cmp.print()
[--------------------  -------------------]
                                    |  HEAD
1 threads: --------------------------------
      v[0]                          |   1.7
      v[0] = 1                      |   7.2
      v[0, 0]                       |   3.3
      v[0, 0] = 1                   |   9.5
      v[...]                        |   1.3
      v[...] = 1                    |   6.0
      v[:]                          |   1.6
      v[:] = 1                      |   6.7
      v[None]                       |   1.6
      v[None] = 1                   |   7.0
      v[False]                      |   8.3
      v[False] = 1                  |   1.1
      v[True]                       |  11.3
      v[True] = 1                   |   7.0
      a[b, None, ..., :, True]      |  33.2
      a[b, None, ..., :, True] = 1  |  35.1

Times are in microseconds (us).

trim_significant_figures is currently a bit conservative with really fast ops, so I didn't use it here, but in the future it will be the recommended approach.

Collaborator Author
@ppwwyyxx ppwwyyxx Oct 13, 2020

Thanks! I added one more setindex test case with the same shape. These are the results after running both the old and new code twice.

$cat old*.txt
[---------------------  ---------------------]
                                    |    HEAD
1 threads: -----------------------------------
      v[0] = 1                      |   5830.6
      v[0, 0] = 1                   |   6796.2
      v[0, 0, 0] = 1                |   6797.3
      v[...] = 1                    |   4329.7
      v[:] = 1                      |   5482.9
      v[None] = 1                   |   5664.0
      v[False] = 1                  |    964.3
      v[True] = 1                   |   5707.1
      a[b, None, ..., :, True] = 1  |  30175.3

Times are in nanoseconds (ns).

[---------------------  ---------------------]
                                    |    HEAD
1 threads: -----------------------------------
      v[0] = 1                      |   5890.9
      v[0, 0] = 1                   |   7040.5
      v[0, 0, 0] = 1                |   7034.4
      v[...] = 1                    |   4518.2
      v[:] = 1                      |   5635.5
      v[None] = 1                   |   5664.1
      v[False] = 1                  |    937.6
      v[True] = 1                   |   5705.7
      a[b, None, ..., :, True] = 1  |  29280.1

Times are in nanoseconds (ns).

$cat new*.txt
[---------------------  ---------------------]
                                    |    HEAD
1 threads: -----------------------------------
      v[0] = 1                      |   5597.9
      v[0, 0] = 1                   |   6769.5
      v[0, 0, 0] = 1                |   5938.4
      v[...] = 1                    |   4368.9
      v[:] = 1                      |   5516.4
      v[None] = 1                   |   5608.9
      v[False] = 1                  |    978.5
      v[True] = 1                   |   5643.1
      a[b, None, ..., :, True] = 1  |  32076.3

Times are in nanoseconds (ns).

[---------------------  ---------------------]
                                    |    HEAD
1 threads: -----------------------------------
      v[0] = 1                      |   5842.8
      v[0, 0] = 1                   |   6934.9
      v[0, 0, 0] = 1                |   5923.4
      v[...] = 1                    |   4381.3
      v[:] = 1                      |   5548.5
      v[None] = 1                   |   5705.6
      v[False] = 1                  |    972.3
      v[True] = 1                   |   5658.0
      a[b, None, ..., :, True] = 1  |  30968.8

Times are in nanoseconds (ns).

The new test case v[0, 0, 0] = 1 is considerably faster. It's hard to obtain signals from the others, as they seem to be within noise range. Maybe the last one gets slower? I can test more if needed.

Contributor

@robieta To answer your question further above, this benchmark is the one that I'd expect to see added.

@vadimkantorov
Contributor
vadimkantorov commented Oct 12, 2020

I propose some verbose mode for tracing that helps uncover potentially incorrect traces (or at least a warning) with respect to indexing. It's better than silently recording static shapes and then failing at runtime with implicitly divergent behavior.

It would also be cool to allow adding some "shape hints" to help shape inference, which would e.g. guarantee dimensionality, some static shapes, or the equality of a dim of one tensor to another (or to some known "dim symbol").

@ezyang
Contributor
ezyang commented Oct 13, 2020

However, the edge case that's left in this PR can be easily worked around: if one needs to do a[...] = b where b needs broadcasting, then to correctly support tracing (after this PR) they only have to do a[...] = b.reshape(xx.size(), ...), because .size() in Python can be correctly traced.
On the other hand, without this PR it is more difficult to work around such an indexing operation for tracing (at least I can't think of an easy workaround).

OK, this is sufficient justification to me.

Contributor
@ezyang ezyang left a comment

Can you bulk up the comment on the shortcut to include the discussion that we had here, including the suggested workaround?

@vadimkantorov's comment seems relevant in the long term; maybe file a follow-up issue about this too.

Contributor
@facebook-github-bot facebook-github-bot left a comment

@ppwwyyxx has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ppwwyyxx
Collaborator Author

@vadimkantorov do you want to open a follow-up issue and elaborate more on the ideas?

@facebook-github-bot
Contributor

@ppwwyyxx merged this pull request in b64cf93.

@fmassa
Member
fmassa commented Oct 14, 2020

Are we still developing tracing functionality? I thought we would be moving away from tracing to instead rely on scripting the model.

@vadimkantorov
Contributor
vadimkantorov commented Oct 14, 2020 via email

@vadimkantorov
Contributor
vadimkantorov commented Oct 23, 2020

@ppwwyyxx @ezyang I had two proposals:

  1. some verbose flag to surface potential warnings
  2. expressing shaping information, somewhat similar to #40373 ("[discussion] Expressing tensor dimension semantics / constraints through typing / constraints blocks"). E.g. allowing tracing some constraints like assert size(X, 1) == size(Y, 2) and size(X, 2) == 64 and ndim(Z) == 3, and then using them for compilation/optimization (in my example in #40373 I annotate tensors with strings like BTC and BC and then check that similarly named dimensions of different tensors agree)
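A pure-Python sketch of that second proposal (the helper `check_shape_hints` and its interface are hypothetical, not an existing PyTorch API): annotate each tensor shape with a dimension-name string and verify that identically named dimensions agree.

```python
def check_shape_hints(annotated):
    """Check named-dimension consistency across tensor shapes.

    `annotated` is a list of (shape, names) pairs, e.g.
    ((8, 100, 64), "BTC") for batch, time, channel.
    Raises ValueError if two tensors disagree on a named dim;
    returns the resolved {name: size} mapping.
    """
    sizes = {}
    for shape, names in annotated:
        if len(shape) != len(names):
            raise ValueError(f"rank {len(shape)} != annotation {names!r}")
        for size, name in zip(shape, names):
            # setdefault records the first size seen for this name;
            # any later mismatch is a constraint violation.
            if sizes.setdefault(name, size) != size:
                raise ValueError(
                    f"dim {name!r}: expected {sizes[name]}, got {size}"
                )
    return sizes

# BTC activations and a BC projection sharing batch and channel dims:
hints = check_shape_hints([((8, 100, 64), "BTC"), ((8, 64), "BC")])
```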

@ezyang
Contributor
ezyang commented Oct 26, 2020

@vadimkantorov wanna stick these in some fresh issues?

@vadimkantorov
Contributor
vadimkantorov commented Oct 26, 2020

#40373 covers it, doesn't it? There's no substantial development there, but the motivation is well-discussed, no?

@ezyang
Contributor
ezyang commented Oct 26, 2020

Maybe? That issue as a whole is not so likely to get resolved in the near future, but more focused fixes for tracing might be easier to tackle.

@vadimkantorov
Contributor

I added an UPD there and will rename the issue as well!

@vadimkantorov
Contributor

Okay, I'll make a new issue as well

Labels
fb-exported · Merged · oncall: jit (Add this issue/PR to JIT oncall triage queue)

Successfully merging this pull request may close these issues.

JIT tracing incorrectly records some slice bounds
9 participants