[chalf] enable testing for multiple ops by kshitij12345 · Pull Request #77405 · pytorch/pytorch · GitHub

[chalf] enable testing for multiple ops #77405


Closed

Conversation

kshitij12345
Collaborator
@kshitij12345 kshitij12345 commented May 13, 2022

Ref: #74537

Enable for permute, split, split_with_sizes, select, ravel, reshape, reshape_as, unfold, squeeze, unsqueeze, transpose
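
For context, torch.chalf is PyTorch's ComplexHalf dtype (an alias for torch.complex32), and the ops listed above are view/shape ops, so enabling chalf in their OpInfo entries mostly exercises dtype plumbing and metadata handling rather than new kernels. Below is a minimal, hedged sketch (not taken from this PR's diff) of the kind of chalf usage these tests exercise; it constructs the tensor in complex64 and casts, since chalf kernel coverage still varies by backend and version:

import torch

# Prefer CUDA when available; ComplexHalf coverage is broader there.
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(2, 3, 4, dtype=torch.cfloat, device=device).to(torch.chalf)

y = x.permute(2, 0, 1)            # view op: only sizes/strides change
parts = x.split(2, dim=2)         # returns views over the same storage
z = x.reshape(6, 4).unsqueeze(0).squeeze(0).transpose(0, 1)
print(y.shape, [p.shape for p in parts], z.shape, z.dtype)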

@facebook-github-bot
Contributor
facebook-github-bot commented May 13, 2022

✅ No Failures (0 Pending)

As of commit f22d099 (more details on the Dr. CI page):

💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI.

@kshitij12345 kshitij12345 requested a review from anjali411 May 13, 2022 12:51
@kshitij12345 kshitij12345 marked this pull request as ready for review May 13, 2022 12:51
@@ -18467,6 +18466,12 @@ def __init__(
    PythonRefInfo(
        "_refs.permute",
        torch_opinfo_name="permute",
        skips=(
            DecorateInfo(unittest.expectedFailure, 'TestCommon',
Collaborator

What's going on here?

Collaborator Author

Oops, forgot to add the error as a comment:

RuntimeError: "index_select_cuda" not implemented for 'ComplexHalf'
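
For reference, a hypothetical sketch of what the completed skip entry could look like once the error is recorded as a comment; the test name and dtype filter below are assumptions for illustration, not copied from the PR (DecorateInfo lives in torch/testing/_internal/common_methods_invocations.py, and torch.chalf is an alias for torch.complex32):

import unittest
import torch
from torch.testing._internal.common_methods_invocations import DecorateInfo

skips = (
    # RuntimeError: "index_select_cuda" not implemented for 'ComplexHalf'
    DecorateInfo(unittest.expectedFailure, 'TestCommon',
                 'test_noncontiguous_samples',
                 device_type='cuda', dtypes=(torch.complex32,)),
)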

Collaborator
@mruberry mruberry left a comment

Nice changes -- just add that comment, please

@kshitij12345
Collaborator Author

@pytorchbot merge this please

@github-actions
Contributor

Hey @kshitij12345.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

@malfet
Contributor
malfet commented May 14, 2022

@pytorchbot revert this please, as it caused torch_nn to fail with SIGIOT, see https://hud.pytorch.org/pytorch/pytorch/commit/fff560cb6e4232778cefe9b1a6ed78463b4b9e54

pytorchmergebot added a commit that referenced this pull request May 14, 2022
@malfet
Contributor
malfet commented May 14, 2022

From the log, it looks like the failure was triggered by a SIGIOT raised while running test_reference_testing_linalg_tensorsolve_cuda_complex128:

2022-05-13T22:34:05.0889562Z   test_reference_testing_linalg_tensorsolve_cuda_complex128 (__main__.TestCommonCUDA) ... python: /opt/conda/conda-bld/magma-cuda113_1619629459349/work/interface_cuda/interface.cpp:901: void magma_queue_create_from_cuda_internal(magma_device_t, cudaStream_t, cublasHandle_t, cusparseHandle_t, magma_queue**, const char*, const char*, int): Assertion `queue->dCarray__ != __null' failed.
2022-05-13T22:34:06.9287544Z Traceback (most recent call last):
2022-05-13T22:34:06.9288191Z   File "test/run_test.py", line 1072, in <module>
2022-05-13T22:34:06.9291356Z     main()
2022-05-13T22:34:06.9291930Z   File "test/run_test.py", line 1050, in main
2022-05-13T22:34:06.9295361Z     raise RuntimeError(err_message)
2022-05-13T22:34:06.9295734Z RuntimeError: test_ops failed! Received signal: SIGIOT

And since coredumps are now preserved as artifacts, one can get a backtrace by installing the wheel package and running gdb as shown below:

$ gdb /opt/conda/bin/python core.936  -ex "bt"
GNU gdb (Ubuntu 8.2-0ubuntu1~16.04.1) 8.2
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /opt/conda/bin/python...done.

warning: core file may not match specified executable file.
[New LWP 936]
[New LWP 939]
[New LWP 948]
[New LWP 942]
[New LWP 941]
[New LWP 940]
[New LWP 947]
[New LWP 943]
[New LWP 944]
[New LWP 945]
[New LWP 946]
[New LWP 949]
[New LWP 950]
[New LWP 951]
[New LWP 952]
[New LWP 953]
[New LWP 968]
[New LWP 969]
[New LWP 957]
[New LWP 961]
[New LWP 956]
[New LWP 967]
[New LWP 959]
[New LWP 954]
[New LWP 958]
[New LWP 1036]
[New LWP 963]
[New LWP 965]
[New LWP 966]
[New LWP 964]
[New LWP 970]
[New LWP 962]
[New LWP 1038]
[New LWP 960]
[New LWP 1037]
[New LWP 1040]
[New LWP 1041]
[New LWP 1039]
[New LWP 971]
[New LWP 955]
[New LWP 1042]

warning: Could not load shared library symbols for /usr/lib/x86_64-linux-gnu/libcuda.so.1.
Do you need "set solib-search-path" or "set sysroot"?
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/opt/conda/bin/python test_ops.py -v --import-slow-tests --import-disabled-test'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007f9eb38d9438 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
54	../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
[Current thread is 1 (Thread 0x7f9eb47ad700 (LWP 936))]
#0  0x00007f9eb38d9438 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#1  0x00007f9eb38db03a in __GI_abort () at abort.c:89
#2  0x00007f9eb38d1be7 in __assert_fail_base (fmt=<optimized out>, assertion=assertion@entry=0x7f9d6788a94a "queue->dCarray__ != __null", file=file@entry=0x7f9d6788a628 "/opt/conda/conda-bld/magma-cuda113_1619629459349/work/interface_cuda/interface.cpp", 
    line=line@entry=901, 
    function=function@entry=0x7f9d6788ab00 <magma_queue_create_from_cuda_internal::__PRETTY_FUNCTION__> "void magma_queue_create_from_cuda_internal(magma_device_t, cudaStream_t, cublasHandle_t, cusparseHandle_t, magma_queue**, const char*, const char*, int)")
    at assert.c:92
#3  0x00007f9eb38d1c92 in __GI___assert_fail (assertion=0x7f9d6788a94a "queue->dCarray__ != __null", file=0x7f9d6788a628 "/opt/conda/conda-bld/magma-cuda113_1619629459349/work/interface_cuda/interface.cpp", line=901, 
    function=0x7f9d6788ab00 <magma_queue_create_from_cuda_internal::__PRETTY_FUNCTION__> "void magma_queue_create_from_cuda_internal(magma_device_t, cudaStream_t, cublasHandle_t, cusparseHandle_t, magma_queue**, const char*, const char*, int)") at assert.c:101
#4  0x00007f9d6752432c in magma_queue_create_from_cuda_internal () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cuda_linalg.so
#5  0x00007f9d674e2abb in at::native::MAGMAQueue::MAGMAQueue(long) () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cuda_linalg.so
#6  0x00007f9d674d9dfe in at::native::lazy_linalg::lu_solve_trans_dispatch(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::native::TransposeType) () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cuda_linalg.so
#7  0x00007f9e7936bcd7 in at::native::linalg_solve_out_info(at::Tensor&, at::Tensor&, at::Tensor const&, at::Tensor const&) () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#8  0x00007f9e7936c202 in at::native::linalg_solve_out(at::Tensor const&, at::Tensor const&, at::Tensor&) () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#9  0x00007f9e72040b8d in at::(anonymous namespace)::(anonymous namespace)::wrapper_out_linalg_solve_out_out(at::Tensor const&, at::Tensor const&, at::Tensor&) () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cuda_cu.so
#10 0x00007f9e79d5f562 in at::_ops::linalg_solve_out::call(at::Tensor const&, at::Tensor const&, at::Tensor&) () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#11 0x00007f9e7935ced1 in at::native::linalg_solve(at::Tensor const&, at::Tensor const&) () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#12 0x00007f9e720409f1 in at::(anonymous namespace)::(anonymous namespace)::wrapper__linalg_solve(at::Tensor const&, at::Tensor const&) () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cuda_cu.so
#13 0x00007f9e72040a53 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, at::Tensor const&), &at::(anonymous namespace)::(anonymous namespace)::wrapper__linalg_solve>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&> >, at::Tensor (at::Tensor const&, at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&) ()
   from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cuda_cu.so
#14 0x00007f9e79d162f2 in at::Tensor c10::Dispatcher::redispatch<at::Tensor, at::Tensor const&, at::Tensor const&>(c10::TypedOperatorHandle<at::Tensor (at::Tensor const&, at::Tensor const&)> const&, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&) const [clone .isra.203] () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#15 0x00007f9e79d177e6 in at::_ops::linalg_solve::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&) () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#16 0x00007f9e7af6a670 in torch::autograd::VariableType::(anonymous namespace)::linalg_solve(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&) () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#17 0x00007f9e7af6b186 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, at::Tensor const&), &torch::autograd::VariableType::(anonymous namespace)::linalg_solve>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, at::Tensor const&> >, at::Tensor (c10::DispatchKeySet, at::Tensor const&, at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&) () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#18 0x00007f9e79d5d68f in at::_ops::linalg_solve::call(at::Tensor const&, at::Tensor const&) () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#19 0x00007f9e794bcb3b in at::native::linalg_tensorsolve(at::Tensor const&, at::Tensor const&, c10::OptionalArrayRef<long>) () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#20 0x00007f9e7a0af9bd in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, at::Tensor const&, c10::OptionalArrayRef<long>), &at::(anonymous namespace)::(anonymous namespace)::wrapper__linalg_tensorsolve>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&, c10::OptionalArrayRef<long> > >, at::Tensor (at::Tensor const&, at::Tensor const&, c10::OptionalArrayRef<long>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, c10::OptionalArrayRef<long>) () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#21 0x00007f9e79ba37b3 in at::_ops::linalg_tensorsolve::call(at::Tensor const&, at::Tensor const&, c10::OptionalArrayRef<long>) () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#22 0x00007f9e85ca4dc4 in torch::autograd::THPVariable_linalg_tensorsolve(_object*, _object*, _object*) () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#23 0x000056234c6fd078 in cfunction_call_varargs (kwargs=0x7f9d7648a4b0, args=0x7f9d7610d320, func=0x7f9e60d2e500) at /home/builder/tkoch/workspace/python_1648536129212/work/Objects/call.c:755
#24 PyCFunction_Call (kwargs=0x7f9d7648a4b0, args=0x7f9d7610d320, func=0x7f9e60d2e500) at /home/builder/tkoch/workspace/python_1648536129212/work/Objects/call.c:786
#25 do_call_core (kwdict=0x7f9d7648a4b0, callargs=0x7f9d7610d320, func=0x7f9e60d2e500) at /home/builder/tkoch/workspace/python_1648536129212/work/Python/ceval.c:4641
#26 _PyEval_EvalFrameDefault (f=0x7f9d75189590, throwflag=<optimized out>) at /home/builder/tkoch/workspace/python_1648536129212/work/Python/ceval.c:3191
#27 0x000056234c64be85 in PyEval_EvalFrameEx (throwflag=0, f=0x7f9d75189590) at /home/builder/tkoch/workspace/python_1648536129212/work/Python/ceval.c:547
#28 _PyEval_EvalCodeWithName (_co=<optimized out>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=<optimized out>, kwnames=0x7f9d71bb0748, kwargs=0x7f9d71bb0750, kwcount=2, kwstep=2, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0, 
    name=0x7f9eb45d11b0, qualname=0x7f9dd56bfa30) at /home/builder/tkoch/workspace/python_1648536129212/work/Python/ceval.c:3930
#29 0x000056234c64d83e in _PyFunction_FastCallDict (kwargs=<optimized out>, nargs=<optimized out>, args=0x7ffe43475540, func=<optimized out>) at /home/builder/tkoch/workspace/python_1648536129212/work/Objects/call.c:376
#30 _PyObject_FastCallDict (callable=<optimized out>, args=0x7ffe43475540, nargs=<optimized out>, kwargs=<optimized out>) at /home/builder/tkoch/workspace/python_1648536129212/work/Objects/call.c:98
#31 0x000056234c6b94bc in _PyObject_Call_Prepend (kwargs=0x7f9d7686fa50, args=0x7f9d70ccc410, obj=<optimized out>, callable=0x7f9d793de830) at /home/builder/tkoch/workspace/python_1648536129212/work/Objects/call.c:906
#32 slot_tp_call (self=<optimized out>, args=0x7f9d70ccc410, kwds=0x7f9d7686fa50) at /home/builder/tkoch/workspace/python_1648536129212/work/Objects/typeobject.c:6402
#33 0x000056234c64db94 in PyObject_Call (callable=0x7f9d7903cdd0, args=<optimized out>, kwargs=<optimized out>) at /home/builder/tkoch/workspace/python_1648536129212/work/Objects/call.c:245
#34 0x000056234c6f7c58 in do_call_core (kwdict=0x7f9d7686fa50, callargs=0x7f9d70ccc410, func=0x7f9d7903cdd0) at /home/builder/tkoch/workspace/python_1648536129212/work/Python/ceval.c:4645

Core file can be downloaded from https://gha-artifacts.s3.amazonaws.com/pytorch/pytorch/2321687135/1/coredumps-default-1-4-linux.4xlarge.nvidia.gpu/test/core.936 and the offending whl package from https://gha-artifacts.s3.amazonaws.com/pytorch/pytorch/2321687135/1/linux-xenial-cuda11.3-py3.7-gcc7/artifacts.zip; both are listed among the artifacts at https://hud.pytorch.org/pr/77405

@kshitij12345
Collaborator Author

This PR didn't touch the failing test. @lezcano, have you seen such a failure previously?

@lezcano
Collaborator
lezcano commented May 14, 2022

Magma strikes again, this time with a new one. cc @IvanYashchuk @xwang233. It looks like memory corruption or insufficient resources?

@malfet does the test fail consistently?

@kshitij12345
Collaborator Author

@malfet looks like it was a one-off issue.

Can you approve this again so that I can land it?

Thanks!

@anjali411
Contributor

As discussed above, the failure looks unrelated (but recurrent). Should we disable that test while we figure out the issue? @lezcano

@lezcano
Collaborator
lezcano commented May 17, 2022

Are these errors caused by this PR or are they coming from some flaky behaviour in CI?
If it's the latter one, I guess we can skip them for now, but we should look into what's causing them. Could it be something related to the removal of torch.solve? cc @IvanYashchuk who wrote the removal.
I wonder whether these still happen on top of #74046, which heavily simplifies the implementation of linalg.solve.

@kshitij12345
Collaborator Author

AFAIK, the failure isn't directly related to this PR as it doesn't touch that function or test. Seems to be a flaky case.

Will close this PR and open a new one with this branch for merging. (IIRC, reopening and remerging the same PR leads to issues internally).

pytorchmergebot pushed a commit that referenced this pull request May 18, 2022
Reland: #77405
Ref: #74537

Enable for `permute, split, split_with_sizes, select, ravel, reshape, reshape_as, unfold, squeeze, unsqueeze, transpose`
Pull Request resolved: #77656
Approved by: https://github.com/anjali411
facebook-github-bot pushed a commit that referenced this pull request May 20, 2022
Summary:
Reland: #77405
Ref: #74537

Enable for `permute, split, split_with_sizes, select, ravel, reshape, reshape_as, unfold, squeeze, unsqueeze, transpose`

Pull Request resolved: #77656
Approved by: https://github.com/anjali411

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/687ab97338c434f2d428325fd742ae7cd3042b53

Reviewed By: seemethere

Differential Revision: D36494122

Pulled By: seemethere

fbshipit-source-id: cb2803bf28c9be46547437c3b52e3dfb63b52336