core dumped (ver1.0.0) #16183
Comments
Hello, the same problem occurs on my system. I added the stacktrace, a system description, and a list showing the conda environment used. As can be seen in the stacktrace, the exception is thrown in Setting:
Conda environment:
Stacktrace: at ../csu/libc-start.c:310 #87 0x0000555555717e0e in _start () at ../sysdeps/x86_64/elf/start.S:103 (gdb) |
The problem is with mkl-dnn version 0.14.0, which is bundled with the PyPI package. This issue was resolved in later versions; see uxlfoundation/oneDNN#215 and uxlfoundation/oneDNN@a5f6077. Can the maintainers please update the PyPI package to include a more recent build of mkl-dnn? Thank you! |
@vvishal this is fixed in nightly builds and also v1.0.1 |
Will,
Thank you very much for your prompt attention. Installing via pip install
-U still says 1.0.0 is the latest version. Do we need to do anything
different to get v1.0.1?
Best,
Vishal
|
@vvishal v1.0.1 is not released yet, you can try out the nightly version: |
@yf225 -- Thank you so much for all the comments so far. I have been having the same problem and this thread helped a lot. Not sure if this is the right place to report this, but I still get the same kind of MKL-DNN related segmentation fault even in the nightly build for CUDA 10. Info: Ubuntu18.04/CUDA10/nightlybuild/conda install. The gdb stacktrace points to a crash in libcaffe2.so's MKL-DNN functions targeting AVX512 Skylake-Server instructions. Before, I was getting the exact same trace as @strobelTha. Going to try recompiling everything from source now to see if that helps matters. |
Yes, turns out you need HEAD of mkldnn - none of the releases including
0.17.2 have the full fix. The requisite patch went in around Dec 2018.
As a temporary workaround, you can clone and build mkldnn and simply
replace the bundled shared libraries with the ones you build - has been
working for me so far, caveat emptor. :-)
Vishal
|
@vvishal, could you please point to the patch you are referring to? If you are right and the issue is reproducible in PyTorch v1.0.1 we might want to backport that patch and release MKL-DNN v.0.17.3. |
@vvishal, thanks! Your solution seems to work for me as well. I replaced the current branches of ideep and mkl-dnn in third-party/ with their current master branch and compiled from source. @vpirogov, I am not sure which update exactly fixed it in mkl-dnn but from the look at its history it has quite a number of recent AVX512 updates (even since December). |
It's on line 216 in src/cpu/xbyak_util.h in the mkl-dnn sources; the
correct line should read:
    cores_sharing_data_cache[data_cache_levels] =
        (std::max)(actual_logical_cores / smt_width, 1u);
If the max() is not done, you get zero under some circumstances. This
results in getCacheSize() causing a divide by zero and that can get called
from multiple places leading to a slightly different stack trace, but
essentially the same problem. I think this line was added in commit
67393d999591c88f03d5b09d545b1bf19c46f836.
Thanks!
Vishal
|
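For readers following the explanation of the xbyak_util.h line, here is a minimal sketch of why the clamp matters. It is not the mkl-dnn code: the function names are hypothetical, and the "1 logical core, SMT width 2" input is only an assumed example of a topology where integer division truncates to zero. The last function stands in for any caller that divides by the sharing count, as getCacheSize() is described as doing above.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical stand-in for the unpatched computation: plain integer
// division, which truncates to 0 when actual_logical_cores < smt_width.
unsigned cores_sharing_unpatched(unsigned actual_logical_cores,
                                 unsigned smt_width) {
    assert(smt_width != 0);  // assumption: the SMT width itself is never 0
    return actual_logical_cores / smt_width;
}

// Hypothetical stand-in for the patched line quoted in the thread:
// the result is clamped so it can never drop below 1.
unsigned cores_sharing_patched(unsigned actual_logical_cores,
                               unsigned smt_width) {
    assert(smt_width != 0);
    return (std::max)(actual_logical_cores / smt_width, 1u);
}

// Hypothetical consumer that divides by the sharing count. Passing 0 here
// is an integer division by zero, which is undefined behavior in C++ and
// typically shows up as SIGFPE ("core dumped") on x86-64 Linux.
unsigned per_core_cache_bytes(unsigned cache_bytes, unsigned cores_sharing) {
    return cache_bytes / cores_sharing;
}
```

With these definitions, `cores_sharing_unpatched(1u, 2u)` yields 0, so feeding it to `per_core_cache_bytes` would divide by zero, while `cores_sharing_patched(1u, 2u)` yields 1 and the division is safe. The parentheses around `std::max` follow the quoted line; they are a common way to stop a `max` macro (for example from Windows headers) from being expanded.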
Should be fixed by #16653 |
Hello, big thanks to @vvishal; building with the latest mkl-dnn worked for me. One can easily do this by cloning the pytorch repo, navigating to the mkl-dnn subfolder, and checking out the latest version. After that, one can build pytorch as usual. The needed commands (from the cloned pytorch repo's main folder):
|
The mkl_dnn submodule is already at HEAD=0.17.3. |
- See http://nvbugs/2470530 and http://nvbugs/2506132 and pytorch/pytorch#16183
On my first try with ver1.0.0, I got a core dump.
My setup is as below:
When I installed torch==0.4.1, it worked.
How can I correctly install and use ver1.0.0?