core dumped (ver1.0.0) · Issue #16183 · pytorch/pytorch · GitHub

core dumped (ver1.0.0) #16183

Closed
tagucci opened this issue Jan 19, 2019 · 15 comments
Labels
module: crash Problem manifests as a hard crash, as opposed to a RuntimeError

Comments

@tagucci
tagucci commented Jan 19, 2019

On my first attempt to use ver1.0.0, importing torch crashed with a core dump.

$ ipython
In [1]: import torch
[1]    45369 floating point exception (core dumped)  ipython

My setup is as follows:

  • PyTorch Version: 1.0
  • OS : Ubuntu 16.04.5 LTS
  • How you installed PyTorch: pip
  • Python version: 3.6.5
  • CUDA/cuDNN version: CUDA9.0/cudnn7.4.1
  • GPU model: Tesla V100

When I installed torch==0.4.1, it worked.
How can I correctly install and use ver1.0.0?
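
Because the crash kills the whole interpreter, it can help to run the import in a child process and look at how that process exited. The following is a minimal sketch, not something posted in the thread; the helper name `try_import` is hypothetical, and the signal reporting assumes a POSIX system such as the Ubuntu machines described here.

```python
import signal
import subprocess
import sys


def try_import(module: str, python: str = sys.executable) -> str:
    """Import `module` in a child interpreter and describe how the child exited."""
    proc = subprocess.run(
        [python, "-c", f"import {module}"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    if proc.returncode == 0:
        return "ok"
    if proc.returncode < 0:
        # On POSIX, a negative return code means the child was killed by a signal.
        return f"killed by {signal.Signals(-proc.returncode).name}"
    return f"exited with status {proc.returncode}"


if __name__ == "__main__":
    print("torch:", try_import("torch"))
```

A floating point exception like the one above would be expected to show up as `killed by SIGFPE`, while a missing package shows up as a non-zero exit status.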

@zou3519 zou3519 added the module: crash Problem manifests as a hard crash, as opposed to a RuntimeError label Jan 22, 2019
@strobelTha

Hello,

The same problem occurs on my system. I have added the stack trace, a system description, and a listing of the conda environment used. As the stack trace shows, the exception is raised in _GLOBAL__sub_I_jit_avx512_common_conv_kernel.cpp. Does anyone else have this problem or know how to solve it?

Setting:

  • PyTorch Version: 1.0
  • OS : Ubuntu 18.04 LTS
  • How you installed PyTorch: conda
  • Python version: 3.7.2
  • CUDA/cuDNN version: CUDA9.0/cudnn7.4.1
  • GPU model: Tesla V100

Conda environment:

Name             Version       Build                           Channel
blas             1.0           mkl
ca-certificates  2018.12.5     0
certifi          2018.11.29    py37_0
cffi             1.11.5        py37he75722e_1
freetype         2.9.1         h8a8886c_1
intel-openmp     2019.1        144
jpeg             9b            h024ee3a_2
libedit          3.1.20181209  hc058e9b_0
libffi           3.2.1         hd88cf55_4
libgcc-ng        8.2.0         hdf63c60_1
libgfortran-ng   7.3.0         hdf63c60_0
libpng           1.6.36        hbc83047_0
libstdcxx-ng     8.2.0         hdf63c60_1
libtiff          4.0.10        h2733197_1001
mkl              2019.1        144
mkl_fft          1.0.10        py37ha843d7b_0
mkl_random       1.0.2         py37hd81dba3_0
ncurses          6.1           he6710b0_1
ninja            1.8.2         py37h6bb024c_1
numpy            1.15.4        py37h7e9f1db_0
numpy-base       1.15.4        py37hde5b4d6_0
olefile          0.46          py37_0
openssl          1.1.1a        h7b6447c_0
pillow           5.4.1         py37h34e0f95_0
pip              18.1          py37_0
pycparser        2.19          py37_0
python           3.7.2         h0371630_0
pytorch          1.0.0         py3.7_cuda9.0.176_cudnn7.4.1_1  pytorch
readline         7.0           h7b6447c_5
setuptools       40.6.3        py37_0
six              1.12.0        py37_0
sqlite           3.26.0        h7b6447c_0
tk               8.6.8         hbc83047_0
torchvision      0.2.1         py_2                            pytorch
wheel            0.32.3        py37_0
xz               5.2.4         h14c3975_4
zlib             1.2.11        h7b6447c_3

Stacktrace:
Program received signal SIGFPE, Arithmetic exception.
0x00007fffb7e179c0 in _GLOBAL__sub_I_jit_avx512_common_conv_kernel.cpp () from /home/leo/miniconda3/envs/torch-dbg/lib/python3.6/site-packages/torch/lib/libmkldnn.so.0
(gdb) bt
#0 0x00007fffb7e179c0 in _GLOBAL__sub_I_jit_avx512_common_conv_kernel.cpp () from /home/leo/miniconda3/envs/torch-dbg/lib/python3.6/site-packages/torch/lib/libmkldnn.so.0
#1 0x00007ffff7de5733 in call_init (env=0x7fffffffe208, argv=0x7fffffffe1f8, argc=1, l=) at dl-init.c:72
#2 _dl_init (main_map=main_map@entry=0x555555a04090, argc=1, argv=0x7fffffffe1f8, env=0x7fffffffe208) at dl-init.c:119
#3 0x00007ffff7dea1ff in dl_open_worker (a=a@entry=0x7fffffffba80) at dl-open.c:522
#4 0x00007ffff792c2df in __GI__dl_catch_exception (exception=0x7fffffffba60, operate=0x7ffff7de9dc0 <dl_open_worker>, args=0x7fffffffba80) at dl-error-skeleton.c:196
#5 0x00007ffff7de97ca in _dl_open (file=0x7ffff68d4cb0 "/home/leo/miniconda3/envs/torch-dbg/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so", mode=-2147483391,
caller_dlopen=0x55555574052a <_PyImport_FindSharedFuncptr+138>, nsid=, argc=1, argv=, env=0x7fffffffe208) at dl-open.c:605
#6 0x00007ffff75c1f96 in dlopen_doit (a=a@entry=0x7fffffffbcb0) at dlopen.c:66
#7 0x00007ffff792c2df in __GI__dl_catch_exception (exception=exception@entry=0x7fffffffbc50, operate=0x7ffff75c1f40 <dlopen_doit>, args=0x7fffffffbcb0) at dl-error-skeleton.c:196
#8 0x00007ffff792c36f in __GI__dl_catch_error (objname=0x55555592eac0, errstring=0x55555592eac8, mallocedp=0x55555592eab8, operate=, args=) at dl-error-skeleton.c:215
#9 0x00007ffff75c2735 in _dlerror_run (operate=operate@entry=0x7ffff75c1f40 <dlopen_doit>, args=args@entry=0x7fffffffbcb0) at dlerror.c:162
#10 0x00007ffff75c2051 in __dlopen (file=, mode=) at dlopen.c:87
#11 0x000055555574052a in _PyImport_FindSharedFuncptr () at /tmp/build/80754af9/python_1546130271559/work/Python/dynload_shlib.c:95
#12 0x000055555576b2f0 in _PyImport_LoadDynamicModuleWithSpec () at /tmp/build/80754af9/python_1546130271559/work/Python/importdl.c:129
#13 0x000055555576b540 in _imp_create_dynamic_impl.isra.12 (file=0x0, spec=0x7ffff682d0b8) at /tmp/build/80754af9/python_1546130271559/work/Python/import.c:1994
#14 _imp_create_dynamic () at /tmp/build/80754af9/python_1546130271559/work/Python/clinic/import.c.h:289
#15 0x0000555555668711 in PyCFunction_Call () at /tmp/build/80754af9/python_1546130271559/work/Objects/methodobject.c:114
#16 0x00005555557164ad in do_call_core (kwdict=0x7ffff682eee8, callargs=0x7ffff6827e48, func=0x7ffff6bd0ea0) at /tmp/build/80754af9/python_1546130271559/work/Python/ceval.c:5116
#17 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1546130271559/work/Python/ceval.c:3404
#18 0x00005555556e58e4 in _PyEval_EvalCodeWithName () at /tmp/build/80754af9/python_1546130271559/work/Python/ceval.c:4166
#19 0x00005555556e6771 in fast_function () at /tmp/build/80754af9/python_1546130271559/work/Python/ceval.c:4992
#20 0x00005555556ec505 in call_function () at /tmp/build/80754af9/python_1546130271559/work/Python/ceval.c:4872
#21 0x000055555571138a in _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1546130271559/work/Python/ceval.c:3335
#22 0x00005555556e653b in _PyFunction_FastCall (globals=, nargs=2, args=, co=) at /tmp/build/80754af9/python_1546130271559/work/Python/ceval.c:4933
#23 fast_function () at /tmp/build/80754af9/python_1546130271559/work/Python/ceval.c:4968
#24 0x00005555556ec505 in call_function () at /tmp/build/80754af9/python_1546130271559/work/Python/ceval.c:4872
#25 0x000055555571138a in _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1546130271559/work/Python/ceval.c:3335
#26 0x00005555556e653b in _PyFunction_FastCall (globals=, nargs=1, args=, co=) at /tmp/build/80754af9/python_1546130271559/work/Python/ceval.c:4933
#27 fast_function () at /tmp/build/80754af9/python_1546130271559/work/Python/ceval.c:4968
#28 0x00005555556ec505 in call_function () at /tmp/build/80754af9/python_1546130271559/work/Python/ceval.c:4872
#29 0x000055555571138a in _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1546130271559/work/Python/ceval.c:3335
#30 0x00005555556e653b in _PyFunction_FastCall (globals=, nargs=1, args=, co=) at /tmp/build/80754af9/python_1546130271559/work/Python/ceval.c:4933
#31 fast_function () at /tmp/build/80754af9/python_1546130271559/work/Python/ceval.c:4968
#32 0x00005555556ec505 in call_function () at /tmp/build/80754af9/python_1546130271559/work/Python/ceval.c:4872
#33 0x000055555571138a in _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1546130271559/work/Python/ceval.c:3335
#34 0x00005555556e653b in _PyFunction_FastCall (globals=, nargs=2, args=, co=) at /tmp/build/80754af9/python_1546130271559/work/Python/ceval.c:4933
#35 fast_function () at /tmp/build/80754af9/python_1546130271559/work/Python/ceval.c:4968
#36 0x00005555556ec505 in call_function () at /tmp/build/80754af9/python_1546130271559/work/Python/ceval.c:4872
#37 0x000055555571138a in _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1546130271559/work/Python/ceval.c:3335
#38 0x00005555556e6bab in _PyFunction_FastCall (globals=, nargs=2, args=, co=) at /tmp/build/80754af9/python_1546130271559/work/Python/ceval.c:4933
#39 _PyFunction_FastCallDict () at /tmp/build/80754af9/python_1546130271559/work/Python/ceval.c:5035
#40 0x0000555555665b0f in _PyObject_FastCallDict () at /tmp/build/80754af9/python_1546130271559/work/Objects/abstract.c:2310
#41 0x00005555556a7810 in _PyObject_CallMethodIdObjArgs () at /tmp/build/80754af9/python_1546130271559/work/Objects/abstract.c:2796
#42 0x000055555565cb10 in PyImport_ImportModuleLevelObject () at /tmp/build/80754af9/python_1546130271559/work/Python/import.c:1578
#43 0x0000555555713a8b in import_name (level=0x555555892aa0 <small_ints+160>, fromlist=0x7ffff69f1198, name=0x7ffff69edf30, f=0x555555984fa8) at /tmp/build/80754af9/python_1546130271559/work/Python/ceval.c:5245
#44 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1546130271559/work/Python/ceval.c:2899
#45 0x00005555556e7289 in _PyEval_EvalCodeWithName (qualname=0x0, name=, closure=0x0, kwdefs=0x0, defcount=0, defs=0x0, kwstep=2, kwcount=, kwargs=0x0, kwnames=0x0, argcount=0,
args=0x0, locals=0x7ffff6ad7480, globals=0x7ffff6ad7480, _co=0x7ffff6ad8db0) at /tmp/build/80754af9/python_1546130271559/work/Python/ceval.c:4166
#46 PyEval_EvalCodeEx () at /tmp/build/80754af9/python_1546130271559/work/Python/ceval.c:4187
#47 0x00005555556e801c in PyEval_EvalCode (co=co@entry=0x7ffff6ad8db0, globals=globals@entry=0x7ffff6ad7480, locals=locals@entry=0x7ffff6ad7480)
at /tmp/build/80754af9/python_1546130271559/work/Python/ceval.c:731
#48 0x000055555570ec8b in builtin_exec_impl.isra.11 (locals=0x7ffff6ad7480, globals=0x7ffff6ad7480, source=0x7ffff6ad8db0) at /tmp/build/80754af9/python_1546130271559/work/Python/bltinmodule.c:983
#49 builtin_exec () at /tmp/build/80754af9/python_1546130271559/work/Python/clinic/bltinmodule.c.h:283
---Type <return> to continue, or q <return> to quit---
#50 0x0000555555668711 in PyCFunction_Call () at /tmp/build/80754af9/python_1546130271559/work/Objects/methodobject.c:114
#51 0x00005555557164ad in do_call_core (kwdict=0x7ffff6ad7558, callargs=0x7ffff6ad9208, func=0x7ffff7e63990) at /tmp/build/80754af9/python_1546130271559/work/Python/ceval.c:5116
#52 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1546130271559/work/Python/ceval.c:3404
#53 0x00005555556e58e4 in _PyEval_EvalCodeWithName () at /tmp/build/80754af9/python_1546130271559/work/Python/ceval.c:4166
#54 0x00005555556e6771 in fast_function () at /tmp/build/80754af9/python_1546130271559/work/Python/ceval.c:4992
#55 0x00005555556ec505 in call_function () at /tmp/build/80754af9/python_1546130271559/work/Python/ceval.c:4872
#56 0x000055555571138a in _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1546130271559/work/Python/ceval.c:3335
#57 0x00005555556e653b in _PyFunction_FastCall (globals=, nargs=2, args=, co=) at /tmp/build/80754af9/python_1546130271559/work/Python/ceval.c:4933
#58 fast_function () at /tmp/build/80754af9/python_1546130271559/work/Python/ceval.c:4968
#59 0x00005555556ec505 in call_function () at /tmp/build/80754af9/python_1546130271559/work/Python/ceval.c:4872
#60 0x000055555571138a in _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1546130271559/work/Python/ceval.c:3335
#61 0x00005555556e653b in _PyFunction_FastCall (globals=, nargs=1, args=, co=) at /tmp/build/80754af9/python_1546130271559/work/Python/ceval.c:4933
#62 fast_function () at /tmp/build/80754af9/python_1546130271559/work/Python/ceval.c:4968
#63 0x00005555556ec505 in call_function () at /tmp/build/80754af9/python_1546130271559/work/Python/ceval.c:4872
#64 0x000055555571138a in _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1546130271559/work/Python/ceval.c:3335
#65 0x00005555556e653b in _PyFunction_FastCall (globals=, nargs=2, args=, co=) at /tmp/build/80754af9/python_1546130271559/work/Python/ceval.c:4933
#66 fast_function () at /tmp/build/80754af9/python_1546130271559/work/Python/ceval.c:4968
#67 0x00005555556ec505 in call_function () at /tmp/build/80754af9/python_1546130271559/work/Python/ceval.c:4872
#68 0x000055555571138a in _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1546130271559/work/Python/ceval.c:3335
#69 0x00005555556e6bab in _PyFunction_FastCall (globals=, nargs=2, args=, co=) at /tmp/build/80754af9/python_1546130271559/work/Python/ceval.c:4933
#70 _PyFunction_FastCallDict () at /tmp/build/80754af9/python_1546130271559/work/Python/ceval.c:5035
#71 0x0000555555665b0f in _PyObject_FastCallDict () at /tmp/build/80754af9/python_1546130271559/work/Objects/abstract.c:2310
#72 0x00005555556a7810 in _PyObject_CallMethodIdObjArgs () at /tmp/build/80754af9/python_1546130271559/work/Objects/abstract.c:2796
#73 0x000055555565cb10 in PyImport_ImportModuleLevelObject () at /tmp/build/80754af9/python_1546130271559/work/Python/import.c:1578
#74 0x0000555555713a8b in import_name (level=0x555555892aa0 <small_ints+160>, fromlist=0x55555584bb30 <_Py_NoneStruct>, name=0x7ffff6acfed8, f=0x7ffff7e45a38)
at /tmp/build/80754af9/python_1546130271559/work/Python/ceval.c:5245
#75 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1546130271559/work/Python/ceval.c:2899
#76 0x00005555556e7289 in _PyEval_EvalCodeWithName (qualname=0x0, name=, closure=0x0, kwdefs=0x0, defcount=0, defs=0x0, kwstep=2, kwcount=, kwargs=0x0, kwnames=0x0, argcount=0,
args=0x0, locals=0x7ffff6b94120, globals=0x7ffff6b94120, _co=0x7ffff6b0eed0) at /tmp/build/80754af9/python_1546130271559/work/Python/ceval.c:4166
#77 PyEval_EvalCodeEx () at /tmp/build/80754af9/python_1546130271559/work/Python/ceval.c:4187
#78 0x00005555556e801c in PyEval_EvalCode (co=co@entry=0x7ffff6b0eed0, globals=globals@entry=0x7ffff6b94120, locals=locals@entry=0x7ffff6b94120)
at /tmp/build/80754af9/python_1546130271559/work/Python/ceval.c:731
#79 0x000055555576a3c4 in run_mod () at /tmp/build/80754af9/python_1546130271559/work/Python/pythonrun.c:1025
#80 0x00005555556321e6 in PyRun_InteractiveOneObjectEx (fp=fp@entry=0x7ffff7bb0a00 <IO_2_1_stdin>, filename=filename@entry=0x7ffff6b50998, flags=flags@entry=0x7fffffffdfec)
at /tmp/build/80754af9/python_1546130271559/work/Python/pythonrun.c:246
#81 0x000055555563239c in PyRun_InteractiveLoopFlags (fp=fp@entry=0x7ffff7bb0a00 <IO_2_1_stdin>, filename_str=filename_str@entry=0x5555557a224e "", flags=flags@entry=0x7fffffffdfec)
at /tmp/build/80754af9/python_1546130271559/work/Python/pythonrun.c:114
#82 0x000055555563243c in PyRun_AnyFileExFlags (fp=fp@entry=0x7ffff7bb0a00 <IO_2_1_stdin>, filename=0x5555557a224e "", closeit=closeit@entry=0, flags=flags@entry=0x7fffffffdfec)
at /tmp/build/80754af9/python_1546130271559/work/Python/pythonrun.c:75
#83 0x0000555555634237 in run_file (p_cf=0x7fffffffdfec, filename=, fp=0x7ffff7bb0a00 <IO_2_1_stdin>) at /tmp/build/80754af9/python_1546130271559/work/Modules/main.c:340
#84 Py_Main (argc=1, argv=0x5555558a8260) at /tmp/build/80754af9/python_1546130271559/work/Modules/main.c:811
#85 0x000055555563702e in main () at /tmp/build/80754af9/python_1546130271559/work/Programs/python.c:69
#86 0x00007ffff77e6b97 in __libc_start_main (main=0x555555636f40 <main>, argc=1, argv=0x7fffffffe1f8, init=, fini=, rtld_fini=, stack_end=0x7fffffffe1e8)
at ../csu/libc-start.c:310
#87 0x0000555555717e0e in _start () at ../sysdeps/x86_64/elf/start.S:103
(gdb)
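
In a backtrace this long, the informative frame is the one gdb attributes to a shared library (here frame #0, in libmkldnn.so.0); the rest is interpreter and loader plumbing. As a rough sketch, a saved backtrace can be scanned for that frame. The regular expression and the helper name `first_library_frame` are illustrative assumptions and only cover frame lines shaped like the ones above.

```python
import re
from typing import Optional

# Matches gdb frame lines such as:
#   #0 0x00007fffb7e179c0 in some_symbol () from /path/to/lib.so
FRAME_RE = re.compile(
    r"^#(?P<index>\d+)\s+"
    r"(?:0x[0-9a-fA-F]+ in )?"
    r"(?P<symbol>\S+) \(.*?\)"
    r"(?: from (?P<library>\S+))?"
)


def first_library_frame(backtrace: str) -> Optional[dict]:
    """Return the first frame that gdb attributes to a shared library, if any."""
    for line in backtrace.splitlines():
        match = FRAME_RE.match(line.strip())
        if match and match.group("library"):
            return match.groupdict()
    return None
```

Applied to the trace above, this would point at the static initializer for jit_avx512_common_conv_kernel.cpp inside libmkldnn.so.0, which runs while the library is being loaded during `import torch`.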

@vvishal
vvishal commented Jan 26, 2019

The problem is with mkl-dnn version 0.14.0, which is bundled with the PyPI package. The issue was resolved in later versions; see uxlfoundation/oneDNN#215 and uxlfoundation/oneDNN@a5f6077. Could the maintainers please update the PyPI package to include a more recent build of mkl-dnn? Thank you!

@yf225
Contributor
yf225 commented Jan 27, 2019

cc @soumith @pjh5

@yf225
Contributor
yf225 commented Jan 28, 2019

@vvishal this is fixed in nightly builds and also v1.0.1

@yf225 yf225 closed this as completed Jan 28, 2019
@vvishal
vvishal commented Jan 28, 2019 via email

@yf225
Contributor
yf225 commented Jan 28, 2019

@vvishal v1.0.1 is not released yet, you can try out the nightly version: pip install torch_nightly -f https://download.pytorch.org/whl/nightly/cu90/torch_nightly.html
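
Since `import torch` itself crashes on affected machines, the installed version has to be read without importing the package. A small sketch, assuming Python 3.8 or newer for `importlib.metadata`; the helper names are hypothetical, and treating exactly "1.0.0" as the affected release is an assumption taken from this thread rather than an official statement.

```python
from importlib import metadata
from typing import Optional

# Assumption taken from this thread: 1.0.0 is the release reported to crash.
REPORTED_RELEASE = "1.0.0"


def installed_version(dist: str = "torch") -> Optional[str]:
    """Read the version from package metadata; the package itself is never imported."""
    try:
        return metadata.version(dist)
    except metadata.PackageNotFoundError:
        return None


def matches_reported_release(version: str) -> bool:
    """True if `version` is the reported release, ignoring a local label such as '+cu90'."""
    return version.split("+")[0] == REPORTED_RELEASE


if __name__ == "__main__":
    # The nightly wheel mentioned above is distributed under the name torch_nightly.
    for dist in ("torch", "torch_nightly"):
        print(dist, installed_version(dist))
```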

@ehtom
ehtom commented Jan 30, 2019

@yf225 -- Thank you so much for all the comments so far, I have been having the same problem and this thread helped a lot.

Not sure if this is the right place to report this, but I still get the same kind of MKL-DNN-related segmentation fault, even with the nightly build for CUDA 10.

Info: Ubuntu 18.04 / CUDA 10 / nightly build / conda install.

The gdb stack trace points to a crash in libcaffe2.so's MKL-DNN functions targeting AVX512 Skylake-Server instructions. Before that, I was getting the exact same trace as @strobelTha.

Going to try recompiling everything from source now to see if that helps matters.
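
Both traces in this thread involve AVX512 code paths, so it may be worth confirming which AVX-512 features the CPU actually reports. The sketch below reads the Linux-only /proc/cpuinfo file; the helper names are hypothetical, and the presence of AVX-512 flags alone does not prove a machine is affected.

```python
from pathlib import Path
from typing import Set


def cpu_flags(cpuinfo_text: str) -> Set[str]:
    """Collect the feature flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def avx512_flags(cpuinfo_text: str) -> Set[str]:
    """Keep only the flags that name an AVX-512 feature, e.g. avx512f."""
    return {flag for flag in cpu_flags(cpuinfo_text) if flag.startswith("avx512")}


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")  # Linux only
    if cpuinfo.exists():
        found = sorted(avx512_flags(cpuinfo.read_text()))
        print(found or "no AVX-512 flags reported")
```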

@vvishal
vvishal commented Jan 30, 2019 via email

@vpirogov

@vvishal, could you please point to the patch you are referring to? If you are right and the issue is reproducible in PyTorch v1.0.1 we might want to backport that patch and release MKL-DNN v.0.17.3.

@ehtom
ehtom commented Jan 30, 2019

@vvishal, thanks!

Your solution seems to work for me as well. I replaced the current branches of ideep and mkl-dnn in third_party/ with their current master branches and compiled from source.

@vpirogov, I am not sure exactly which update fixed it in mkl-dnn, but from a look at its history there have been quite a number of recent AVX512 updates (even since December).

@vvishal
vvishal commented Jan 31, 2019 via email

@soumith soumith reopened this Jan 31, 2019
facebook-github-bot pushed a commit that referenced this issue Feb 1, 2019

Summary:
Upgrade mkl-dnn to 0.17.3 to fix core dump issue in #16183
Pull Request resolved: #16653

Differential Revision: D13918278

Pulled By: soumith

fbshipit-source-id: b9c09c50ef188b4099966216e155c9f3f2542276
@zou3519
Contributor
zou3519 commented Feb 4, 2019

Should be fixed by #16653

@zou3519 zou3519 closed this as completed Feb 4, 2019
@mitar
mitar commented Feb 12, 2019

I think that #16653 was reverted by #16660. So is this fixed or not?

@strobelTha

Hello,

Big thanks to @vvishal: building with the latest mkl-dnn worked for me. To do this, clone the pytorch repo, go to the mkl-dnn subfolder, and check out the latest version. After that, build pytorch as usual.

The commands needed (run from the main folder of the cloned pytorch repo):

cd third_party/ideep/mkl-dnn
git checkout origin/master
cd ../../..
python setup.py install

facebook-github-bot pushed a commit that referenced this issue Feb 15, 2019
Summary:
Upgrade mkl-dnn to 0.17.3 to fix core dump issue in #16183
Pull Request resolved: #17107

Differential Revision: D14097600

Pulled By: yinghai

fbshipit-source-id: 2baa44e211ce37fbdf01585344c98745f5ba008c
@zhanwenchen
zhanwenchen commented Mar 26, 2019

Hello,

Big thanks to @vvishal: building with the latest mkl-dnn worked for me. To do this, clone the pytorch repo, go to the mkl-dnn subfolder, and check out the latest version. After that, build pytorch as usual.

The commands needed (run from the main folder of the cloned pytorch repo):

cd third_party/ideep/mkl-dnn
git checkout origin/master
cd ../../..
python setup.py install

The mkl-dnn submodule is already at HEAD=0.17.3, so git checkout origin/master changes nothing. The only solution for me was to download mkl-dnn 0.18.1 and paste its contents into the submodule folder. It has to be 0.18.1: 0.17.4 causes the same error.
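
Before rebuilding, it can save time to check which mkl-dnn version the submodule is actually pinned to. The following sketch asks git to describe the checkout and compares the result against 0.18.1. The helper names are hypothetical, the `v0.17.3`-style tag format is an assumption about how mkl-dnn tags its releases, and the 0.18.1 threshold comes only from the report above.

```python
import re
import subprocess
from typing import Optional, Tuple

# Assumption from the comment above: 0.18.1 was the first version that worked there.
MINIMUM = (0, 18, 1)


def parse_tag(describe_output: str) -> Optional[Tuple[int, ...]]:
    """Extract (major, minor, patch) from text like 'v0.17.3' or 'v0.17.3-12-gabc1234'."""
    match = re.match(r"v?(\d+)\.(\d+)(?:\.(\d+))?", describe_output.strip())
    if not match:
        return None
    return tuple(int(part or 0) for part in match.groups())


def describe_checkout(path: str) -> str:
    """Ask git which tag the checkout at `path` is based on."""
    result = subprocess.run(
        ["git", "-C", path, "describe", "--tags", "--always"],
        check=True,
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return result.stdout.strip()


def new_enough(describe_output: str) -> bool:
    """True if the described version is at least MINIMUM; False if it cannot be parsed."""
    version = parse_tag(describe_output)
    return version is not None and version >= MINIMUM
```

From the root of a pytorch clone, `new_enough(describe_checkout("third_party/ideep/mkl-dnn"))` would then indicate whether the submodule needs to be replaced with a newer mkl-dnn before building.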

kaixih pushed a commit to kaixih/tensorflow that referenced this issue Jun 5, 2019
- See http://nvbugs/2470530 and http://nvbugs/2506132
  and pytorch/pytorch#16183