CUDA 9 + CUDNN 7 - seg fault · Issue #3081 · pytorch/pytorch · GitHub

CUDA 9 + CUDNN 7 - seg fault #3081


Closed
esube opened this issue Oct 11, 2017 · 9 comments

Comments

@esube
esube commented Oct 11, 2017

I have updated my CUDA to 9 and cuDNN to 7003, checked out the latest pytorch, and compiled it with the new setup. When an exception happens, it produces a segmentation fault.

Previously (CUDA 8 and cuDNN 5.1), it used to handle exceptions with an error message and fail gracefully. The segmentation fault core dump leaves processes hanging around, which is unpleasant. A smaller network or a smaller minibatch size works fine with the same code.

@apaszke
Contributor
apaszke commented Oct 11, 2017

Can you please run your script under gdb (gdb --args <the command you usually use>, then type r)? Once the segfault happens type in bt and paste the output here.

@esube
Author
esube commented Oct 11, 2017
Thread 49 "python" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffe973b0700 (LWP 23151)]
torch::autograd::InputBuffer::add (this=this@entry=0x7ffe973afb10, pos=pos@entry=0, var=...)
    at torch/csrc/autograd/input_buffer.cpp:17
17        if (!item.first.defined()) {
(gdb) bt
#0  torch::autograd::InputBuffer::add (this=this@entry=0x7ffe973afb10, pos=pos@entry=0, var=...)
    at torch/csrc/autograd/input_buffer.cpp:17
#1  0x00007fffb1951fad in torch::autograd::Engine::evaluate_function (
    this=this@entry=0x7fffb2b95d00 <engine>, task=...) at torch/csrc/autograd/engine.cpp:268
#2  0x00007fffb195354e in torch::autograd::Engine::thread_main (this=0x7fffb2b95d00 <engine>, 
    graph_task=0x0) at torch/csrc/autograd/engine.cpp:144
#3  0x00007fffb1950382 in torch::autograd::Engine::thread_init (
    this=this@entry=0x7fffb2b95d00 <engine>, device=device@entry=1)
    at torch/csrc/autograd/engine.cpp:121
#4  0x00007fffb19728ea in torch::autograd::python::PythonEngine::thread_init (
    this=0x7fffb2b95d00 <engine>, device=1) at torch/csrc/autograd/python_engine.cpp:28
#5  0x00007ffff3967c80 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007ffff7bc16ba in start_thread (arg=0x7ffe973b0700) at pthread_create.c:333
#7  0x00007ffff78f73dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

@apaszke
Contributor
apaszke commented Oct 12, 2017

Hmm that's still not enough information to solve the issue. Can you please prepare a script that would let us reproduce the problem?

@esube
Author
esube commented Oct 12, 2017

The code base is too large to share. But I just noticed that a medium network (ResNet34 base network with various modifications) produced the same segmentation fault, although the GPUs (2 Titan X) didn't run out of memory. The strange thing is: switching to ResNet50 (used as a feature extractor) instead of ResNet34 dies earlier. However, ResNet18 trains and finishes without a hitch. All three networks use the same code base, the same dataset, and the same DataParallel (2 GPUs) setup. It appears that some exceptions are not being handled.

The machine has an Intel Xeon CPU and Titan X GPUs. It could be related to #3089, so I am changing the title of this one to drop the out-of-memory part. For me, pytorch + CUDA 8 + cuDNN v5.1 never produced this segfault. I switched to CUDA 9 + cuDNN v7 this week, checked out pytorch, rebuilt it, and this started happening.

[Update]

  • Also, I noticed (on htop) that quite a lot more threads are spawned in the new setup (CUDA 9 + cuDNN 7 + latest pytorch) than in my previous setup.

  • I tried to run all three networks on a single GPU without DataParallel and, to my surprise, all three run just fine without the segfault. My data for each minibatch is large and the input size varies randomly per minibatch, handled with adaptive pooling. Although GPU memory comes close to full capacity on some minibatches, training runs fine on a single GPU, whereas the same setup used to throw a CUDA out-of-memory error on two GPUs!

So, at this point, it appears that the segfault is related to DataParallel and, I think, not to running out of memory or to whether the platform is ppc64 (the case of #3089) or Intel (my case). We could merge these two issues into one, since the problem is similar.

@esube esube changed the title from "CUDA 9 + CUDNN 7 - out of memory seg fault" to "CUDA 9 + CUDNN 7 - seg fault" on Oct 12, 2017
@lightChaserX
lightChaserX commented Oct 14, 2017

Same error: https://github.com/mingyuliutw/UNIT/issues/13

At around iteration 100, training ends with a segmentation fault.

i.e.,

Iteration: 00000092/02000000
Iteration: 00000093/02000000
Iteration: 00000094/02000000
Iteration: 00000095/02000000
Iteration: 00000096/02000000
Iteration: 00000097/02000000
Iteration: 00000098/02000000
Iteration: 00000099/02000000
Iteration: 00000100/02000000
Segmentation fault

The stack trace information:
Iteration: 00000101/02000000
Iteration: 00000102/02000000
Iteration: 00000103/02000000
Iteration: 00000104/02000000
Iteration: 00000105/02000000
Iteration: 00000106/02000000
Iteration: 00000107/02000000
Iteration: 00000108/02000000

Program received signal SIGSEGV, Segmentation fault.
0x0000555555632cb0 in ?? ()
(gdb) where
#0 0x0000555555632cb0 in ?? ()
#1 0x0000555555632d95 in ?? ()
#2 0x0000555555631f45 in ?? ()
#3 0x0000555555629b64 in _PyObject_GC_Malloc ()
#4 0x000055555562962d in _PyObject_GC_New ()
#5 0x000055555567d991 in ?? ()
#6 0x000055555566b87f in PyObject_GetIter ()
#7 0x000055555564ff90 in PyEval_EvalFrameEx ()
#8 0x000055555564d285 in PyEval_EvalCodeEx ()
#9 0x000055555566a08e in ?? ()
#10 0x000055555563b983 in PyObject_Call ()
#11 0x0000555555659460 in PyEval_CallObjectWithKeywords ()
#12 0x00007fff8f37becd in THPFunction_apply (cls=0x5555569afc80, _inputs=0x7ffff342b050) at torch/csrc/autograd/python_function.cpp:721
#13 0x000055555564f1aa in PyEval_EvalFrameEx ()
#14 0x000055555564d285 in PyEval_EvalCodeEx ()
#15 0x0000555555654d49 in PyEval_EvalFrameEx ()
#16 0x000055555564d285 in PyEval_EvalCodeEx ()
#17 0x000055555566a248 in ?? ()
#18 0x000055555563b983 in PyObject_Call ()
#19 0x00005555556516bd in PyEval_EvalFrameEx ()
#20 0x000055555564d285 in PyEval_EvalCodeEx ()
#21 0x000055555566a08e in ?? ()
#22 0x000055555563b983 in PyObject_Call ()
#23 0x00005555556805de in ?? ()
#24 0x000055555563b983 in PyObject_Call ()
#25 0x00005555556de6a7 in ?? ()
#26 0x000055555563b983 in PyObject_Call ()
#27 0x0000555555654c5f in PyEval_EvalFrameEx ()
#28 0x000055555564d285 in PyEval_EvalCodeEx ()
#29 0x000055555566a248 in ?? ()
#30 0x000055555563b983 in PyObject_Call ()
#31 0x00005555556516bd in PyEval_EvalFrameEx ()
#32 0x000055555564d285 in PyEval_EvalCodeEx ()
#33 0x000055555566a08e in ?? ()
#34 0x000055555563b983 in PyObject_Call ()
#35 0x00005555556805de in ?? ()
#36 0x000055555563b983 in PyObject_Call ()
#37 0x00005555556de6a7 in ?? ()
#38 0x000055555563b983 in PyObject_Call ()
#39 0x0000555555654c5f in PyEval_EvalFrameEx ()
#40 0x000055555564d285 in PyEval_EvalCodeEx ()
#41 0x000055555566a248 in ?? ()
#42 0x000055555563b983 in PyObject_Call ()
#43 0x00005555556516bd in PyEval_EvalFrameEx ()
#44 0x000055555564d285 in PyEval_EvalCodeEx ()
#45 0x000055555566a08e in ?? ()
#46 0x000055555563b983 in PyObject_Call ()
#47 0x00005555556805de in ?? ()
#48 0x000055555563b983 in PyObject_Call ()
#49 0x00005555556de6a7 in ?? ()
#50 0x000055555563b983 in PyObject_Call ()
#51 0x0000555555654c5f in PyEval_EvalFrameEx ()
#52 0x000055555564d285 in PyEval_EvalCodeEx ()
#53 0x000055555566a248 in ?? ()
#54 0x000055555563b983 in PyObject_Call ()
#55 0x00005555556516bd in PyEval_EvalFrameEx ()
#56 0x000055555564d285 in PyEval_EvalCodeEx ()
#57 0x000055555566a08e in ?? ()
#58 0x000055555563b983 in PyObject_Call ()
#59 0x00005555556805de in ?? ()
#60 0x000055555563b983 in PyObject_Call ()
#61 0x00005555556de6a7 in ?? ()
#62 0x000055555563b983 in PyObject_Call ()
#63 0x0000555555654c5f in PyEval_EvalFrameEx ()
#64 0x000055555564d285 in PyEval_EvalCodeEx ()
#65 0x000055555566a248 in ?? ()
#66 0x000055555563b983 in PyObject_Call ()
#67 0x00005555556516bd in PyEval_EvalFrameEx ()
#68 0x000055555564d285 in PyEval_EvalCodeEx ()
#69 0x000055555566a08e in ?? ()
#70 0x000055555563b983 in PyObject_Call ()
#71 0x00005555556805de in ?? ()
#72 0x000055555563b983 in PyObject_Call ()
#73 0x00005555556de6a7 in ?? ()
#74 0x000055555563b983 in PyObject_Call ()
#75 0x0000555555654c5f in PyEval_EvalFrameEx ()
#76 0x0000555555654a4f in PyEval_EvalFrameEx ()
#77 0x000055555564d285 in PyEval_EvalCodeEx ()
#78 0x000055555565555b in PyEval_EvalFrameEx ()
#79 0x000055555564d285 in PyEval_EvalCodeEx ()
#80 0x000055555564d029 in PyEval_EvalCode ()
#81 0x000055555567d42f in ?? ()
#82 0x00005555556783a2 in PyRun_FileExFlags ()
#83 0x0000555555677eee in PyRun_SimpleFileExFlags ()
#84 0x0000555555628ee1 in Py_Main ()
#85 0x00007ffff6f14b45 in __libc_start_main (main=0x555555628810, argc=8, argv=0x7fffffffeba8, init=, fini=, rtld_fini=, stack_end=0x7fffffffeb98) at libc-start.c:287
#86 0x000055555562870a in _start ()

@apaszke
Contributor
apaszke commented Oct 14, 2017

I understand that your codebase is large, but can you please try to write up a small sample that we could use to reproduce the problem? Just a tiny standalone snippet (e.g. create a model, wrap in data parallel, get random data, run and crash). It's really hard for us to debug issues via comments.
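
For reference, a standalone snippet of the kind being asked for might look roughly like the sketch below. This is hypothetical: it just wraps a stock torchvision ResNet34 in DataParallel over two GPUs and feeds it random data, and it leaves out the custom modifications, adaptive pooling, and variable input sizes described above, so it may or may not trigger the crash.

```python
import torch
import torch.nn as nn
from torch.autograd import Variable
import torchvision.models as models

# Hypothetical setup: ResNet34 spread over two GPUs, as in the report above.
model = nn.DataParallel(models.resnet34(), device_ids=[0, 1]).cuda()
criterion = nn.CrossEntropyLoss().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(200):
    # Random data standing in for the real dataset; batch size is a guess.
    inputs = Variable(torch.randn(64, 3, 224, 224).cuda())
    targets = Variable(torch.LongTensor(64).random_(1000).cuda())

    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()
```

Running something like that under gdb as suggested earlier (gdb --args python repro.py, then r, then bt after the segfault) would give a backtrace to compare against the ones already posted.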

@lightChaserX

@apaszke Thanks a lot. I have fixed this issue. It may have been caused by a dependency bug. I reinstalled Python and all the dependencies using conda and set the cuDNN version to 6 (previously 5.1). There are no issues left.

@apaszke
Contributor
apaszke commented Oct 16, 2017

@JhonsonWanger yeah, 5 is no longer supported

@mmderakhshani

Hi there,
I got the same error when forwarding a batch of inputs through my module. Here is the output of my gdb session:

Thread 73 "python" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffdd21c4700 (LWP 15179)]
0x00007fffa22ee8d5 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
(gdb) bt
#0 0x00007fffa22ee8d5 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#1 0x00007fffa243e914 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2 0x00007fffa23dae80 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3 0x00007ffff7bc16ba in start_thread (arg=0x7ffdd21c4700) at pthread_create.c:333
#4 0x00007ffff78f741d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Is this error related to my CUDA and cuDNN versions?
