CUDA 9 + CUDNN 7 - seg fault · Issue #3081 · pytorch/pytorch · GitHub

CUDA 9 + CUDNN 7 - seg fault #3081


Closed
esube opened this issue Oct 11, 2017 · 9 comments

Comments

@esube
esube commented Oct 11, 2017

I have updated my CUDA to 9 and cuDNN to 7003, checked out the latest pytorch, and compiled it with the new setup. When an exception happens, it produces a segmentation fault.

Previously (CUDA 8 and cuDNN 5.1), it used to handle exceptions with an error message and fail gracefully. The segmentation fault core dump leaves processes hanging around, which is unpleasant. A smaller network or a smaller minibatch size works fine with the same code.

@apaszke
Contributor
apaszke commented Oct 11, 2017

Can you please run your script under gdb (gdb --args <the command you usually use>, then type r)? Once the segfault happens type in bt and paste the output here.

@esube
Author
esube commented Oct 11, 2017
Thread 49 "python" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffe973b0700 (LWP 23151)]
torch::autograd::InputBuffer::add (this=this@entry=0x7ffe973afb10, pos=pos@entry=0, var=...)
    at torch/csrc/autograd/input_buffer.cpp:17
17        if (!item.first.defined()) {
(gdb) bt
#0  torch::autograd::InputBuffer::add (this=this@entry=0x7ffe973afb10, pos=pos@entry=0, var=...)
    at torch/csrc/autograd/input_buffer.cpp:17
#1  0x00007fffb1951fad in torch::autograd::Engine::evaluate_function (
    this=this@entry=0x7fffb2b95d00 <engine>, task=...) at torch/csrc/autograd/engine.cpp:268
#2  0x00007fffb195354e in torch::autograd::Engine::thread_main (this=0x7fffb2b95d00 <engine>, 
    graph_task=0x0) at torch/csrc/autograd/engine.cpp:144
#3  0x00007fffb1950382 in torch::autograd::Engine::thread_init (
    this=this@entry=0x7fffb2b95d00 <engine>, device=device@entry=1)
    at torch/csrc/autograd/engine.cpp:121
#4  0x00007fffb19728ea in torch::autograd::python::PythonEngine::thread_init (
    this=0x7fffb2b95d00 <engine>, device=1) at torch/csrc/autograd/python_engine.cpp:28
#5  0x00007ffff3967c80 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007ffff7bc16ba in start_thread (arg=0x7ffe973b0700) at pthread_create.c:333
#7  0x00007ffff78f73dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

@apaszke
Contributor
apaszke commented Oct 12, 2017

Hmm that's still not enough information to solve the issue. Can you please prepare a script that would let us reproduce the problem?

@esube
Author
esube commented Oct 12, 2017

The code base is too large to share. But I just noticed that a medium network (ResNet34 base network with various modifications) produced the same segmentation fault, although the GPUs (2 Titan X) didn't run out of memory. The strange thing is: switching to ResNet50 (used as a feature extractor) instead of ResNet34 dies earlier. However, ResNet18 trains and finishes without a hitch. All three networks use the same code base, the same dataset, and the same DataParallel (2 GPUs) setup. It appears that some exceptions are not being handled.

The machine has an Intel Xeon CPU and Titan X GPUs. It could be related to #3089, so I am changing the title of this one to drop the out-of-memory part. For me, pytorch + CUDA 8 + cuDNN v5.1 never produced this segfault. I switched to CUDA 9 + cuDNN v7 this week, checked out pytorch, rebuilt it, and this started happening.

[Update]

  • Also, I noticed (on htop) that quite a lot more threads are spawned in the new setup (CUDA 9 + cuDNN 7 + latest pytorch) than in my previous setup.

  • I tried to run all three networks on a single GPU without DataParallel and, to my surprise, all three run just fine without the segfault. My data for each minibatch is large and the input size varies randomly per minibatch, handled with adaptive pooling. Although GPU memory comes close to full capacity on some minibatches, training runs fine on a single GPU, whereas the same setup used to throw a CUDA out-of-memory error on two GPUs!

So, at this point, it appears that the segfault is related to DataParallel and, I think, not to running out of memory or to whether the platform is ppc64 (the case of #3089) or Intel (my case). We could merge these two issues into one, since the problem is similar.

@esube esube changed the title from "CUDA 9 + CUDNN 7 - out of memory seg fault" to "CUDA 9 + CUDNN 7 - seg fault" on Oct 12, 2017
@lightChaserX
lightChaserX commented Oct 14, 2017

Same error: https://github.com/mingyuliutw/UNIT/issues/13

At around iteration 100, training ends with a segmentation fault.

i.e.,

Iteration: 00000092/02000000
Iteration: 00000093/02000000
Iteration: 00000094/02000000
Iteration: 00000095/02000000
Iteration: 00000096/02000000
Iteration: 00000097/02000000
Iteration: 00000098/02000000
Iteration: 00000099/02000000
Iteration: 00000100/02000000
Segmentation fault

The stack trace information:
Iteration: 00000101/02000000
Iteration: 00000102/02000000
Iteration: 00000103/02000000
Iteration: 00000104/02000000
Iteration: 00000105/02000000
Iteration: 00000106/02000000
Iteration: 00000107/02000000
Iteration: 00000108/02000000

Program received signal SIGSEGV, Segmentation fault.
0x0000555555632cb0 in ?? ()
(gdb) where
#0 0x0000555555632cb0 in ?? ()
#1 0x0000555555632d95 in ?? ()
#2 0x0000555555631f45 in ?? ()
#3 0x0000555555629b64 in _PyObject_GC_Malloc ()
#4 0x000055555562962d in _PyObject_GC_New ()
#5 0x000055555567d991 in ?? ()
#6 0x000055555566b87f in PyObject_GetIter ()
#7 0x000055555564ff90 in PyEval_EvalFrameEx ()
#8 0x000055555564d285 in PyEval_EvalCodeEx ()
#9 0x000055555566a08e in ?? ()
#10 0x000055555563b983 in PyObject_Call ()
#11 0x0000555555659460 in PyEval_CallObjectWithKeywords ()
#12 0x00007fff8f37becd in THPFunction_apply (cls=0x5555569afc80, _inputs=0x7ffff342b050) at torch/csrc/autograd/python_function.cpp:721
#13 0x000055555564f1aa in PyEval_EvalFrameEx ()
#14 0x000055555564d285 in PyEval_EvalCodeEx ()
#15 0x0000555555654d49 in PyEval_EvalFrameEx ()
#16 0x000055555564d285 in PyEval_EvalCodeEx ()
#17 0x000055555566a248 in ?? ()
#18 0x000055555563b983 in PyObject_Call ()
#19 0x00005555556516bd in PyEval_EvalFrameEx ()
#20 0x000055555564d285 in PyEval_EvalCodeEx ()
#21 0x000055555566a08e in ?? ()
#22 0x000055555563b983 in PyObject_Call ()
#23 0x00005555556805de in ?? ()
#24 0x000055555563b983 in PyObject_Call ()
#25 0x00005555556de6a7 in ?? ()
#26 0x000055555563b983 in PyObject_Call ()
#27 0x0000555555654c5f in PyEval_EvalFrameEx ()
#28 0x000055555564d285 in PyEval_EvalCodeEx ()
#29 0x000055555566a248 in ?? ()
#30 0x000055555563b983 in PyObject_Call ()
#31 0x00005555556516bd in PyEval_EvalFrameEx ()
#32 0x000055555564d285 in PyEval_EvalCodeEx ()
#33 0x000055555566a08e in ?? ()
#34 0x000055555563b983 in PyObject_Call ()
#35 0x00005555556805de in ?? ()
#36 0x000055555563b983 in PyObject_Call ()
#37 0x00005555556de6a7 in ?? ()
#38 0x000055555563b983 in PyObject_Call ()
#39 0x0000555555654c5f in PyEval_EvalFrameEx ()
#40 0x000055555564d285 in PyEval_EvalCodeEx ()
#41 0x000055555566a248 in ?? ()
#42 0x000055555563b983 in PyObject_Call ()
#43 0x00005555556516bd in PyEval_EvalFrameEx ()
#44 0x000055555564d285 in PyEval_EvalCodeEx ()
#45 0x000055555566a08e in ?? ()
#46 0x000055555563b983 in PyObject_Call ()
#47 0x00005555556805de in ?? ()
#48 0x000055555563b983 in PyObject_Call ()
#49 0x00005555556de6a7 in ?? ()
#50 0x000055555563b983 in PyObject_Call ()
#51 0x0000555555654c5f in PyEval_EvalFrameEx ()
#52 0x000055555564d285 in PyEval_EvalCodeEx ()
#53 0x000055555566a248 in ?? ()
#54 0x000055555563b983 in PyObject_Call ()
#55 0x00005555556516bd in PyEval_EvalFrameEx ()
#56 0x000055555564d285 in PyEval_EvalCodeEx ()
#57 0x000055555566a08e in ?? ()
#58 0x000055555563b983 in PyObject_Call ()
#59 0x00005555556805de in ?? ()
#60 0x000055555563b983 in PyObject_Call ()
#61 0x00005555556de6a7 in ?? ()
#62 0x000055555563b983 in PyObject_Call ()
#63 0x0000555555654c5f in PyEval_EvalFrameEx ()
#64 0x000055555564d285 in PyEval_EvalCodeEx ()
#65 0x000055555566a248 in ?? ()
#66 0x000055555563b983 in PyObject_Call ()
#67 0x00005555556516bd in PyEval_EvalFrameEx ()
#68 0x000055555564d285 in PyEval_EvalCodeEx ()
#69 0x000055555566a08e in ?? ()
#70 0x000055555563b983 in PyObject_Call ()
#71 0x00005555556805de in ?? ()
#72 0x000055555563b983 in PyObject_Call ()
#73 0x00005555556de6a7 in ?? ()
#74 0x000055555563b983 in PyObject_Call ()
#75 0x0000555555654c5f in PyEval_EvalFrameEx ()
#76 0x0000555555654a4f in PyEval_EvalFrameEx ()
#77 0x000055555564d285 in PyEval_EvalCodeEx ()
#78 0x000055555565555b in PyEval_EvalFrameEx ()
#79 0x000055555564d285 in PyEval_EvalCodeEx ()
#80 0x000055555564d029 in PyEval_EvalCode ()
#81 0x000055555567d42f in ?? ()
#82 0x00005555556783a2 in PyRun_FileExFlags ()
#83 0x0000555555677eee in PyRun_SimpleFileExFlags ()
#84 0x0000555555628ee1 in Py_Main ()
#85 0x00007ffff6f14b45 in __libc_start_main (main=0x555555628810, argc=8, argv=0x7fffffffeba8, init=, fini=, rtld_fini=, stack_end=0x7fffffffeb98) at libc-start.c:287
#86 0x000055555562870a in _start ()

@apaszke
Contributor
apaszke commented Oct 14, 2017

I understand that your codebase is large, but can you please try to write up a small sample that we could use to reproduce the problem? Just a tiny standalone snippet (e.g. create a model, wrap in data parallel, get random data, run and crash). It's really hard for us to debug issues via comments.
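
For reference, a standalone snippet of the kind being asked for might look roughly like the sketch below. This is hypothetical: it just wraps a stock torchvision ResNet34 in DataParallel over two GPUs and feeds it random data, and it leaves out the custom modifications, adaptive pooling, and variable input sizes described above, so it may or may not trigger the crash.

```python
import torch
import torch.nn as nn
from torch.autograd import Variable
import torchvision.models as models

# Hypothetical setup: ResNet34 spread over two GPUs, as in the report above.
model = nn.DataParallel(models.resnet34(), device_ids=[0, 1]).cuda()
criterion = nn.CrossEntropyLoss().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(200):
    # Random data standing in for the real dataset; batch size is a guess.
    inputs = Variable(torch.randn(64, 3, 224, 224).cuda())
    targets = Variable(torch.LongTensor(64).random_(1000).cuda())

    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()
```

Running something like that under gdb as suggested earlier (gdb --args python repro.py, then r, then bt after the segfault) would give a backtrace to compare against the ones already posted.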

@lightChaserX

@apaszke Thanks a lot. I have fixed this issue. It may have been caused by a dependency bug. I reinstalled Python and all the dependencies using conda and set the cuDNN version to 6 (previously 5.1). There are no issues left.

@apaszke
Contributor
apaszke commented Oct 16, 2017

@JhonsonWanger yeah, 5 is no longer supported

@mmderakhshani

Hi there,
I got the same error when forwarding a batch of inputs through my module. Here is the output of my gdb session:

Thread 73 "python" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffdd21c4700 (LWP 15179)]
0x00007fffa22ee8d5 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
(gdb) bt
#0 0x00007fffa22ee8d5 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#1 0x00007fffa243e914 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2 0x00007fffa23dae80 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3 0x00007ffff7bc16ba in start_thread (arg=0x7ffdd21c4700) at pthread_create.c:333
#4 0x00007ffff78f741d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Is this error related to my CUDA and cuDNN versions?
