CUDA 9 + CUDNN 7 - seg fault #3081
Comments
Can you please run your script under gdb and share the backtrace?
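For reference, one common way to capture a trace like the one requested above, without an interactive gdb session, is Python's built-in `faulthandler` module, which dumps a Python-level traceback when the process receives SIGSEGV. This is a minimal sketch, not something from the thread:

```python
# Minimal sketch: enable faulthandler so a segfault dumps a Python
# traceback to stderr instead of dying silently.
import faulthandler

faulthandler.enable()  # installs handlers for SIGSEGV, SIGFPE, SIGABRT, SIGBUS, SIGILL
print(faulthandler.is_enabled())
```

For the native (C++) side of the crash, running the interpreter under gdb, e.g. `gdb --args python script.py`, then `run` and `bt` after the crash, gives the C-level backtrace the maintainers are asking for here.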
Hmm, that's still not enough information to solve the issue. Can you please prepare a script that would let us reproduce the problem?
The code base is too large to share. But I just noticed that a medium network (ResNet34 base network with various modifications) produced the same segmentation fault, even though the GPUs (2 Titan X) didn't run out of memory. The strange thing is: switching to ResNet50 (used as a feature extractor) instead of ResNet34 dies earlier, yet ResNet18 trains and finishes without a hitch. All three networks use the same code base, the same dataset, and the same DataParallel setup (2 GPUs). It appears that some exceptions are not being handled. The machine has an Intel Xeon CPU and Titan X GPUs. It could be related to #3089, so I am changing the out-of-memory title on this issue. For me, pytorch + CUDA 8 + cuDNN v5.1 never produced this segfault. I switched to CUDA 9 + cuDNN v7 this week, checked out and rebuilt pytorch, and this started happening.
[Update]
At this point, it appears the segfault is related to DataParallel, and I don't think it is related to running out of memory or to whether the platform is ppc64 (the case of #3089) or Intel (my case). So we can merge these two issues, as the problem is similar.
Same error: https://github.com/mingyuliutw/UNIT/issues/13. At around iteration 100, training ends with a segmentation fault. The stack trace reads:
Program received signal SIGSEGV, Segmentation fault.
#86 0x000055555562870a in _start ()
I understand that your codebase is large, but can you please try to write a small sample that we could use to reproduce the problem? Just a tiny standalone snippet (e.g. create a model, wrap it in DataParallel, feed it random data, run, and crash). It's really hard for us to debug issues via comments.
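A hedged sketch of the kind of standalone repro being requested: a small model wrapped in DataParallel, fed random tensors. This assumes a PyTorch install; the layer sizes and names are illustrative, not taken from the reporter's codebase:

```python
# Hypothetical minimal repro skeleton: tiny model + DataParallel + random data.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(8 * 32 * 32, 10),
)
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)  # replicate across available GPUs
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

x = torch.randn(4, 3, 32, 32, device=device)       # random input batch
target = torch.randint(0, 10, (4,), device=device)  # random labels
loss = nn.CrossEntropyLoss()(model(x), target)
loss.backward()
print(loss.item())
```

A snippet of this shape lets maintainers swap in ResNet18/34/50 and different batch sizes to see exactly where the crash begins.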
@apaszke Thanks a lot. I have fixed this issue. It may have been caused by a dependency bug. I reinstalled Python and all the dependencies using conda and set the cuDNN version to 6 (previously 5.1). There are no issues left.
@JhonsonWanger yeah, cuDNN 5 is no longer supported
Hi there,
I have updated my CUDA to 9 and cuDNN to 7003, checked out the latest pytorch, and compiled it with the new setup. Now, when an exception happens, it produces a segmentation fault.
Previously (CUDA 8 and cuDNN 5.1), exceptions were handled with an error message and the program crashed gracefully. The segmentation fault (core dump) leaves processes hanging around, which is unpleasant. A smaller network or a smaller minibatch size works fine with the same code.
Is this error related to my CUDA and cuDNN versions?