[Bug] Segmentation fault when importing sentencepiece (with v0.4.0) #8358

ryonakamura opened this issue Jun 12, 2018 · 16 comments
Closed

@ryonakamura
ryonakamura commented Jun 12, 2018

The following snippet reproduces a bug.

import sentencepiece as spm
import torch.nn as nn
l = nn.Linear(10, 10).cuda(0)

error:

Segmentation fault (core dumped)

This bug doesn’t occur at v0.3.1, it occurs at v0.4.0.

@ssnl
Collaborator
ssnl commented Jun 12, 2018

I can't repro on master with debug build.

@bheinzerling
Contributor

Getting a segfault on 0.5.0a0+f9633b9 when importing sentencepiece before torch.nn.
No segfault if the order of imports is switched.
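A minimal sketch of the import-order workaround reported above, assuming the only requirement is that torch is loaded before sentencepiece. The helper name `import_in_order` is hypothetical and not part of either library; it skips modules that are not installed so the snippet stays runnable anywhere.

```python
import importlib


def import_in_order(module_names):
    """Import modules in the given order and return the ones that loaded.

    Hypothetical helper for illustration only. Modules that are not
    installed are skipped rather than raising.
    """
    loaded = {}
    for name in module_names:
        try:
            loaded[name] = importlib.import_module(name)
        except ImportError:
            # Not installed in this environment; skip it.
            continue
    return loaded
```

Calling `import_in_order(["torch", "sentencepiece"])` at the very top of a script would load torch first, which is the order that reportedly avoids the crash. This only works around the symptom; it does not address the underlying binary incompatibility.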

@zou3519
Contributor
zou3519 commented Jun 13, 2018

How did you install sentencepiece? What version of it do you have?

@bheinzerling
Contributor

sentencepiece Python wrapper version 0.0.9, installed via pip

sentencepiece itself installed from source, looking at the date it should be this commit google/sentencepiece@c08f9c1

@ryonakamura
Author

Python wrapper via pip for Ubuntu 16.04.4 LTS.
Error occurred on both sentencepiece v0.0.5 and v0.1.0.

@ssnl
Collaborator
ssnl commented Jun 18, 2018

@ryonakamura Can you try gdb and give us the trace of the segfault? Thanks!
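As a lighter-weight complement to gdb, Python's standard `faulthandler` module can dump the Python-level traceback when the interpreter receives SIGSEGV. The sketch below writes that dump to a log file; the helper name `enable_crash_log` and the log path are assumptions for illustration. It shows only Python frames, so a gdb backtrace is still needed to see the native C++ frames.

```python
import faulthandler


def enable_crash_log(path):
    """Dump Python tracebacks to `path` on fatal signals such as SIGSEGV.

    Hypothetical helper. The returned file object must stay open for as
    long as faulthandler is enabled.
    """
    log = open(path, "w")
    faulthandler.enable(file=log, all_threads=True)
    return log
```

Calling `enable_crash_log("crash.log")` before the imports in the reproduction script would record which Python line was executing when the crash happened.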

@bheinzerling
Contributor
bheinzerling commented Jun 19, 2018

$ gdb python
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-80.el7
...
(gdb) run
Starting program: /env/bin/python
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Python 3.6.2 |Continuum Analytics, Inc.| (default, Jul 20 2017, 13:51:32)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> import sentencepiece
>>> import torch
Missing separate debuginfo for /env/lib/python3.6/site-packages/numpy/core/../../../../libiomp5.so
Detaching after fork from child process 9027.
>>> torch.tensor([0]).cuda()

Program received signal SIGSEGV, Segmentation fault.
0x000000000000000a in ?? ()
Missing separate debuginfos, use: debuginfo-install glibc-2.17-105.el7.x86_64
(gdb) backtrace
#0 0x000000000000000a in ?? ()
#1 0x00007ffff76c3bb0 in pthread_once () from /lib64/libpthread.so.0
#2 0x00007fffe1bc59b8 in at::Type::toBackend(at::Backend) const () from /env/lib/python3.6/site-packages/torch/lib/libATen_cpu.so
#3 0x00007fffe3334fd1 in torch::autograd::VariableType::toBackend (this=<optimized out>, b=<optimized out>) at torch/csrc/autograd/generated/VariableType.cpp:138
#4 0x00007fffe35c299d in torch::autograd::THPVariable_cuda (self=0x7fffefb98ab0, args=<optimized out>, kwargs=0x0) at torch/csrc/autograd/generated/python_variable_methods.cpp:326
#5 0x00007ffff7996302 in _PyCFunction_FastCallDict (func_obj=0x7fff7e259d38, args=0x7ffff7ebcd98, nargs=<optimized out>, kwargs=0x0) at Objects/methodobject.c:231
#6 0x00007ffff7a1bb8c in call_function (pp_stack=0x7fffffffc168, oparg=<optimized out>, kwnames=0x0) at Python/ceval.c:4809
#7 0x00007ffff7a1ed40 in _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:3295
#8 0x00007ffff7a1a100 in _PyEval_EvalCodeWithName (_co=0x7ffff7edbdb0, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=0, kwnames=0x0, kwargs=0x8, kwcount=0, kwstep=2,
defs=0x0, defcount=0, kwdefs=0x0, closure=0x0, name=0x0, qualname=0x0) at Python/ceval.c:4139
#9 0x00007ffff7a1a583 in PyEval_EvalCodeEx (_co=<optimized out>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=<optimized out>, kws=<optimized out>, kwcount=0, defs=0x0,
defcount=0, kwdefs=0x0, closure=0x0) at Python/ceval.c:4160
#10 0x00007ffff7a1a5cb in PyEval_EvalCode (co=<optimized out>, globals=<optimized out>, locals=<optimized out>) at Python/ceval.c:695
#11 0x00007ffff7a4f0f6 in run_mod (arena=0x7ffff7f78210, flags=0x7fffffffc4e0, locals=0x7ffff7f5bfc0, globals=0x7ffff7f5bfc0, filename=0x7ffff7f23068, mod=0x696d00) at Python/pythonrun.c:980
#12 PyRun_InteractiveOneObject (fp=<optimized out>, filename=0x7ffff7f23068, flags=0x7fffffffc4e0) at Python/pythonrun.c:233
#13 0x00007ffff7a4f45e in PyRun_InteractiveLoopFlags (fp=0x7ffff6da0640 <_IO_2_1_stdin_>, filename_str=<optimized out>, flags=0x7fffffffc4e0) at Python/pythonrun.c:112
#14 0x00007ffff7a4f59c in PyRun_AnyFileExFlags (fp=0x7ffff6da0640 <_IO_2_1_stdin_>, filename=0x7ffff7aea26b "<stdin>", closeit=0, flags=0x7fffffffc4e0) at Python/pythonrun.c:74
#15 0x00007ffff7a69abb in run_file (p_cf=0x7fffffffc4e0, filename=0x0, fp=0x7ffff6da0640 <_IO_2_1_stdin_>) at Modules/main.c:338
#16 Py_Main (argc=<optimized out>, argv=<optimized out>) at Modules/main.c:810
#17 0x0000000000400c1d in main (argc=1, argv=<optimized out>) at ./Programs/python.c:69

@ryonakamura
Author

@ssnl
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
...
(gdb) run
Starting program: /home/ryo/.pyenv/versions/anaconda3-4.1.1/bin/python
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Python 3.5.2 |Anaconda custom (64-bit)| (default, Jul 2 2016, 17:53:06)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> import sentencepiece as spm
>>> import torch.nn as nn
>>> l = nn.Linear(10, 10).cuda(0)

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7f62f60 in ?? ()
(gdb) backtrace
#0 0x00007ffff7f62f60 in ?? ()
#1 0x00007ffff76d8a99 in __pthread_once_slow (once_control=0x7fffe054e4d8 <at::globalContext()::globalContext_+408>,
init_routine=0x7ffff4d63fe1 <std::__once_proxy()>) at pthread_once.c:116
#2 0x00007fffbcce8626 in at::Type::toBackend(at::Backend) const ()
from /home/ryo/.pyenv/versions/anaconda3-4.1.1/lib/python3.5/site-packages/torch/lib/libATen.so
#3 0x00007fffe1ec3a01 in torch::autograd::VariableType::toBackend (this=<optimized out>, b=<optimized out>)
at torch/csrc/autograd/generated/VariableType.cpp:90
#4 0x00007fffe20fa7cd in torch::autograd::THPVariable_cuda (self=0x7ffff6363048, args=0x7ffff7e7add8, kwargs=0x0)
at torch/csrc/autograd/generated/python_variable_methods.cpp:323
#5 0x00007ffff79a3621 in PyCFunction_Call (func=0x7fffba9a4630, args=0x7ffff7e7add8, kwds=<optimized out>) at Objects/methodobject.c:98
#6 0x00007ffff7a2abd5 in call_function (oparg=<optimized out>, pp_stack=0x7fffffffd628) at Python/ceval.c:4705
#7 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:3236
#8 0x00007ffff7a2bb49 in _PyEval_EvalCodeWithName (_co=<optimized out>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=1,
kws=0x7fffbabeeb90, kwcount=0, defs=0x0, defcount=0, kwdefs=0x0, closure=0x7ffff63aeb38, name=0x7ffff7e6ebb0, qualname=0x7fffbaebec60)
at Python/ceval.c:4018
#9 0x00007ffff7a2adf5 in fast_function (nk=<optimized out>, na=1, n=<optimized out>, pp_stack=0x7fffffffd848, func=0x7ffff7f29f28) at Python/ceval.c:4813
#10 call_function (oparg=<optimized out>, pp_stack=0x7fffffffd848) at Python/ceval.c:4730
#11 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:3236
#12 0x00007ffff7a2b166 in fast_function (nk=<optimized out>, na=2, n=<optimized out>, pp_stack=0x7fffffffd9c8, func=0x7fffbacdcbf8) at Python/ceval.c:4803
#13 call_function (oparg=<optimized out>, pp_stack=0x7fffffffd9c8) at Python/ceval.c:4730
#14 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:3236
---Type <return> to continue, or q <return> to quit---
#15 0x00007ffff7a2bb49 in _PyEval_EvalCodeWithName (_co=<optimized out>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=2,
kws=0x7ffff7f129b0, kwcount=0, defs=0x7fffbaed9220, defcount=1, kwdefs=0x0, closure=0x0, name=0x7ffff44493e8, qualname=0x7fffbaec7670)
at Python/ceval.c:4018
#16 0x00007ffff7a2adf5 in fast_function (nk=<optimized out>, na=2, n=<optimized out>, pp_stack=0x7fffffffdbe8, func=0x7fffbacdcd08) at Python/ceval.c:4813
#17 call_function (oparg=<optimized out>, pp_stack=0x7fffffffdbe8) at Python/ceval.c:4730
#18 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:3236
#19 0x00007ffff7a2bb49 in _PyEval_EvalCodeWithName (_co=<optimized out>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=0,
kws=0x0, kwcount=0, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0, name=0x0, qualname=0x0) at Python/ceval.c:4018
#20 0x00007ffff7a2bcd8 in PyEval_EvalCodeEx (_co=<optimized out>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>,
argcount=<optimized out>, kws=<optimized out>, kwcount=0, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0) at Python/ceval.c:4039
#21 0x00007ffff7a2bd1b in PyEval_EvalCode (co=<optimized out>, globals=<optimized out>, locals=<optimized out>) at Python/ceval.c:777
#22 0x00007ffff7a53276 in run_mod (arena=0x73dfe0, flags=0x7fffffffdf50, locals=0x7ffff7f401c8, globals=0x7ffff7f401c8, filename=0x7ffff7eeb180,
mod=0x73fcd8) at Python/pythonrun.c:976
#23 PyRun_InteractiveOneObject (fp=<optimized out>, filename=0x7ffff7eeb180, flags=0x7fffffffdf50) at Python/pythonrun.c:233
#24 0x00007ffff7a535de in PyRun_InteractiveLoopFlags (fp=0x7ffff6dac8e0 <_IO_2_1_stdin_>, filename_str=<optimized out>, flags=0x7fffffffdf50)
at Python/pythonrun.c:112
#25 0x00007ffff7a5371c in PyRun_AnyFileExFlags (fp=0x7ffff6dac8e0 <_IO_2_1_stdin_>, filename=0x7ffff7aeae03 "<stdin>", closeit=0, flags=0x7fffffffdf50)
at Python/pythonrun.c:74
#26 0x00007ffff7a6da02 in run_file (p_cf=0x7fffffffdf50, filename=0x0, fp=0x7ffff6dac8e0 <_IO_2_1_stdin_>) at Modules/main.c:318
#27 Py_Main (argc=<optimized out>, argv=<optimized out>) at Modules/main.c:769
#28 0x0000000000400add in main (argc=1, argv=0x7fffffffe0c8) at ./Programs/python.c:65

@ssnl
Collaborator
ssnl commented Jun 19, 2018

@ryonakamura @bheinzerling Thank you for traces! We'll look into this.

@t-vi
Collaborator
t-vi commented Jul 20, 2018

Isn't this likely to be caused by the gcc versions again?
PyPI currently seems to have sentencepiece compiled with an older GCC version:

$ strings -a _sentencepiece.cpython-36m-x86_64-linux-gnu.so |grep "GCC: ("
GCC: (GNU) 4.1.2 20080704 (Red Hat 4.1.2-55)
GCC: (GNU) 4.8.2 20140120 (Red Hat 4.8.2-15)
GCC: (GNU) 4.1.2 20080704 (Red Hat 4.1.2-55)

Indeed, I get the segfault when using that but it goes away when I compile my own with GCC 5.
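The same check can be done without the `strings` utility. The following is a rough sketch, assuming the compiler identification is stored as plain ASCII `GCC: (...)` text inside the shared object, as the output above suggests. The function name `gcc_comments` is hypothetical, and unlike the raw `strings` output it drops duplicate entries.

```python
import re

# Matches e.g. b"GCC: (GNU) 4.8.2 20140120 (Red Hat 4.8.2-15)" and stops
# at the first byte that is not printable ASCII.
GCC_COMMENT = re.compile(rb"GCC: \([^)]*\) [ -~]*")


def gcc_comments(path):
    """Return the distinct 'GCC: (...)' strings embedded in a binary file."""
    with open(path, "rb") as fh:
        data = fh.read()
    seen = []
    for match in GCC_COMMENT.finditer(data):
        text = match.group().decode("ascii").rstrip()
        if text not in seen:
            seen.append(text)
    return seen
```

Running `gcc_comments` on both the sentencepiece extension module and the torch libraries gives a quick side-by-side view of which compilers produced each binary.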

@ssnl
Collaborator
ssnl commented Jul 20, 2018

@t-vi Ah seems like that is the reason indeed!

@ssnl
Collaborator
ssnl commented Jul 20, 2018

@t-vi Thank you!!!

I didn't look into it too much once I realized that this is not the pybind11 problem.

ssnl closed this as completed Jul 20, 2018
@ssnl
Collaborator
ssnl commented Jul 20, 2018

@bheinzerling @ryonakamura See @t-vi's comment above. PyTorch binaries are compiled with gcc 4.9.2. However, binaries built with gcc versions before and after that one are not compatible with each other, hence the segfault you see when pulling them into the same address space. Using a sentencepiece compiled with a later gcc will solve the issue.
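To make the comparison concrete, here is a small sketch that parses `GCC: (...)` identification strings and lists the versions that differ from a reference version such as the 4.9.2 mentioned above. The function names are hypothetical, and a version difference is only a hint that two binaries may not mix well, not proof of an incompatibility.

```python
import re

VERSION = re.compile(r"GCC: \([^)]*\) (\d+)\.(\d+)\.(\d+)")


def gcc_versions(comment_lines):
    """Return the sorted, distinct (major, minor, patch) tuples found."""
    versions = set()
    for line in comment_lines:
        match = VERSION.search(line)
        if match:
            versions.add(tuple(int(part) for part in match.groups()))
    return sorted(versions)


def differs_from(reference, comment_lines):
    """Return the versions in comment_lines that are not `reference`."""
    return [v for v in gcc_versions(comment_lines) if v != reference]
```

For the three lines quoted from the PyPI wheel earlier in the thread, `differs_from((4, 9, 2), lines)` would report both 4.1.2 and 4.8.2.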

@ryonakamura
Author

Thank you @ssnl and @t-vi !!

@mmistele
mmistele commented Aug 24, 2018

@t-vi Awesome to hear compiling with GCC 5 works! I'm super close to getting a working wheel compiled, but I ran into the following error when I run import sentencepiece after installing the wheel on a different machine than the one I compiled it on. Have you run into this, and if so, what did you do to fix it?
ImportError: libsentencepiece.so.0: cannot open shared object file: No such file or directory

Seems related to needing to run ldconfig -v after compiling the C++ part, but I already did so (in the docker container I used to compile it) - and I also ran auditwheel repair sentencepiece-0.1.4-cp36-cp36m-linux_x86_64.whl --plat linux_x86_64 to graft /usr/local/lib/libsentencepiece.so.0.0.0 -> .libs_sentencepiece/libsentencepiece-cf6cc06e.so.0.0.0.
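One way to narrow down an error like this is to ask the dynamic loader for the library directly, outside of the sentencepiece import. A minimal sketch using the standard `ctypes` module follows; the helper name `try_load` is hypothetical. Note that a successful call really does load the library into the current process.

```python
import ctypes


def try_load(soname):
    """Try to load a shared library by name.

    Returns (True, "") on success, or (False, loader error message) when
    the dynamic loader cannot open the library.
    """
    try:
        ctypes.CDLL(soname)
    except OSError as exc:
        return False, str(exc)
    return True, ""
```

If `try_load("libsentencepiece.so.0")` fails on the target machine, the library is not on the loader's search path there, which would point at how the wheel was packaged rather than at the Python wrapper.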

@mmistele
mmistele commented Aug 24, 2018

Good news, looks like @taku910 found the source of the bug (incompatibility around std::call_once and pthread_once involving protobuf) and is releasing a patch soon: google/sentencepiece#186
