[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Latest version fails to launch command with "ERROR 0 at threadsync.cpp:176 void dmtcp::ThreadSync::wrapperExecutionLockLock(): Failed to acquire lock" #1158

Open
NeuralModder opened this issue Oct 2, 2024 · 3 comments

Comments

@NeuralModder
Copy link

This program seems like it could be useful to me. I've tried multiple tagged versions and branches so far; it seems that each generates a different error (whether during compilation or trying to launch a program)

On the latest version (3.1.0), the program builds successfully. However, if I try to launch any program, I get this error:

[youser@hoastnayme ~/src/dmtcp/bin]$ ./dmtcp_launch cat
ERROR 0 at threadsync.cpp:176 void dmtcp::ThreadSync::wrapperExecutionLockLock(): Failed to acquire lock

Getting the coordinator logs by running the coordinator in one terminal and ./dmtcp_launch -j -p 7779 cat in another doesn't seem (to me) to provide particularly helpful information either:

[2024-10-02T17:01:03.452, 26674, 26674, Note] at dmtcp_coordinator.cpp:933 in initializeComputation; REASON='Resetting computation
[2024-10-02T17:01:03.452, 26674, 26674, Note] at dmtcp_coordinator.cpp:1042 in onConnect; REASON='worker connected
     hello_remote.from = 5092d13623c01989-26813-2fd28d405cefe
     client->progname() = cat
[2024-10-02T17:01:03.463, 26674, 26674, Note] at dmtcp_coordinator.cpp:891 in onDisconnect; REASON='client disconnected
     client->identity() = 5092d13623c01989-26813-2fd28d405cefe
     client->progname() = cat
dmtcp>

I've tried looking at the named code location in threadsync.cpp, but I'm not very familiar with C++ and am rather puzzled.

Here's some system information that might be relevant:

  • Linux distribution: Gentoo Linux
  • Linux kernel: 6.10.3 (custom compiled)
  • gcc (GNU Compiler Collection) version: 14.2.0
  • glibc version: 2.40

Could my glibc version be too recent? Is there anything I can try to solve this problem, or any more information I could provide?

@gc00
Copy link
Contributor
gc00 commented Oct 14, 2024

Hi @NeuralModder ,

Thanks for the feedback. I'm not able to reproduce this bug. For example, I'm using glibc-2.37, with Fedora 38, and I'm nnot seeing it. We are now also on DMTCP-3.1.2 (recent fix for a regression).

If you have the chance, could you try:

./configure --enable-debug && make clean && make -j8

and then try:

gdb --args bin/dmtcp_launch cat
(gdb) run
(gdb) thread apply all where

And then paste the result here. Thanks.

@NeuralModder
Copy link
Author

Hi @gc00,

Thanks for the response. I updated to 3.1.2 to be thorough, though this unsurprisingly did not resolve the issue.

I ran the configure script with --enable-debug and rebuilt the program, as you described, then I ran the gdb commands. Unfortunately, running that last command doesn't produce any output:

[youser@hoastnayme ~/src/dmtcp]$ gdb --args bin/dmtcp_launch cat
GNU gdb (Gentoo 15.2 vanilla) 15.2
[...]
Reading symbols from bin/dmtcp_launch...
(gdb) run
Starting program: /home/youser/src/dmtcp/bin/dmtcp_launch cat
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib64/libthread_db.so.1".
[Detaching after fork from child process 20577]
[Detaching after fork from child process 20579]
[Detaching after vfork from child process 20580]
process 20574 is executing new program: /usr/bin/cat
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib64/libthread_db.so.1".
ERROR 0 at threadsync.cpp:176 void dmtcp::ThreadSync::wrapperExecutionLockLock(): Failed to acquire lock
[Inferior 1 (process 20574) exited with code 0143]
(gdb) thread apply all where
(gdb)

Simply running where prints No stack.. I'm not quite knowledgeable enough with debuggers to understand why gdb can't print a backtrace here.

Nevertheless, using the file and line number that gets printed, I can set a breakpoint:

[youser@hoastnayme ~/src/dmtcp]$ gdb --args bin/dmtcp_launch cat
GNU gdb (Gentoo 15.2 vanilla) 15.2
[...]
Reading symbols from bin/dmtcp_launch...
(gdb) break threadsync.cpp:176
No source file named threadsync.cpp.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (threadsync.cpp:176) pending.
(gdb) run
Starting program: /home/youser/src/dmtcp/bin/dmtcp_launch cat
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib64/libthread_db.so.1".
[Detaching after fork from child process 24336]
[Detaching after fork from child process 24338]
[Detaching after vfork from child process 24339]
process 24333 is executing new program: /usr/bin/cat
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib64/libthread_db.so.1".

Breakpoint 1, dmtcp::ThreadSync::wrapperExecutionLockLock () at threadsync.cpp:176
176                   errno, __FILE__, __LINE__, __PRETTY_FUNCTION__);

Here, I can get a backtrace:

(gdb) thread apply all where

Thread 1 (Thread 0x7ffff7734780 (LWP 24333) "cat"):
#0  dmtcp::ThreadSync::wrapperExecutionLockLock () at threadsync.cpp:176
#1  0x00007ffff7d39cbb in dmtcp_plugin_disable_ckpt () at dmtcpplugin.cpp:143
#2  0x00007ffff7fbd3e0 in free (ptr=0x0) at alloc/mallocwrappers.cpp:74
#3  0x00007ffff7b6f4ad in ?? () from /usr/lib64/libc.so.6
#4  0x00007ffff7b6f6b7 in ?? () from /usr/lib64/libc.so.6
#5  0x00007ffff7d83399 in dmtcp::Util::getTimestampStr[abi:cxx11]() () at util_misc.cpp:701
#6  0x00007ffff7d842f1 in jalib::getTimestampStr[abi:cxx11]() () at ../jalib/jalib.cpp:225
#7  0x00007ffff7d85fa9 in jassert_internal::JAssert::JAssert (this=0x7fffffffc7e0, type=jassert_internal::JAssert::JAssertType::Error) at ../jalib/jassert.cpp:78
#8  0x00007ffff7d66dd3 in dlsym_default_internal_library_handler (handle=0x7ffff7dd4580, symbol=0x7ffff7dab649 "dlopen", version=0x0, tags_p=0x0, default_symbol_index_p=0x0) at dmtcp_dlsym.cpp:354
#9  0x00007ffff7d6762b in dlsym_default_internal_flag_handler (handle=0xffffffffffffffff, libname=0x0, symbol=0x7ffff7dab649 "dlopen", version=0x0, addr=0x7ffff7d9871b <dmtcp_prepare_wrappers+28>, tags_p=0x0, default_symbol_index_p=0x0) at dmtcp_dlsym.cpp:478
#10 0x00007ffff7d96a00 in initialize_libc_wrappers () at syscallsreal.c:204
#11 0x00007ffff7d9871b in dmtcp_prepare_wrappers () at syscallsreal.c:230
#12 0x00007ffff7d99536 in _real_dup2 (oldfd=827, newfd=827) at syscallsreal.c:462
#13 0x00007ffff7d83e90 in jalib::dup2 (oldfd=827, newfd=827) at ../jalib/jalib.cpp:146
#14 0x00007ffff7d864b7 in jassert_internal::jassert_init () at ../jalib/jassert.cpp:160
#15 0x00007ffff7d83a7b in jalib_init (_jalibFuncPtrs=..., elfInterpreter=0x7ffff7da5828 "/lib64/ld-linux-x86-64.so.2", stderrFd=827, jassertLogFd=828, dmtcp_fail_rc=99) at ../jalib/jalib.cpp:66
#16 0x00007ffff7d679c0 in initializeJalib () at jalibinterface.cpp:65
#17 0x00007ffff7d842e1 in jalib::getTimestampStr[abi:cxx11]() () at ../jalib/jalib.cpp:225
#18 0x00007ffff7d85fa9 in jassert_internal::JAssert::JAssert (this=0x7fffffffd370, type=jassert_internal::JAssert::JAssertType::Error) at ../jalib/jassert.cpp:78
#19 0x00007ffff7d66dd3 in dlsym_default_internal_library_handler (handle=0x7ffff7dd4580, symbol=0x7ffff7dab649 "dlopen", version=0x0, tags_p=0x0, default_symbol_index_p=0x0) at dmtcp_dlsym.cpp:354
#20 0x00007ffff7d6762b in dlsym_default_internal_flag_handler (handle=0xffffffffffffffff, libname=0x0, symbol=0x7ffff7dab649 "dlopen", version=0x0, addr=0x7ffff7d9871b <dmtcp_prepare_wrappers+28>, tags_p=0x0, default_symbol_index_p=0x0) at dmtcp_dlsym.cpp:478
#21 0x00007ffff7d96a00 in initialize_libc_wrappers () at syscallsreal.c:204
#22 0x00007ffff7d9871b in dmtcp_prepare_wrappers () at syscallsreal.c:230
#23 0x00007ffff7d9b47f in _real_syscall (sys_num=186) at syscallsreal.c:845
#24 0x00007ffff7d53b57 in dmtcp::ThreadList::initThread (th=0x7ffff6740008) at threadlist.cpp:283
#25 0x00007ffff7d5383f in dmtcp::ThreadList::init () at threadlist.cpp:208
#26 0x00007ffff7d5362c in dmtcp_get_current_thread () at threadlist.cpp:125
#27 0x00007ffff7d57252 in dmtcp::ThreadSync::wrapperExecutionLockLock () at threadsync.cpp:170
#28 0x00007ffff7d39cbb in dmtcp_plugin_disable_ckpt () at dmtcpplugin.cpp:143
#29 0x00007ffff7fbd207 in malloc (size=73728) at alloc/mallocwrappers.cpp:40
#30 0x00007ffff78cd3f2 in ?? () from /usr/lib/gcc/x86_64-pc-linux-gnu/14/libstdc++.so.6
#31 0x00007ffff7fcd51e in call_init () from /lib64/ld-linux-x86-64.so.2
#32 0x00007ffff7fcd60c in _dl_init () from /lib64/ld-linux-x86-64.so.2
#33 0x00007ffff7fe54e0 in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
#34 0x0000000000000001 in ?? ()
#35 0x00007fffffffe1f2 in ?? ()
#36 0x0000000000000000 in ?? ()

From here, I can step twice before the program terminates at the third step:

(gdb) step
175           fprintf(stderr, "ERROR %d at %s:%d %s: Failed to acquire lock\n",
(gdb) step
ERROR 0 at threadsync.cpp:176 void dmtcp::ThreadSync::wrapperExecutionLockLock(): Failed to acquire lock
177           _exit(DMTCP_FAIL_RC);
(gdb) step
[Inferior 1 (process 24333) exited with code 0143]

I can also set the breakpoint one stack frame earlier, and step through to the end:

[youser@hoastnayme ~/src/dmtcp]$ gdb --args bin/dmtcp_launch cat
GNU gdb (Gentoo 15.2 vanilla) 15.2
[...]
Reading symbols from bin/dmtcp_launch...
(gdb) break dmtcp_plugin_disable_ckpt()
Function "dmtcp_plugin_disable_ckpt()" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (dmtcp_plugin_disable_ckpt()) pending.
(gdb) run
Starting program: /home/youser/src/dmtcp/bin/dmtcp_launch cat
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib64/libthread_db.so.1".
[Detaching after fork from child process 31630]
[Detaching after fork from child process 31632]
[Detaching after vfork from child process 31633]
process 31627 is executing new program: /usr/bin/cat
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib64/libthread_db.so.1".

Breakpoint 1, dmtcp_plugin_disable_ckpt () at dmtcpplugin.cpp:143
143       ThreadSync::wrapperExecutionLockLock();
(gdb) c
Continuing.

Breakpoint 1, dmtcp_plugin_disable_ckpt () at dmtcpplugin.cpp:143
143       ThreadSync::wrapperExecutionLockLock();
(gdb) step
dmtcp::ThreadSync::wrapperExecutionLockLock () at threadsync.cpp:168
168       int saved_errno = errno;
(gdb) step
170       Thread *thread = dmtcp_get_current_thread();
(gdb) step
dmtcp_get_current_thread () at threadlist.cpp:119
119     {
(gdb) step
120       if (curThread != nullptr) {
(gdb) step
121         return curThread;
(gdb) step
129     }
(gdb) step
dmtcp::ThreadSync::wrapperExecutionLockLock () at threadsync.cpp:172
172       if (thread->wrapperLockCount == 0) {
(gdb) step
174         if (DmtcpRWLockRdLock(&_wrapperExecutionLock) != 0) {
(gdb) step
DmtcpRWLockRdLock (rwlock=0x7ffff7dccf80 <_wrapperExecutionLock>) at rwlock.cpp:19
19      {
(gdb) step
21        if (rwlock->writerTid == gettid()) {
(gdb) step
gettid () at threadwrappers.cpp:43
43        Thread *thread = dmtcp_get_current_thread();
(gdb) step
dmtcp_get_current_thread () at threadlist.cpp:119
119     {
(gdb) step
120       if (curThread != nullptr) {
(gdb) step
121         return curThread;
(gdb) step
129     }
(gdb) step
gettid () at threadwrappers.cpp:44
44        return thread->tid;
(gdb) step
45      }
(gdb) step
DmtcpRWLockRdLock (rwlock=0x7ffff7dccf80 <_wrapperExecutionLock>) at rwlock.cpp:22
22          return EDEADLK;
(gdb) step
48      }
(gdb) step
dmtcp::ThreadSync::wrapperExecutionLockLock () at threadsync.cpp:176
176                   errno, __FILE__, __LINE__, __PRETTY_FUNCTION__);
(gdb) step
175           fprintf(stderr, "ERROR %d at %s:%d %s: Failed to acquire lock\n",
(gdb) step
ERROR 0 at threadsync.cpp:176 void dmtcp::ThreadSync::wrapperExecutionLockLock(): Failed to acquire lock
177           _exit(DMTCP_FAIL_RC);
(gdb) step
[Inferior 1 (process 31627) exited with code 0143]

So, taking a glance at that error in rwlock.cpp, there appears to be some kind of deadlock condition. That's all I can gleam, anyway.

I hope some of this is useful. I understand remotely debugging an unreproducible issue by proxy can be frustrating...

@gc00
Copy link
Contributor
gc00 commented Nov 27, 2024

@NeuralModder ,
I have a machine available now with glibc-2.37 and gcc-13.2. So, that should be close to your environment.
I tested again, and I still can't reproduce the bug.

Here are some general comments, in case it helps:

  1. I assume that when you do make check, the DMTCP test suite is working. Is it only dmtcp_launch cat that is failing?
  2. In looking through your stack trace, the best that I can say is that it is failing for you very early in the startup. Effectively, DMTCP reaches a new system call and tries to prepare the DMTCP wrapper function for it. To do this, DMTCP calls dlopen. But the call to dlopen fails with an assertion JASSERT. Looking at the source code for your assert: dmtcp_dlsym.cpp:354, it shows that DMTCP was calling dlsym_default_internal_library_handler(), and get_dt_tags() must have failed, and returned a NULL hash.

My best guess is that there might have been a change in some API from glibc-2.37 to glibc-2.40, but I'm not at all sure of that guess.

If the standard make check is also failing for you, we'd like to look at this more closely. If you're seeing make check fail, please let us know.

But otherwise, the easiest thing for us (assuming it's only dmtcp_launch cat that fails), will be to wait for one of our distros to start using glibc-2.40, and then we can look at this again.

I hope this is reasonable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants