Relay crash. Tx may send dead relay msg from tx_status_update (#9920)
@Gerold103 @Serpentian I managed to accidentally discover a crash in the relay when running many instances of the

```
(lldb) bt
* thread #1
  * frame #0: 0x00007ff809edadb6 libsystem_kernel.dylib`__pthread_kill + 10
    frame #1: 0x00007ff809f13ebd libsystem_pthread.dylib`pthread_kill + 262
    frame #2: 0x00007ff809e39a79 libsystem_c.dylib`abort + 126
    frame #3: 0x00007ff809e38d68 libsystem_c.dylib`__assert_rtn + 314
    frame #4: 0x0000000105b7f896 tarantool`cpipe_push_input(pipe=0x00007fee55011b68, msg=0x00007fee55011bc8) at cbus.h:224:2
    frame #5: 0x0000000105b7f5dd tarantool`cpipe_push(pipe=0x00007fee55011b68, msg=0x00007fee55011bc8) at cbus.h:241:2
    frame #6: 0x0000000105b7f58f tarantool`tx_status_update(msg=0x00007fee55011bc8) at relay.cc:640:2
    frame #7: 0x0000000105ce7acd tarantool`cmsg_deliver(msg=0x00007fee55011bc8) at cbus.c:350:2
    frame #8: 0x0000000105ce9be4 tarantool`fiber_pool_f(ap=0x0000000107c12228) at fiber_pool.c:65:3
    frame #9: 0x00000001059d7eea tarantool`fiber_cxx_invoke(f=(tarantool`fiber_pool_f at fiber_pool.c:38), ap=0x0000000107c12228) at fiber.h:1324:10
    frame #10: 0x0000000105cdbd51 tarantool`fiber_loop(data=0x0000000000000000) at fiber.c:1162:18
    frame #11: 0x00000001060eea2e tarantool`coro_init at coro.c:108:3
```

Coredump: https://drive.google.com/file/d/1DnA6vps_V3TbD93_YmDNxA27D_KpGqwP/view?usp=sharing

Maybe it is a different bug; if so, please open a new issue.
Nope, looks like this bug. Very interesting how it managed to happen.
The bug was discovered during the development of the quorum promote feature -- basically, it causes qpromote_several_outstanding_promotes_test.lua to fail spontaneously. Furthermore, we suspect that this is the underlying cause of lost heartbeats and subsequent severe replication lags.

Long story short, whenever we want to send a message from the TX thread back to a relay thread, we should first check if they are still connected. Otherwise, we'll see:

* an assertion failure in debug builds, or
* (presumably) a relay hangup in release builds due to `if (status_msg->msg.route != NULL) return;` in relay_status_update() -> relay_check_status_needs_update().

The upstream is aware of this issue: tarantool#9920

Backtrace:

```
__pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
0x00007f891f6a5463 in __pthread_kill_internal (threadid=<optimized out>, signo=6) at pthread_kill.c:78
0x00007f891f64c120 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
0x00007f891f6334c3 in __GI_abort () at abort.c:79
0x00007f891f6333df in __assert_fail_base (fmt=0x7f891f7c3c20 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=assertion@entry=0x58560077b94b "loop() == pipe->producer", file=file@entry=0x58560077b935 "./src/lib/core/cbus.h", line=line@entry=224, function=function@entry=0x58560077b910 "void cpipe_push_input(cpipe*, cmsg*)") at assert.c:94
0x00007f891f644177 in __assert_fail (assertion=0x58560077b94b "loop() == pipe->producer", file=0x58560077b935 "./src/lib/core/cbus.h", line=224, function=0x58560077b910 "void cpipe_push_input(cpipe*, cmsg*)") at assert.c:103
0x0000585600328782 in cpipe_push_input (pipe=0x58563115dfc8, msg=0x58563115e028) at tarantool/src/lib/core/cbus.h:224
0x0000585600328802 in cpipe_push (pipe=0x58563115dfc8, msg=0x58563115e028) at tarantool/src/lib/core/cbus.h:241
0x000058560032a157 in tx_status_update (msg=0x58563115e028) at tarantool/src/box/relay.cc:629
0x0000585600468d4b in cmsg_deliver (msg=0x58563115e028) at tarantool/src/lib/core/cbus.c:553
0x0000585600469e50 in fiber_pool_f (ap=0x7f891e8129a8) at tarantool/src/lib/core/fiber_pool.c:64
0x00005856001be16a in fiber_cxx_invoke(fiber_func, typedef __va_list_tag __va_list_tag *) (f=0x585600469b82 <fiber_pool_f>, ap=0x7f891e8129a8) at tarantool/src/lib/core/fiber.h:1283
0x000058560045f495 in fiber_loop (data=0x0) at tarantool/src/lib/core/fiber.c:1085
0x0000585600745b8f in coro_init () at tarantool/third_party/coro/coro.c:108
```

Relevant frames:

```
0x0000585600328782 in cpipe_push_input (pipe=0x58563115dfc8, msg=0x58563115e028) at tarantool/src/lib/core/cbus.h:224
224             assert(loop() == pipe->producer);
(gdb)
0x0000585600328802 in cpipe_push (pipe=0x58563115dfc8, msg=0x58563115e028) at tarantool/src/lib/core/cbus.h:241
241             cpipe_push_input(pipe, msg);
(gdb)
0x000058560032a157 in tx_status_update (msg=0x58563115e028) at tarantool/src/box/relay.cc:629
629             cpipe_push(&status->relay->relay_pipe, msg);
(gdb)
0x0000585600468d4b in cmsg_deliver (msg=0x58563115e028) at tarantool/src/lib/core/cbus.c:553
553             msg->hop->f(msg);
(gdb)
0x0000585600469e50 in fiber_pool_f (ap=0x7f891e8129a8) at tarantool/src/lib/core/fiber_pool.c:64
64              cmsg_deliver(msg);
(gdb)
```

NO_DOC=internal
NO_TEST=internal
NO_CHANGELOG=internal
Apply the following patch:

```diff
diff --git a/src/box/relay.cc b/src/box/relay.cc
index 9c354a6f8a..dfde3dc9ea 100644
--- a/src/box/relay.cc
+++ b/src/box/relay.cc
@@ -604,6 +604,7 @@ tx_status_update(struct cmsg *msg)
 	 * has no result yet, need a PROMOTE.
 	 */
 	raft_process_term(box_raft(), status->term, ack.source);
+	ERROR_INJECT_YIELD(ERRINJ_RELAY_TX_STATUS_UPDATE_DELAY);
 	/*
 	 * Let pending synchronous transactions know, which of
 	 * them were successfully sent to the replica. Acks are
diff --git a/src/lib/core/errinj.h b/src/lib/core/errinj.h
index 420334a2e8..758918b96d 100644
--- a/src/lib/core/errinj.h
+++ b/src/lib/core/errinj.h
@@ -124,6 +124,7 @@ struct errinj {
 	_(ERRINJ_RELAY_REPORT_INTERVAL, ERRINJ_DOUBLE, {.dparam = 0}) \
 	_(ERRINJ_RELAY_SEND_DELAY, ERRINJ_BOOL, {.bparam = false}) \
 	_(ERRINJ_RELAY_TIMEOUT, ERRINJ_DOUBLE, {.dparam = 0}) \
+	_(ERRINJ_RELAY_TX_STATUS_UPDATE_DELAY, ERRINJ_BOOL, {.bparam = false}) \
 	_(ERRINJ_RELAY_WAL_START_DELAY, ERRINJ_BOOL, {.bparam = false}) \
 	_(ERRINJ_RELAY_READ_ACK_DELAY, ERRINJ_BOOL, {.bparam = false}) \
 	_(ERRINJ_REPLICASET_VCLOCK, ERRINJ_BOOL, {.bparam = false}) \
```
Now start two instances.

Instance 1:

```
tarantool> box.cfg{listen = 3302, replication = {3301, 3302}}
```

Instance 2:

```
tarantool> box.cfg{listen = 3301, replication = {3301, 3302}}
```

Instance 1:

```
tarantool> box.error.injection.set('ERRINJ_RELAY_TX_STATUS_UPDATE_DELAY', true)
```

Stop instance 2. Then on instance 1:

```
tarantool> box.info.replication
---
- 1:
    id: 1
    uuid: 1c897555-8174-4aa1-b4bd-cfdefd2872f0
    lsn: 2
    name: null
  2:
    id: 2
    uuid: 1ba5fefb-5d25-4a52-84b9-841d97732379
    lsn: 0
    upstream:
      peer: '3301'
      lag: 0.00019454956054688
      status: disconnected
      idle: 4.8211434110017
      message: 'connect, called on fd 23, aka 127.0.0.1:44838: Connection refused'
      system_message: Connection refused
    name: null
    downstream:
      status: stopped
      message: 'unexpected EOF when reading from socket, called on fd 23, aka 127.0.0.1:3302,
        peer of 127.0.0.1:40170: Broken pipe'
      system_message: Broken pipe
...

tarantool> box.error.injection.set('ERRINJ_RELAY_TX_STATUS_UPDATE_DELAY', false)
2025-02-15 03:00:45.767 [263420] main/104/interactive error.cc:391 I> ERRINJ_RELAY_TX_STATUS_UPDATE_DELAY = false
---
- ok
...

tarantool> tarantool: src/lib/core/cbus.h:224: void cpipe_push(struct cpipe *, struct cmsg *): Assertion `pipe->producer == NULL || pipe->producer == loop()' failed.
zsh: IOT instruction (core dumped)  ./src/tarantool -i
```
This happens because the TX thread may try to send a message to a dead relay. This is the last place in the relay code that is not protected: `tx_status_update`.
@funbringer, huge thanks again for the fix 💖, and now we have a reliable test, so we will apply your patch! Picodata could also take the test.
In the relay there's a message which travels between the relay and tx threads and is used to deliver the relay's vclock to the tx thread. However, while the message is in the tx thread, the relay thread may die, and tx may try to send the message back without checking whether the thread is still alive. Let's check the `is_paired` flag, which is set to `false` when the relay thread dies, before sending the cbus message.

Closes tarantool#9920

NO_DOC=bugfix
(cherry picked from commit be6fd7d)
There is this patch from picodata: https://git.picodata.io/picodata/tarantool/-/merge_requests/160/diffs?commit_id=78da447d444adfd118c51bec88f6f269ca0d0a0f
It is claimed to fix a crash. I assume it targets a potential theoretical bug: during the tx_endpoint unpairing, the relay might receive a status response from tx, which triggers sending the message back after the relay is already dead.
I was only able to reproduce it with a code hack:
And then this script:
It seems we should not push the message when the endpoint is being unpaired. But I can't see how it could crash in real life without my unrealistic change in the code.
`tx_idle` will always be zero when the message has just arrived and there were no yields yet. The `send_vclock` check wouldn't pass either: by the time of unpairing, the relay isn't sending rows anymore. And the `vclock_sync` doesn't seem to be related either (don't remember why).

Other similar message handlers do filter out the relay being unpaired, for some reason. Why doesn't this place do the same? How can this crash be caught normally?