8000 Relay crash. Tx may send dead relay msg from `tx_status_update` · Issue #9920 · tarantool/tarantool · GitHub

Relay crash. Tx may send dead relay msg from tx_status_update #9920


Closed

Gerold103 opened this issue Apr 8, 2024 · 4 comments · Fixed by #11322
Labels: 2.11 (Target is 2.11 and all newer release/master branches), bug (Something isn't working), crash, replication

Comments

@Gerold103
Collaborator

There is this patch from picodata: https://git.picodata.io/picodata/tarantool/-/merge_requests/160/diffs?commit_id=78da447d444adfd118c51bec88f6f269ca0d0a0f

diff --git a/src/box/relay.cc b/src/box/relay.cc
index 7450fa79c..39ee4a7d5 100644
--- a/src/box/relay.cc
+++ b/src/box/relay.cc
@@ -623,11 +623,16 @@ tx_status_update(struct cmsg *msg)
 	}
 	trigger_run(&replicaset.on_ack, &ack);
 
-	static const struct cmsg_hop route[] = {
-		{relay_status_update, NULL}
-	};
-	cmsg_init(msg, route);
-	cpipe_push(&status->relay->relay_pipe, msg);
+	if (status->relay->tx.is_paired) {
+		static const struct cmsg_hop route[] = {
+			{relay_status_update, NULL}
+		};
+		cmsg_init(msg, route);
+		cpipe_push(&status->relay->relay_pipe, msg);
+	} else {
+		assert(msg->hop->pipe == NULL);
+		cmsg_init(msg, NULL);
+	}
 }
 
 /**

It is claimed to fix a crash. I assume it targets a potential theoretical bug: during the tx_endpoint unpair, the relay might still receive a status response from tx, which would trigger sending the message back to the relay after the relay is already dead.

I was only able to reproduce it with a code hack:

diff --git a/src/box/relay.cc b/src/box/relay.cc
index 3f3a8acb5..828bc8670 100644
--- a/src/box/relay.cc
+++ b/src/box/relay.cc
@@ -937,11 +937,11 @@ relay_check_status_needs_update(struct relay *relay)
 	/* Collect xlog files received by the replica. */
 	relay_schedule_pending_gc(relay, send_vclock);
 
-	double tx_idle = ev_monotonic_now(loop()) - relay->tx_seen_time;
-	if (vclock_sum(&status_msg->vclock) ==
-	    vclock_sum(send_vclock) && tx_idle <= replication_timeout &&
-	    status_msg->vclock_sync == last_recv_ack->vclock_sync)
-		return;
+	// double tx_idle = ev_monotonic_now(loop()) - relay->tx_seen_time;
+	// if (vclock_sum(&status_msg->vclock) ==
+	//     vclock_sum(send_vclock) && tx_idle <= replication_timeout &&
+	//     status_msg->vclock_sync == last_recv_ack->vclock_sync)
+	// 	return;
 	static const struct cmsg_hop route[] = {
 		{tx_status_update, NULL}
 	};

And then this script:

--
-- Instance 1
--
-- Step 1
--
box.cfg{
    listen = 3313,
    replication_timeout = 0.1,
}
box.schema.user.grant('guest', 'super')

--
-- Step 3
--
box.error.injection.set("ERRINJ_RELAY_FROM_TX_DELAY", true)



--
-- Instance 2
--
-- Step 2
--
box.cfg{
    listen = 3314,
    replication = 3313,
    replication_timeout = 0.1,
}

It seems we should not push the message when the endpoint is being unpaired. But I can't see how it could crash in real life without my unrealistic change in the code. tx_idle will always be zero when the message has just arrived and there were no yields yet. The send_vclock check wouldn't pass either: by the time of unpairing, the relay isn't sending rows anymore. And vclock_sync doesn't seem to be related either (I don't remember why).

Other similar message handlers do filter out an unpaired relay, for some reason. Why doesn't this place do the same? How can this crash be caught normally?

@CuriousGeorgiy
Member
CuriousGeorgiy commented Aug 25, 2024

@Gerold103 @Serpentian I accidentally discovered a crash in the relay when running many instances of the replication-luatest/gh_9918_synchro_queue_additional_info_test.lua test in parallel:

(lldb) bt
* thread #1
  * frame #0: 0x00007ff809edadb6 libsystem_kernel.dylib`__pthread_kill + 10
    frame #1: 0x00007ff809f13ebd libsystem_pthread.dylib`pthread_kill + 262
    frame #2: 0x00007ff809e39a79 libsystem_c.dylib`abort + 126
    frame #3: 0x00007ff809e38d68 libsystem_c.dylib`__assert_rtn + 314
    frame #4: 0x0000000105b7f896 tarantool`cpipe_push_input(pipe=0x00007fee55011b68, msg=0x00007fee55011bc8) at cbus.h:224:2
    frame #5: 0x0000000105b7f5dd tarantool`cpipe_push(pipe=0x00007fee55011b68, msg=0x00007fee55011bc8) at cbus.h:241:2
    frame #6: 0x0000000105b7f58f tarantool`tx_status_update(msg=0x00007fee55011bc8) at relay.cc:640:2
    frame #7: 0x0000000105ce7acd tarantool`cmsg_deliver(msg=0x00007fee55011bc8) at cbus.c:350:2
    frame #8: 0x0000000105ce9be4 tarantool`fiber_pool_f(ap=0x0000000107c12228) at fiber_pool.c:65:3
    frame #9: 0x00000001059d7eea tarantool`fiber_cxx_invoke(f=(tarantool`fiber_pool_f at fiber_pool.c:38), ap=0x0000000107c12228) at fiber.h:1324:10
    frame #10: 0x0000000105cdbd51 tarantool`fiber_loop(data=0x0000000000000000) at fiber.c:1162:18
    frame #11: 0x00000001060eea2e tarantool`coro_init at coro.c:108:3

Coredump: https://drive.google.com/file/d/1DnA6vps_V3TbD93_YmDNxA27D_KpGqwP/view?usp=sharing
Tarantool binary (macOS): https://drive.google.com/file/d/1kGMkg4YUIM8hgbJRstrbtNuMU_4nrpR0/view?usp=sharing

Maybe it is a different bug; if so, please open a new issue.

@Gerold103
Collaborator Author

Nope, looks like this bug. Very interesting how it managed to happen.

picodata-account pushed a commit to picodata/tarantool that referenced this issue Dec 5, 2024
The bug was discovered during the development of the quorum promote feature
-- basically, it causes qpromote_several_outstanding_promotes_test.lua
to fail spontaneously. Furthermore, we suspect that this is the underlying
cause of lost heartbeats and subsequent severe replication lags.

Long story short, whenever we want to send a message from the TX thread
back to a relay thread, we should first check if they are still
connected. Otherwise, we'll see
  * An assertion failure in debug, or
  * (Presumably) a relay hangup in release due to
    `if (status_msg->msg.route != NULL) return;`
    in relay_status_update() -> relay_check_status_needs_update().

The upstream is aware of this issue:
tarantool#9920

Backtrace:

```
__pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
0x00007f891f6a5463 in __pthread_kill_internal (threadid=<optimized out>, signo=6) at pthread_kill.c:78
0x00007f891f64c120 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
0x00007f891f6334c3 in __GI_abort () at abort.c:79
0x00007f891f6333df in __assert_fail_base (fmt=0x7f891f7c3c20 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
    assertion=assertion@entry=0x58560077b94b "loop() == pipe->producer", file=file@entry=0x58560077b935 "./src/lib/core/cbus.h", line=line@entry=224,
    function=function@entry=0x58560077b910 "void cpipe_push_input(cpipe*, cmsg*)") at assert.c:94
0x00007f891f644177 in __assert_fail (assertion=0x58560077b94b "loop() == pipe->producer", file=0x58560077b935 "./src/lib/core/cbus.h", line=224,
    function=0x58560077b910 "void cpipe_push_input(cpipe*, cmsg*)") at assert.c:103
0x0000585600328782 in cpipe_push_input (pipe=0x58563115dfc8, msg=0x58563115e028) at tarantool/src/lib/core/cbus.h:224
0x0000585600328802 in cpipe_push (pipe=0x58563115dfc8, msg=0x58563115e028) at tarantool/src/lib/core/cbus.h:241
0x000058560032a157 in tx_status_update (msg=0x58563115e028) at tarantool/src/box/relay.cc:629
0x0000585600468d4b in cmsg_deliver (msg=0x58563115e028) at tarantool/src/lib/core/cbus.c:553
0x0000585600469e50 in fiber_pool_f (ap=0x7f891e8129a8) at tarantool/src/lib/core/fiber_pool.c:64
0x00005856001be16a in fiber_cxx_invoke(fiber_func, typedef __va_list_tag __va_list_tag *) (f=0x585600469b82 <fiber_pool_f>, ap=0x7f891e8129a8)
    at tarantool/src/lib/core/fiber.h:1283
0x000058560045f495 in fiber_loop (data=0x0) at tarantool/src/lib/core/fiber.c:1085
0x0000585600745b8f in coro_init () at tarantool/third_party/coro/coro.c:108
```

Relevant frames:

```
0x0000585600328782 in cpipe_push_input (pipe=0x58563115dfc8, msg=0x58563115e028) at tarantool/src/lib/core/cbus.h:224
224             assert(loop() == pipe->producer);
(gdb)
0x0000585600328802 in cpipe_push (pipe=0x58563115dfc8, msg=0x58563115e028) at tarantool/src/lib/core/cbus.h:241
241             cpipe_push_input(pipe, msg);
(gdb)
0x000058560032a157 in tx_status_update (msg=0x58563115e028) at tarantool/src/box/relay.cc:629
629             cpipe_push(&status->relay->relay_pipe, msg);
(gdb)
0x0000585600468d4b in cmsg_deliver (msg=0x58563115e028) at tarantool/src/lib/core/cbus.c:553
553             msg->hop->f(msg);
(gdb)
0x0000585600469e50 in fiber_pool_f (ap=0x7f891e8129a8) at tarantool/src/lib/core/fiber_pool.c:64
64              cmsg_deliver(msg);
(gdb)
```

NO_DOC=internal
NO_TEST=internal
NO_CHANGELOG=internal
picodata-account pushed further commits to picodata/tarantool that referenced this issue on Dec 11, 2024 (same commit message as above).
@sergepetrenko sergepetrenko moved this to Q2 2025 – Apr-Jun in Tarantool Public Roadmap Dec 25, 2024
@Serpentian
Contributor
Serpentian commented Feb 15, 2025

Apply the following patch:


diff --git a/src/box/relay.cc b/src/box/relay.cc
index 9c354a6f8a..dfde3dc9ea 100644
--- a/src/box/relay.cc
+++ b/src/box/relay.cc
@@ -604,6 +604,7 @@ tx_status_update(struct cmsg *msg)
 	 * has no result yet, need a PROMOTE.
 	 */
 	raft_process_term(box_raft(), status->term, ack.source);
+	ERROR_INJECT_YIELD(ERRINJ_RELAY_TX_STATUS_UPDATE_DELAY);
 	/*
 	 * Let pending synchronous transactions know, which of
 	 * them were successfully sent to the replica. Acks are
diff --git a/src/lib/core/errinj.h b/src/lib/core/errinj.h
index 420334a2e8..758918b96d 100644
--- a/src/lib/core/errinj.h
+++ b/src/lib/core/errinj.h
@@ -124,6 +124,7 @@ struct errinj {
 	_(ERRINJ_RELAY_REPORT_INTERVAL, ERRINJ_DOUBLE, {.dparam = 0}) \
 	_(ERRINJ_RELAY_SEND_DELAY, ERRINJ_BOOL, {.bparam = false}) \
 	_(ERRINJ_RELAY_TIMEOUT, ERRINJ_DOUBLE, {.dparam = 0}) \
+	_(ERRINJ_RELAY_TX_STATUS_UPDATE_DELAY, ERRINJ_BOOL, {.bparam = false}) \
 	_(ERRINJ_RELAY_WAL_START_DELAY, ERRINJ_BOOL, {.bparam = false}) \
 	_(ERRINJ_RELAY_READ_ACK_DELAY, ERRINJ_BOOL, {.bparam = false}) \
 	_(ERRINJ_REPLICASET_VCLOCK, ERRINJ_BOOL, {.bparam = false}) \

Now start two instances:

-- Instance 1:
tarantool> box.cfg{listen = 3302, replication = {3301, 3302}}
-- Instance 2:
tarantool> box.cfg{listen = 3301, replication = {3301, 3302}}
-- Instance 1:
tarantool> box.error.injection.set('ERRINJ_RELAY_TX_STATUS_UPDATE_DELAY', true)
-- Stop instance 2. Instance 1:
tarantool> box.info.replication
---
- 1:
    id: 1
    uuid: 1c897555-8174-4aa1-b4bd-cfdefd2872f0
    lsn: 2
    name: null
  2:
    id: 2
    uuid: 1ba5fefb-5d25-4a52-84b9-841d97732379
    lsn: 0
    upstream:
      peer: '3301'
      lag: 0.00019454956054688
      status: disconnected
      idle: 4.8211434110017
      message: 'connect, called on fd 23, aka 127.0.0.1:44838: Connection refused'
      system_message: Connection refused
    name: null
    downstream:
      status: stopped
      message: 'unexpected EOF when reading from socket, called on fd 23, aka 127.0.0.1:3302,
        peer of 127.0.0.1:40170: Broken pipe'
      system_message: Broken pipe
...
tarantool> box.error.injection.set('ERRINJ_RELAY_TX_STATUS_UPDATE_DELAY', false)
2025-02-15 03:00:45.767 [263420] main/104/interactive error.cc:391 I> ERRINJ_RELAY_TX_STATUS_UPDATE_DELAY = false
---
- ok
...

```
tarantool> tarantool: src/lib/core/cbus.h:224: void cpipe_push(struct cpipe *, struct cmsg *): Assertion `pipe->producer == NULL || pipe->producer == loop()' failed.
zsh: IOT instruction (core dumped)  ./src/tarantool -i
(gdb) bt
#0  0x00007ffff4ca4d7c in __pthread_kill_implementation ()
   from /nix/store/9y8pmvk8gdwwznmkzxa6pwyah52xy3nk-glibc-2.38-27/lib/libc.so.6
#1  0x00007ffff4c559c6 in raise ()
   from /nix/store/9y8pmvk8gdwwznmkzxa6pwyah52xy3nk-glibc-2.38-27/lib/libc.so.6
#2  0x00007ffff4c3e8fa in abort ()
   from /nix/store/9y8pmvk8gdwwznmkzxa6pwyah52xy3nk-glibc-2.38-27/lib/libc.so.6
#3  0x00007ffff4c3e819 in __assert_fail_base.cold ()
   from /nix/store/9y8pmvk8gdwwznmkzxa6pwyah52xy3nk-glibc-2.38-27/lib/libc.so.6
#4  0x00007ffff4c4e3c6 in __assert_fail ()
   from /nix/store/9y8pmvk8gdwwznmkzxa6pwyah52xy3nk-glibc-2.38-27/lib/libc.so.6
#5  0x000055555574e818 in cpipe_push (pipe=0x5555563002b8, msg=0x555556300318)
    at /home/serpentian/Programming/tnt/tarantool/src/lib/core/cbus.h:224
#6  0x000055555574e711 in tx_status_update (msg=0x555556300318)
    at /home/serpentian/Programming/tnt/tarantool/src/box/relay.cc:625
#7  0x00005555558acafd in cmsg_deliver (msg=0x555556300318)
    at /home/serpentian/Programming/tnt/tarantool/src/lib/core/cbus.c:360
#8  0x00005555558aec44 in fiber_pool_f (ap=0x7fffe6811718)
    at /home/serpentian/Programming/tnt/tarantool/src/lib/core/fiber_pool.c:65
#9  0x00005555555a1fea in fiber_cxx_invoke (f=0x5555558ae870 <fiber_pool_f>, 
    ap=0x7fffe6811718) at lib/core/fiber.h:1324
#10 0x00005555558a09e5 in fiber_loop (data=0x0)
    at /home/serpentian/Programming/tnt/tarantool/src/lib/core/fiber.c:1167
#11 0x0000555555cdc4a6 in coro_init ()
    at /home/serpentian/Programming/tnt/tarantool/third_party/coro/coro.c:108
```

This happens because the TX thread may try to send a message to a dead relay. This is the last place in the relay code that is not protected by `tx.is_paired`, so the patch from picodata is correct. However, I haven't managed to write a 100% reliable reproducer for it yet.

@Serpentian Serpentian changed the title Relay crash? Relay crash. Tx may send dead relay msg from tx_status_update Feb 15, 2025
@Gerold103
Collaborator Author

@funbringer, huge thanks again for the fix 💖, and now we have a reliable test, so we will apply your patch! Picodata could also take the test.

Serpentian added a commit to Serpentian/tarantool that referenced this issue Apr 1, 2025
In relay there is a message that travels between the relay and tx
threads and is used to deliver the relay's vclock to the tx thread.

However, while the message is in the tx thread, the relay thread may
die, and tx may try to send the message back without checking whether
the thread is still alive.

Let's check the `is_paired` flag, which is set to `false` when the
relay thread dies, before sending the cbus message.

Closes tarantool#9920

NO_DOC=bugfix
@sergepetrenko sergepetrenko added the 2.11 Target is 2.11 and all newer release/master branches label Apr 3, 2025
github-actions bot pushed a commit that referenced this issue Apr 3, 2025 (cherry picked from commit be6fd7d)
github-actions bot pushed a commit that referenced this issue Apr 3, 2025 (cherry picked from commit be6fd7d)
github-actions bot pushed a commit that referenced this issue Apr 3, 2025 (cherry picked from commit be6fd7d)
sergepetrenko pushed a commit that referenced this issue Apr 3, 2025 (cherry picked from commit be6fd7d)
sergepetrenko pushed a commit that referenced this issue Apr 3, 2025 (cherry picked from commit be6fd7d)
sergepetrenko pushed a commit that referenced this issue Apr 3, 2025 (cherry picked from commit be6fd7d)