8000 Relay crash. Tx may send dead relay msg from `tx_status_update` · Issue #9920 · tarantool/tarantool · GitHub

Relay crash. Tx may send dead relay msg from tx_status_update #9920


Closed

Gerold103 opened this issue Apr 8, 2024 · 4 comments · Fixed by #11322
Labels: 2.11 (Target is 2.11 and all newer release/master branches), bug (Something isn't working), crash, replication

Comments

@Gerold103
Collaborator

There is this patch from picodata: https://git.picodata.io/picodata/tarantool/-/merge_requests/160/diffs?commit_id=78da447d444adfd118c51bec88f6f269ca0d0a0f

diff --git a/src/box/relay.cc b/src/box/relay.cc
index 7450fa79c..39ee4a7d5 100644
--- a/src/box/relay.cc
+++ b/src/box/relay.cc
@@ -623,11 +623,16 @@ tx_status_update(struct cmsg *msg)
 	}
 	trigger_run(&replicaset.on_ack, &ack);
 
-	static const struct cmsg_hop route[] = {
-		{relay_status_update, NULL}
-	};
-	cmsg_init(msg, route);
-	cpipe_push(&status->relay->relay_pipe, msg);
+	if (status->relay->tx.is_paired) {
+		static const struct cmsg_hop route[] = {
+			{relay_status_update, NULL}
+		};
+		cmsg_init(msg, route);
+		cpipe_push(&status->relay->relay_pipe, msg);
+	} else {
+		assert(msg->hop->pipe == NULL);
+		cmsg_init(msg, NULL);
+	}
 }
 
 /**

It is claimed to fix a crash. I assume it targets a potential theoretical bug: during the tx_endpoint unpair, the relay might still receive a status response from tx, which would trigger sending the message back to the relay after the relay is already dead.

I was only able to reproduce it with a code hack:

diff --git a/src/box/relay.cc b/src/box/relay.cc
index 3f3a8acb5..828bc8670 100644
--- a/src/box/relay.cc
+++ b/src/box/relay.cc
@@ -937,11 +937,11 @@ relay_check_status_needs_update(struct relay *relay)
 	/* Collect xlog files received by the replica. */
 	relay_schedule_pending_gc(relay, send_vclock);
 
-	double tx_idle = ev_monotonic_now(loop()) - relay->tx_seen_time;
-	if (vclock_sum(&status_msg->vclock) ==
-	    vclock_sum(send_vclock) && tx_idle <= replication_timeout &&
-	    status_msg->vclock_sync == last_recv_ack->vclock_sync)
-		return;
+	// double tx_idle = ev_monotonic_now(loop()) - relay->tx_seen_time;
+	// if (vclock_sum(&status_msg->vclock) ==
+	//     vclock_sum(send_vclock) && tx_idle <= replication_timeout &&
+	//     status_msg->vclock_sync == last_recv_ack->vclock_sync)
+	// 	return;
 	static const struct cmsg_hop route[] = {
 		{tx_status_update, NULL}
 	};

And then this script:

--
-- Instance 1
--
-- Step 1
--
box.cfg{
    listen = 3313,
    replication_timeout = 0.1,
}
box.schema.user.grant('guest', 'super')

--
-- Step 3
--
box.error.injection.set("ERRINJ_RELAY_FROM_TX_DELAY", true)



--
-- Instance 2
--
-- Step 2
--
box.cfg{
    listen = 3314,
    replication = 3313,
    replication_timeout = 0.1,
}

It seems we should not push the message when the endpoint is being unpaired. But I can't see how it could crash in real life without my unrealistic change in the code. tx_idle will always be zero when the message has just arrived and there were no yields yet. The send_vclock check wouldn't pass either: by the time of unpairing, the relay isn't sending rows anymore. And vclock_sync doesn't seem to be related either (I don't remember why).

Other similar message handlers do filter out an unpaired relay, for some reason. Why doesn't this place do the same? How can this crash be caught normally?

@CuriousGeorgiy
Member
CuriousGeorgiy commented Aug 25, 2024

@Gerold103 @Serpentian I accidentally discovered a crash in the relay when running many instances of the replication-luatest/gh_9918_synchro_queue_additional_info_test.lua test in parallel:

(lldb) bt
* thread #1
  * frame #0: 0x00007ff809edadb6 libsystem_kernel.dylib`__pthread_kill + 10
    frame #1: 0x00007ff809f13ebd libsystem_pthread.dylib`pthread_kill + 262
    frame #2: 0x00007ff809e39a79 libsystem_c.dylib`abort + 126
    frame #3: 0x00007ff809e38d68 libsystem_c.dylib`__assert_rtn + 314
    frame #4: 0x0000000105b7f896 tarantool`cpipe_push_input(pipe=0x00007fee55011b68, msg=0x00007fee55011bc8) at cbus.h:224:2
    frame #5: 0x0000000105b7f5dd tarantool`cpipe_push(pipe=0x00007fee55011b68, msg=0x00007fee55011bc8) at cbus.h:241:2
    frame #6: 0x0000000105b7f58f tarantool`tx_status_update(msg=0x00007fee55011bc8) at relay.cc:640:2
    frame #7: 0x0000000105ce7acd tarantool`cmsg_deliver(msg=0x00007fee55011bc8) at cbus.c:350:2
    frame #8: 0x0000000105ce9be4 tarantool`fiber_pool_f(ap=0x0000000107c12228) at fiber_pool.c:65:3
    frame #9: 0x00000001059d7eea tarantool`fiber_cxx_invoke(f=(tarantool`fiber_pool_f at fiber_pool.c:38), ap=0x0000000107c12228) at fiber.h:1324:10
    frame #10: 0x0000000105cdbd51 tarantool`fiber_loop(data=0x0000000000000000) at fiber.c:1162:18
    frame #11: 0x00000001060eea2e tarantool`coro_init at coro.c:108:3

Coredump: https://drive.google.com/file/d/1DnA6vps_V3TbD93_YmDNxA27D_KpGqwP/view?usp=sharing
Tarantool binary (macOS): https://drive.google.com/file/d/1kGMkg4YUIM8hgbJRstrbtNuMU_4nrpR0/view?usp=sharing

Maybe it is a different bug; if so, please open a new issue.

@Gerold103
Collaborator Author

Nope, looks like this bug. Very interesting how it managed to happen.

picodata-account pushed a commit to picodata/tarantool that referenced this issue Dec 5, 2024
The bug was discovered during the development of the quorum promote feature
-- basically, it causes qpromote_several_outstanding_promotes_test.lua
to fail spontaneously. Furthermore, we suspect that this is the underlying
cause of lost heartbeats and subsequent severe replication lags.

Long story short, whenever we want to send a message from the TX thread
back to a relay thread, we should first check if they are still
connected. Otherwise, we'll see
  * An assertion failure in debug, or
  * (Presumably) a relay hangup in release due to
    `if (status_msg->msg.route != NULL) return;`
    in relay_status_update() -> relay_check_status_needs_update().

The upstream is aware of this issue:
tarantool#9920

Backtrace:

```
__pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
0x00007f891f6a5463 in __pthread_kill_internal (threadid=<optimized out>, signo=6) at pthread_kill.c:78
0x00007f891f64c120 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
0x00007f891f6334c3 in __GI_abort () at abort.c:79
0x00007f891f6333df in __assert_fail_base (fmt=0x7f891f7c3c20 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
    assertion=assertion@entry=0x58560077b94b "loop() == pipe->producer", file=file@entry=0x58560077b935 "./src/lib/core/cbus.h", line=line@entry=224,
    function=function@entry=0x58560077b910 "void cpipe_push_input(cpipe*, cmsg*)") at assert.c:94
0x00007f891f644177 in __assert_fail (assertion=0x58560077b94b "loop() == pipe->producer", file=0x58560077b935 "./src/lib/core/cbus.h", line=224,
    function=0x58560077b910 "void cpipe_push_input(cpipe*, cmsg*)") at assert.c:103
0x0000585600328782 in cpipe_push_input (pipe=0x58563115dfc8, msg=0x58563115e028) at tarantool/src/lib/core/cbus.h:224
0x0000585600328802 in cpipe_push (pipe=0x58563115dfc8, msg=0x58563115e028) at tarantool/src/lib/core/cbus.h:241
0x000058560032a157 in tx_status_update (msg=0x58563115e028) at tarantool/src/box/relay.cc:629
0x0000585600468d4b in cmsg_deliver (msg=0x58563115e028) at tarantool/src/lib/core/cbus.c:553
0x0000585600469e50 in fiber_pool_f (ap=0x7f891e8129a8) at tarantool/src/lib/core/fiber_pool.c:64
0x00005856001be16a in fiber_cxx_invoke(fiber_func, typedef __va_list_tag __va_list_tag *) (f=0x585600469b82 <fiber_pool_f>, ap=0x7f891e8129a8)
    at tarantool/src/lib/core/fiber.h:1283
0x000058560045f495 in fiber_loop (data=0x0) at tarantool/src/lib/core/fiber.c:1085
0x0000585600745b8f in coro_init () at tarantool/third_party/coro/coro.c:108
```

Relevant frames:

```
0x0000585600328782 in cpipe_push_input (pipe=0x58563115dfc8, msg=0x58563115e028) at tarantool/src/lib/core/cbus.h:224
224             assert(loop() == pipe->producer);
(gdb)
0x0000585600328802 in cpipe_push (pipe=0x58563115dfc8, msg=0x58563115e028) at tarantool/src/lib/core/cbus.h:241
241             cpipe_push_input(pipe, msg);
(gdb)
0x000058560032a157 in tx_status_update (msg=0x58563115e028) at tarantool/src/box/relay.cc:629
629             cpipe_push(&status->relay->relay_pipe, msg);
(gdb)
0x0000585600468d4b in cmsg_deliver (msg=0x58563115e028) at tarantool/src/lib/core/cbus.c:553
553             msg->hop->f(msg);
(gdb)
0x0000585600469e50 in fiber_pool_f (ap=0x7f891e8129a8) at tarantool/src/lib/core/fiber_pool.c:64
64              cmsg_deliver(msg);
(gdb)
```

NO_DOC=internal
NO_TEST=internal
NO_CHANGELOG=internal
picodata-account pushed further commits to picodata/tarantool that referenced this issue on Dec 11, 2024 (same commit message as above).
@sergepetrenko sergepetrenko moved this to Q2 2025 – Apr-Jun in Tarantool Public Roadmap Dec 25, 2024
@Serpentian
Contributor
Serpentian commented Feb 15, 2025

Apply the following patch:


diff --git a/src/box/relay.cc b/src/box/relay.cc
index 9c354a6f8a..dfde3dc9ea 100644
--- a/src/box/relay.cc
+++ b/src/box/relay.cc
@@ -604,6 +604,7 @@ tx_status_update(struct cmsg *msg)
 	 * has no result yet, need a PROMOTE.
 	 */
 	raft_process_term(box_raft(), status->term, ack.source);
+	ERROR_INJECT_YIELD(ERRINJ_RELAY_TX_STATUS_UPDATE_DELAY);
 	/*
 	 * Let pending synchronous transactions know, which of
 	 * them were successfully sent to the replica. Acks are
diff --git a/src/lib/core/errinj.h b/src/lib/core/errinj.h
index 420334a2e8..758918b96d 100644
--- a/src/lib/core/errinj.h
+++ b/src/lib/core/errinj.h
@@ -124,6 +124,7 @@ struct errinj {
 	_(ERRINJ_RELAY_REPORT_INTERVAL, ERRINJ_DOUBLE, {.dparam = 0}) \
 	_(ERRINJ_RELAY_SEND_DELAY, ERRINJ_BOOL, {.bparam = false}) \
 	_(ERRINJ_RELAY_TIMEOUT, ERRINJ_DOUBLE, {.dparam = 0}) \
+	_(ERRINJ_RELAY_TX_STATUS_UPDATE_DELAY, ERRINJ_BOOL, {.bparam = false}) \
 	_(ERRINJ_RELAY_WAL_START_DELAY, ERRINJ_BOOL, {.bparam = false}) \
 	_(ERRINJ_RELAY_READ_ACK_DELAY, ERRINJ_BOOL, {.bparam = false}) \
 	_(ERRINJ_REPLICASET_VCLOCK, ERRINJ_BOOL, {.bparam = false}) \

Now start two instances:

-- Instance 1:
tarantool> box.cfg{listen = 3302, replication = {3301, 3302}}
-- Instance 2:
tarantool> box.cfg{listen = 3301, replication = {3301, 3302}}
-- Instance 1:
tarantool> box.error.injection.set('ERRINJ_RELAY_TX_STATUS_UPDATE_DELAY', true)
-- Stop instance 2. Instance 1:
tarantool> box.info.replication
---
- 1:
    id: 1
    uuid: 1c897555-8174-4aa1-b4bd-cfdefd2872f0
    lsn: 2
    name: null
  2:
    id: 2
    uuid: 1ba5fefb-5d25-4a52-84b9-841d97732379
    lsn: 0
    upstream:
      peer: '3301'
      lag: 0.00019454956054688
      status: disconnected
      idle: 4.8211434110017
      message: 'connect, called on fd 23, aka 127.0.0.1:44838: Connection refused'
      system_message: Connection refused
    name: null
    downstream:
      status: stopped
      message: 'unexpected EOF when reading from socket, called on fd 23, aka 127.0.0.1:3302,
        peer of 127.0.0.1:40170: Broken pipe'
      system_message: Broken pipe
...
tarantool> box.error.injection.set('ERRINJ_RELAY_TX_STATUS_UPDATE_DELAY', false)
2025-02-15 03:00:45.767 [263420] main/104/interactive error.cc:391 I> ERRINJ_RELAY_TX_STATUS_UPDATE_DELAY = false
---
- ok
...

```
tarantool> tarantool: src/lib/core/cbus.h:224: void cpipe_push(struct cpipe *, struct cmsg *): Assertion `pipe->producer == NULL || pipe->producer == loop()' failed.
zsh: IOT instruction (core dumped)  ./src/tarantool -i
(gdb) bt
#0  0x00007ffff4ca4d7c in __pthread_kill_implementation ()
   from /nix/store/9y8pmvk8gdwwznmkzxa6pwyah52xy3nk-glibc-2.38-27/lib/libc.so.6
#1  0x00007ffff4c559c6 in raise ()
   from /nix/store/9y8pmvk8gdwwznmkzxa6pwyah52xy3nk-glibc-2.38-27/lib/libc.so.6
#2  0x00007ffff4c3e8fa in abort ()
   from /nix/store/9y8pmvk8gdwwznmkzxa6pwyah52xy3nk-glibc-2.38-27/lib/libc.so.6
#3  0x00007ffff4c3e819 in __assert_fail_base.cold ()
   from /nix/store/9y8pmvk8gdwwznmkzxa6pwyah52xy3nk-glibc-2.38-27/lib/libc.so.6
#4  0x00007ffff4c4e3c6 in __assert_fail ()
   from /nix/store/9y8pmvk8gdwwznmkzxa6pwyah52xy3nk-glibc-2.38-27/lib/libc.so.6
#5  0x000055555574e818 in cpipe_push (pipe=0x5555563002b8, msg=0x555556300318)
    at /home/serpentian/Programming/tnt/tarantool/src/lib/core/cbus.h:224
#6  0x000055555574e711 in tx_status_update (msg=0x555556300318)
    at /home/serpentian/Programming/tnt/tarantool/src/box/relay.cc:625
#7  0x00005555558acafd in cmsg_deliver (msg=0x555556300318)
    at /home/serpentian/Programming/tnt/tarantool/src/lib/core/cbus.c:360
#8  0x00005555558aec44 in fiber_pool_f (ap=0x7fffe6811718)
    at /home/serpentian/Programming/tnt/tarantool/src/lib/core/fiber_pool.c:65
#9  0x00005555555a1fea in fiber_cxx_invoke (f=0x5555558ae870 <fiber_pool_f>, 
    ap=0x7fffe6811718) at lib/core/fiber.h:1324
#10 0x00005555558a09e5 in fiber_loop (data=0x0)
    at /home/serpentian/Programming/tnt/tarantool/src/lib/core/fiber.c:1167
#11 0x0000555555cdc4a6 in coro_init ()
    at /home/serpentian/Programming/tnt/tarantool/third_party/coro/coro.c:108
```

This happens because the TX thread may try to send a message to a dead relay. This is the last place in the relay code that is not protected by `tx.is_paired`, so the patch from picodata is correct. However, I haven't managed to write a 100% reliable reproducer for it yet.

@Serpentian Serpentian changed the title Relay crash? Relay crash. Tx may send dead relay msg from tx_status_update Feb 15, 2025
@Gerold103
Collaborator Author

@funbringer, huge thanks again for the fix 💖, and now we have a reliable test, so we will apply your patch! Picodata could also take the test.

Serpentian added a commit to Serpentian/tarantool that referenced this issue Apr 1, 2025
In relay there is a message that travels between the relay and tx
threads and is used to deliver the relay's vclock to the tx thread.

However, while the message is in the tx thread, the relay thread may
die, and tx may try to send the message back without checking whether
the thread is still alive.

Let's check the `is_paired` flag, which is set to `false` when the
relay thread dies, before sending the cbus message.

Closes tarantool#9920

NO_DOC=bugfix
@sergepetrenko sergepetrenko added the 2.11 Target is 2.11 and all newer release/master branches label Apr 3, 2025
github-actions bot pushed a commit that referenced this issue Apr 3, 2025 (cherry picked from commit be6fd7d)
github-actions bot pushed a commit that referenced this issue Apr 3, 2025 (cherry picked from commit be6fd7d)
github-actions bot pushed a commit that referenced this issue Apr 3, 2025 (cherry picked from commit be6fd7d)
sergepetrenko pushed a commit that referenced this issue Apr 3, 2025 (cherry picked from commit be6fd7d)
sergepetrenko pushed a commit that referenced this issue Apr 3, 2025 (cherry picked from commit be6fd7d)
sergepetrenko pushed a commit that referenced this issue Apr 3, 2025 (cherry picked from commit be6fd7d)