Speed up synchronous transaction queue processing · Issue #9917 · tarantool/tarantool

Closed
sergepetrenko opened this issue Apr 8, 2024 · 18 comments · Fixed by #10209
Assignees
Labels
2.11 (Target is 2.11 and all newer release/master branches), performance, qsync, replication

Comments

@sergepetrenko
Collaborator

Performance of synchronous replication degrades too much with the increase of parallel synchronous requests.
For example, on my machine with 6000 requests in parallel (box.info.synchro.queue.len reads 6000) master was only able to pull about 8000 RPS, and messages like this were frequent in the log:

2024-04-08 17:23:11.047 [43663] main txn.c:830 W> too long WAL write: 1 rows at LSN 11794161: 0.520 sec

The issue manifested itself even with replication_synchro_quorum = 1, meaning it's not related to network delay; the size of the quorum also didn't influence the results much. It seems the problem lies in the way synchronous transactions are processed in the queue.

Besides, when issuing the same 6000 concurrent requests to replace something in an async space, the RPS was as high as 300k, meaning the issue isn't related to batch finalization of transactions.

Most likely the cause of the degradation is the way txn_limbo_ack traverses the whole transaction list. In the example above txn_limbo_ack is always called with the LSN of the last of the 6000 transactions, but it still traverses the whole list and assigns ack_count separately to each transaction. We might improve this: persist an array of the acked LSNs, and once the ack LSN increases, find the point up to which everything should be committed via binary search, for example.
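
For illustration only, a minimal sketch of that binary search, assuming the queued synchronous transactions' LSNs are kept in an ascending array (names here are made up, not the actual limbo code):

#include <stdint.h>
#include <stddef.h>
#include <sys/types.h>

/*
 * Return the index of the last queued LSN covered by ack_lsn,
 * or -1 if the ack covers nothing yet.
 */
static ssize_t
find_confirm_point(const int64_t *lsns, size_t count, int64_t ack_lsn)
{
	ssize_t lo = 0, hi = (ssize_t)count - 1, res = -1;
	while (lo <= hi) {
		ssize_t mid = lo + (hi - lo) / 2;
		if (lsns[mid] <= ack_lsn) {
			res = mid;	/* Covered by the ack. */
			lo = mid + 1;	/* Try to extend further right. */
		} else {
			hi = mid - 1;
		}
	}
	return res;
}

Everything up to and including the returned index has been acked by that replica, so its ack counts could be bumped in bulk instead of one list node at a time.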

sergepetrenko added the bug, performance, qsync and replication labels and removed the bug label on Apr 8, 2024
CuriousGeorgiy self-assigned this on May 15, 2024
@Astronomax
Contributor
Astronomax commented Jun 2, 2024

I don't understand how an array of LSNs of acks will help. If you mean limbo->vclock, which already exists, then I understand. But this is not enough: we need to use something like std::deque instead of rlist to store txn_limbo_entry, because we need O(1) random access. In addition, std::deque has all the necessary properties: push back and pop back/front (for rolling back and confirming) in amortized O(1) time, and pointers to elements are not invalidated after a push.

@CuriousGeorgiy
Member
CuriousGeorgiy commented Jun 11, 2024

It doesn't seem like we need a deque; AFAIC, all we need is a ring buffer.

AFAIC, the idea was the following. We want to simplify this loop, which updates ack_counts and finds confirm_lsn:

rlist_foreach_entry(e, &limbo->queue, in_queue) {
	assert(e->ack_count <= VCLOCK_MAX);
	if (e->lsn > lsn)
		break;
	/*
	 * Sync transactions need to collect acks. Async
	 * transactions are automatically committed right
	 * after all the previous sync transactions are.
	 */
	if (!txn_has_flag(e->txn, TXN_WAIT_ACK)) {
		continue;
	} else if (e->lsn <= prev_lsn) {
		continue;
	} else if (++e->ack_count < replication_synchro_quorum) {
		continue;
	} else {
		confirm_lsn = e->lsn;
	}
}

Let's store a ring buffer of LSNs present in the limbo along with their ack_counts. The LSNs in this buffer are ordered. Essentially, all we need to do is find two points: prev_lsn and confirm_lsn (two binary searches, the second one being on a subarray) — and update the ack_counts for those LSNs.

We always remove confirmed entries from the beginning of the limbo:

assert(txn_limbo_first_entry(limbo) == entry);

So once we have confirm_lsn, we simply need to advance the beginning of the ring buffer to the entry following the one with confirm_lsn.
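
A minimal sketch of that last step, assuming a hypothetical ring buffer of (lsn, ack_count) slots kept in LSN order (illustrative names, not the real limbo structures):

#include <stdint.h>
#include <stddef.h>

struct ack_slot {
	int64_t lsn;
	int ack_count;
};

struct ack_ring {
	struct ack_slot *slots;
	size_t capacity;	/* A power of two, so (i & (capacity - 1)) wraps. */
	size_t head;		/* Logical index of the oldest unconfirmed entry. */
	size_t count;
};

/* Drop the confirmed prefix: every slot with lsn <= confirm_lsn. */
static void
ack_ring_confirm_up_to(struct ack_ring *r, int64_t confirm_lsn)
{
	while (r->count > 0 &&
	       r->slots[r->head & (r->capacity - 1)].lsn <= confirm_lsn) {
		r->head++;
		r->count--;
	}
}

The advance could also be done with one more binary search, but since each confirmed entry leaves the buffer exactly once, a linear drop of the prefix costs the same in total.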

@Astronomax
Contributor

@CuriousGeorgiy a ring buffer looks like a good solution in this case. However, the implementation that already exists in tarantool's fifo.h is not good enough here. The main problem is that push does not work in constant time. There is a magic constant FIFO_WATERMARK:

#define FIFO_WATERMARK (512 * sizeof(void*))

There is also code in fifo_push() that moves the entire used block to the beginning of the buffer once a watermark's worth of pops has accumulated since the last memmove:

fifo_push(struct fifo *q, void *ptr)
{
	/* reduce memory allocation and memmove
	 * effect by reusing free pointers buffer space only after the
	 * watermark frees reached. */
	if (unlikely(q->bottom >= FIFO_WATERMARK)) {
		memmove(q->buf, q->buf + q->bottom, q->bottom);
		q->top -= q->bottom;
		q->bottom = 0;
	}

Let's imagine the following pattern of accessing the fifo: N push operations, then 256 pops, 1 push, 256 pops, 1 push, 256 pops, 1 push, and so on.
It is clear that in this case it works in O(N^2) time asymptotically, not O(N) as we would like.
Not to mention that the existing API does not provide a pop from the end.
One more problem: you suggest storing lsn + ack_count as an element, but the current implementation in fifo.h only stores a void* as an element. Therefore, we would have to allocate memory for each element and store a pointer in the fifo. That is very expensive; I would rather allocate memory in small blocks holding several elements. However, this requires reworking the existing API.
Basically, I think a good solution would be to implement a "ring buffer" along these lines:

  • It will be a truly circular ring buffer, unlike what we have now in fifo.h (this won't affect the asymptotics, but in practice it should give a slight performance increase).
  • We will reallocate the ring buffer to twice its size on each overflow.
  • We will not shift the used block to the beginning whenever some constant amount of memory is freed at the front.
  • The size of the stored element will be specified in the "template" via a #define macro, as is done in heap.h.

Asymptotically, this should work no worse than a deque: amortized constant time on push, constant time on pop front/back. I propose implementing both versions, one on a deque and one on such a "ring buffer", and measuring the performance of each.
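
A rough sketch of such a ring buffer, using a concrete element struct instead of the #define "template" mentioned above (all names are made up for illustration): the capacity doubles on overflow, so push back is amortized O(1), and pop front / pop back are O(1).

#include <stdint.h>
#include <stdlib.h>

struct ring_elem {
	int64_t lsn;
	int ack_count;
};

struct ring {
	struct ring_elem *slots;
	size_t capacity;	/* Always a power of two (0 while empty). */
	size_t head;		/* Logical index of the first element. */
	size_t count;
};

static int
ring_push_back(struct ring *r, struct ring_elem v)
{
	if (r->count == r->capacity) {
		/* Double the capacity: one relocation per doubling keeps
		 * push amortized O(1). */
		size_t new_cap = r->capacity == 0 ? 8 : r->capacity * 2;
		struct ring_elem *fresh = malloc(new_cap * sizeof(*fresh));
		if (fresh == NULL)
			return -1;
		for (size_t i = 0; i < r->count; i++)
			fresh[i] = r->slots[(r->head + i) & (r->capacity - 1)];
		free(r->slots);
		r->slots = fresh;
		r->capacity = new_cap;
		r->head = 0;
	}
	r->slots[(r->head + r->count) & (r->capacity - 1)] = v;
	r->count++;
	return 0;
}

static int
ring_pop_front(struct ring *r, struct ring_elem *out)
{
	if (r->count == 0)
		return -1;
	*out = r->slots[r->head & (r->capacity - 1)];
	r->head++;
	r->count--;
	return 0;
}

static int
ring_pop_back(struct ring *r, struct ring_elem *out)
{
	if (r->count == 0)
		return -1;
	r->count--;
	*out = r->slots[(r->head + r->count) & (r->capacity - 1)];
	return 0;
}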

@Gerold103
Collaborator

What about this solution? Let's ack only the first txn. When it gathers the quorum, let's calculate the actual max confirmed LSN from limbo->vclock. Then we do O(1) work on each ack, and on the final confirming ack we do O(R + N) ops (R = replica count, N = txn count; the ops are: find the top confirmed LSN among R replicas and confirm N txns). That is instead of doing O(N) increments on each ack and then another O(N) on the final confirming ack.

In total: one increment per ack + at most O(32) checks on the final ack (32 is the max replica count; actually even 30, I think). Then there is no need for any circular buffers, sorting, allocations, etc.
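
A minimal sketch of that final-ack computation, assuming a plain array of per-replica acked LSNs like the one limbo->vclock tracks (a hypothetical helper, not the real vclock API; with at most 32 replicas a quadratic scan is fine for illustration):

#include <stdint.h>

/*
 * Return the highest LSN acked by at least `quorum` replicas,
 * or -1 if no LSN has gathered the quorum yet.
 */
static int64_t
max_quorum_lsn(const int64_t *replica_lsns, int replica_count, int quorum)
{
	int64_t best = -1;
	for (int i = 0; i < replica_count; i++) {
		/* How many replicas acked at least replica_lsns[i]? */
		int acked = 0;
		for (int j = 0; j < replica_count; j++)
			acked += replica_lsns[j] >= replica_lsns[i];
		if (acked >= quorum && replica_lsns[i] > best)
			best = replica_lsns[i];
	}
	return best;
}

The same value is the (R - Q)-th order statistic of the vclock components, which is essentially how it is computed later in this thread (vclock_kth_stat).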

@Astronomax
Contributor
Astronomax commented Jun 11, 2024

@Gerold103 mostly you're right. Let Q be the quorum. You proposed an algorithm in which confirming N transactions takes O(R*N) time. The idea of incrementing the ack_count of only the first transaction is good; without it, it would take O(Q*R*N) (if we walked limbo->vclock in O(R) for each ACK). However, note that even the current approach (the one currently in master), which goes transaction by transaction and increments ack_count, confirms N transactions in O(Q*N) time (each transaction is examined in the loop exactly Q times before leaving the limbo). And that is already no worse than what you propose.

Binsearch to confirm N transactions will, in the worst case, take O(Q*R*(log(N) + log(N - 1) + ... + log(1))) = O(Q*R*log(N!)) (if we are unlucky and transactions get confirmed one at a time). That is worse than O(R*Q*N) and also worse than O(Q*N). However, in reality confirmations most often cover fairly large prefixes at a time, so on average, in practice, binsearch should probably perform better.
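For reference, by Stirling's approximation log(N!) = log(1) + log(2) + ... + log(N) = Θ(N*log(N)), so the worst-case estimate O(Q*R*log(N!)) is O(Q*R*N*log(N)), which is indeed asymptotically worse than both O(R*Q*N) and O(Q*N).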

Therefore, in practice any of these approaches may turn out to be better for non-asymptotic reasons (for example, because of how the approach interacts with the cache).

@CuriousGeorgiy
Member
CuriousGeorgiy commented Jun 12, 2024

@Astronomax

However, the implementation that already exists in tarantool given in fifo.h is not good enough for this case.

How about writing some adapter for ibuf?

However, in reality, confirmations most often come to some fairly large prefixes at a time. Therefore, on average, in practice, binsearch should probably perform better.

O(R * N) is also the upper bound estimate for the solution @Gerold103 proposes — AFAIC, on average (i.e., in practice) it will be O(1) for the same reason.

@Astronomax
Contributor

I'm sorry, I just forgot that inside the binsearch the check still walks through limbo->vclock in O(R). Therefore, the solution proposed by @Gerold103 works faster than binsearch in any case; it simply does strictly fewer operations per ACK. There's no need for a binsearch at all.

@Gerold103
Collaborator

However, in reality, confirmations most often come to some fairly large prefixes at a time

This is exactly the idea: for a prefix of any length we walk it strictly once, when it is actually confirmed.
On a prefix of length 1 my approach would be worse than what we have (because we need to check R replicas), but that is at most 32 comparisons. Any other data structures or heap allocations would likely cost more in practice, even though algorithmically they might look better.

Another idea I just had: we could store the replicas' latest LSNs sorted in an rb tree, with at most R items in it. Then on each ack we bump the node corresponding to that replica (increase its LSN and update the node's position in the tree) and walk Q nodes from max to min to see which LSN has the quorum. On each ack that would be O(log(R)) to update the tree plus O(Q) to check the current max quorumed LSN, meaning O(Q) in total asymptotically, if I am not mistaken. But I really think the simple linear walk of R replicas on each ACK is also fine, given that R is <= 30 by design; mostly it will be < 10 in typical setups, I think.

We could also measure this in benchmarks. It should be easy to compare.

@Astronomax
Contributor
Astronomax commented Jun 12, 2024

Another idea I just had - we could store replicas latest LSNs sorted in an rb tree.

I thought about this too, and I also thought it is unnecessary, because the cluster size is limited to such a small number.
But if we do try something like this, it is better to use not an RB tree but a Cartesian tree. Then it will be possible to update a replica's latest LSN in O(log(R)), and also to find the current max quorumed LSN in O(log(R)) (the max quorumed LSN is the (R-Q)-th order statistic of limbo->vclock), meaning O(log(R)) in total asymptotically.
We just need to maintain the number of vertices in each subtree (this is easy to recalculate in O(log(R)) on updates in almost any such tree). Then we can find the k-th order statistic in O(log(R)) simply by descending the tree and comparing k with the number of vertices in the left and right subtrees. Most likely the same thing can be supported in an RB tree, an AVL tree, or any other self-balancing tree, but I know for sure it can easily be done in a Cartesian tree.
But even in this case, I think the simple linear walk of R replicas on each ACK will be faster: the overhead of such structures will most likely be too noticeable at these cluster sizes.
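
A minimal sketch of that order-statistic descent, assuming a hypothetical size-augmented search tree node (not tarantool's rb-tree API):

#include <stdint.h>
#include <stddef.h>

struct os_node {
	int64_t lsn;
	size_t size;		/* Nodes in this subtree, including self. */
	struct os_node *left;
	struct os_node *right;
};

static size_t
os_size(const struct os_node *n)
{
	return n == NULL ? 0 : n->size;
}

/* Return the k-th smallest LSN (0-based), or -1 if k is out of range. */
static int64_t
os_kth(const struct os_node *root, size_t k)
{
	const struct os_node *n = root;
	while (n != NULL) {
		size_t left = os_size(n->left);
		if (k < left) {
			n = n->left;
		} else if (k == left) {
			return n->lsn;	/* Exactly k smaller keys exist. */
		} else {
			k -= left + 1;	/* Skip the left subtree and this node. */
			n = n->right;
		}
	}
	return -1;
}

Calling it with k = R - Q would yield the max quorumed LSN discussed above.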

Could also measure in benchmarks. Must be easy to compare.

Yes, I think we need to try both approaches.

@Astronomax
Contributor
  • One of the problems is that the limbo is a list, and we did not remember anywhere which node we had already reached during the previous txn_limbo_ack call for an ACK from that replica. The worst case is when a node confirms transactions one at a time (this is guaranteed to happen on the master when each fiber creates one transaction). As a result, the confirmation asymptotics is O(Q*N^2).
  • Also, once we encounter a transaction with lsn = -1, the loop can be terminated, because all subsequent transactions will also have lsn = -1. This also gives a large performance gain.

This can be fixed with the following simple patch:

diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
index dd5aeec1b..fb383e8e9 100644
--- a/src/box/txn_limbo.c
+++ b/src/box/txn_limbo.c
@@ -53,6 +53,8 @@ txn_limbo_create(struct txn_limbo *limbo)
 	limbo->promote_greatest_term = 0;
 	latch_create(&limbo->promote_latch);
 	limbo->confirmed_lsn = 0;
+	for(int i = 0; i < VCLOCK_MAX; i++)
+		limbo->confirmed[i] = &limbo->queue;
 	limbo->rollback_count = 0;
 	limbo->is_in_rollback = false;
 	limbo->svp_confirmed_lsn = -1;
@@ -182,6 +184,12 @@ txn_limbo_remove(struct txn_limbo *limbo, struct txn_limbo_entry *entry)
 {
 	assert(!rlist_empty(&entry->in_queue));
 	assert(txn_limbo_first_entry(limbo) == entry);
+	for (int i = 0; i < VCLOCK_MAX; i++) {
+		if (limbo->confirmed[i] == &entry->in_queue) {
+			assert(limbo->confirmed[i]->prev == &limbo->queue);
+			limbo->confirmed[i] = limbo->confirmed[i]->prev;
+		}
+	}
 	rlist_del_entry(entry, in_queue);
 	limbo->len--;
 }
@@ -193,6 +201,9 @@ txn_limbo_pop(struct txn_limbo *limbo, struct txn_limbo_entry *entry)
 	assert(txn_limbo_last_entry(limbo) == entry);
 	assert(entry->is_rollback);
 
+	for (int i = 0; i < VCLOCK_MAX; i++)
+		if (limbo->confirmed[i] == &entry->in_queue)
+			limbo->confirmed[i] = limbo->confirmed[i]->prev;
 	rlist_del_entry(entry, in_queue);
 	limbo->len--;
 	++limbo->rollback_count;
@@ -681,11 +692,15 @@ txn_limbo_ack(struct txn_limbo *limbo, uint32_t replica_id, int64_t lsn)
 	if (lsn == prev_lsn)
 		return;
 	vclock_follow(&limbo->vclock, replica_id, lsn);
-	struct txn_limbo_entry *e;
+	if (limbo->confirmed[replica_id]->next == &limbo->queue)
+		return;
+	struct txn_limbo_entry *e =
+		rlist_entry(limbo->confirmed[replica_id]->next,
+			    struct txn_limbo_entry, in_queue);
 	int64_t confirm_lsn = -1;
-	rlist_foreach_entry(e, &limbo->queue, in_queue) {
+	for (; !rlist_entry_is_head(e, &limbo->queue, in_queue); e = rlist_next_entry(e, in_queue)) {
 		assert(e->ack_count <= VCLOCK_MAX);
-		if (e->lsn > lsn)
+		if (e->lsn > lsn || e->lsn == -1)
 			break;
 		/*
 		 * Sync transactions need to collect acks. Async
@@ -702,6 +717,7 @@ txn_limbo_ack(struct txn_limbo *limbo, uint32_t replica_id, int64_t lsn)
 			confirm_lsn = e->lsn;
 		}
 	}
+	limbo->confirmed[replica_id] = e->in_queue.prev;
 	if (confirm_lsn == -1 || confirm_lsn <= limbo->confirmed_lsn)
 		return;
 	txn_limbo_write_confirm(limbo, confirm_lsn);
diff --git a/src/box/txn_limbo.h b/src/box/txn_limbo.h
index 36d6a7f69..654d5b14c 100644
--- a/src/box/txn_limbo.h
+++ b/src/box/txn_limbo.h
@@ -168,6 +168,11 @@ struct txn_limbo {
 	 * illegal.
 	 */
 	int64_t confirmed_lsn;
+	/**
+	 * A pointer to the list node corresponding to the last confirmed entry
+	 * for each replica.
+	 */
+	struct rlist *confirmed[VCLOCK_MAX];
 	/**
 	 * Total number of performed rollbacks. It used as a guard
 	 * to do some actions assuming all limbo transactions will

Now the asymptotics of confirming N transactions is O(MAXR*N). This can be optimized to O(Q*N), and then optimized further, as we have already discussed.
With replication_synchro_quorum = 1 it now works very fast:

a-kuzdnets@a-kuzdnets:~/dev/tarantool$ ./build/src/tarantool perf/lua/1mops_write.lua --nodes=1 --fibers=6000 --ops=1000000 --transaction=1 --warmup=10 --sync
# making 1000000 REPLACE operations,
# 1 operations per txn,
# using 6000 fibers,
# in a replicaset of 1 nodes,
# using HASH index type
# with WAL mode write
# 
# promoting
# done
# Warmup... done, lsn: 	101123
# master done 1890885 ops in time: 7.198743, cpu: 11.153583
# master average speed	262668	ops/sec
# master peak speed	322021	ops/sec
1mops_master_rps	322021

With replication_synchro_quorum = 2 it is not as fast, but still about 8 times faster than before.
Before:

a-kuzdnets@a-kuzdnets:~/dev/tarantool$ perf/lua/1mops_write.lua --nodes=2 --fibers=6000 --ops=1000000 --transaction=1 --warmup=10 --sync
# making 1000000 REPLACE operations,
# 1 operations per txn,
# using 6000 fibers,
# in a replicaset of 2 nodes,
# using HASH index type
# with WAL mode write
# 
# starting 1 replicas
# replication	1
# promoting
# done
# Warmup... done, lsn: 	101950
# master done 894395 ops in time: 79.955543, cpu: 85.692745
# master average speed	11186	ops/sec
# master peak speed	11719	ops/sec
1mops_master_rps	11719
# replicas done 894395 ops in time: 79.955577, cpu: 85.692777
1mops_replica_rps	11186

After:

a-kuzdnets@a-kuzdnets:~/dev/tarantool$ ./build/src/tarantool perf/lua/1mops_write.lua --nodes=2 --fibers=6000 --ops=1000000 --transaction=1 --warmup=10 --sync
# making 1000000 REPLACE operations,
# 1 operations per txn,
# using 6000 fibers,
# in a replicaset of 2 nodes,
# using HASH index type
# with WAL mode write
# 
# starting 1 replicas
# replication	1
# promoting
# done
# Warmup... done, lsn: 	104824
# master done 911494 ops in time: 10.871526, cpu: 24.327279
# master average speed	83842	ops/sec
# master peak speed	211651	ops/sec
1mops_master_rps	211651
# replicas done 911494 ops in time: 10.871579, cpu: 24.327330
1mops_replica_rps	83841

@Astronomax
Contributor
Astronomax commented Jun 23, 2024

Perf report before (txn_limbo_ack was at the top of the list):

Samples: 82K of event 'cycles:P', Event count (approx.): 64080043609
  Children      Self  Command          Shared Object         Symbol
+   66,98%     0,15%  tarantool        tarantool             [.] lj_BC_FUNCC
+   52,71%     0,00%  tarantool        tarantool             [.] lbox_commit
+   52,71%     0,01%  tarantool        tarantool             [.] box_txn_commit
+   52,68%     0,03%  tarantool        tarantool             [.] txn_commit
+   51,59%    22,45%  tarantool        tarantool             [.] txn_limbo_ack
+   30,20%    29,17%  tarantool        tarantool             [.] txn_has_flag
+   17,01%     6,60%  tarantool        libc.so.6             [.] __memset_evex_unaligned_erms
+   13,76%     0,00%  tarantool        [unknown]             [.] 0x00000000413883b8
+   13,63%     0,10%  tarantool        [kernel.kallsyms]     [k] asm_exc_page_fault
+   11,29%     0,00%  tarantool        tarantool             [.] lbox_fiber_create
+   11,28%     0,00%  tarantool        tarantool             [.] fiber_create

And now it is at the bottom.
replication_synchro_quorum = 1:

+    6,93%     0,21%  wal        tarantool             [.] xlog_tx_write_plain                              
+    6,92%     0,00%  tarantool  tarantool             [.] slab_put                                          
+    6,70%     0,16%  tarantool  [kernel.kallsyms]     [k] do_anonymous_page                        
+    6,56%     0,26%  tarantool  tarantool             [.] txn_limbo_ack                                  
+    6,48%     0,11%  wal        tarantool             [.] fio_writevn                                      
+    6,08%     0,07%  wal        tarantool             [.] fio_batch_write                             
+    6,04%     0,12%  wal        libc.so.6             [.] __GI___writev                                     

replication_synchro_quorum = 2:

+    1,00%     0,08%  tarantool        [kernel.kallsyms]     [k] __wake_up_common                       
+    1,00%     0,29%  wal              tarantool             [.] xrow_header_encode                     
+    1,00%     0,10%  tarantool        [kernel.kallsyms]     [k] __mem_cgroup_charge                 
+    0,99%     0,13%  tarantool        tarantool             [.] txn_limbo_ack                          
+    0,98%     0,01%  tarantool        tarantool             [.] txn_run_wal_write_triggers              
+    0,98%     0,10%  tarantool        [kernel.kallsyms]     [k] ep_poll_callback                      
+    0,97%     0,01%  tarantool        [kernel.kallsyms]     [k] schedule_hrtimeout_range      

Now txn_limbo_ack is not a bottleneck.

@Gerold103
Collaborator

Nice, great findings 🔥! The only drawback I can foresee is the manipulation of raw rlist pointers: it requires carefully watching for when a node is deleted so that its pointer in this list gets updated. Also, if you get an ACK for, let's say, N txns and the quorum is Q, you still do N*Q increments and then N times walk the list of at most 32 rlist pointers.

Whereas with the approach of incrementing only the first txn, for a batch of size N you would do only N increments and walk the list of at most 32 rlist pointers just once, no? Have you tried this?

@Astronomax
Contributor
Astronomax commented Jul 1, 2024

@Gerold103

  • Unfortunately, it is impossible to avoid manipulating raw rlist pointers, because confirmed entries are not deleted immediately after confirmation, and we have to keep the "first synchronous transaction" pointer in a correct state everywhere and recalculate it in the same places. But in both approaches these manipulations can be moved to a single place, txn_limbo_ack. (If we additionally store an LSN next to these pointers and compare it in txn_limbo_ack with the LSNs of the first and last entries in the limbo, we can figure out how to recalculate the pointers; for example, we can detect that the entry has already been confirmed and removed from the limbo, or rolled back and removed from it.)
  • This approach doesn't do any more increments than the previous one, and in many cases far fewer, which is a serious advantage. In the previous approach we inevitably do N*Q increments, regardless of how transactions get confirmed.
  • In this approach there is no need to traverse the list of at most 32 rlist pointers N times. But the previous approach can be modified to traverse that list exactly once per confirmed or rolled back block.
  • For this approach to work as well as the previous one, vclock_kth_stat would have to run in O(R); I would consider this a disadvantage of the approach. For now I implemented it naively in O(R^2) for testing on small clusters.
diff --git a/src/box/txn.c b/src/box/txn.c
index 1257f0305..2eeb85fd5 100644
--- a/src/box/txn.c
+++ b/src/box/txn.c
@@ -733,6 +733,7 @@ txn_complete_fail(struct txn *txn)
 	assert(in_txn() == txn);
 	if (txn->limbo_entry != NULL) {
 		assert(txn_has_flag(txn, TXN_WAIT_SYNC));
+		txn_limbo_on_rollback(&txn_limbo, txn->limbo_entry);
 		txn_limbo_abort(&txn_limbo, txn->limbo_entry);
 		txn->limbo_entry = NULL;
 	}
diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
index dd5aeec1b..123ddc16f 100644
--- a/src/box/txn_limbo.c
+++ b/src/box/txn_limbo.c
@@ -53,6 +53,8 @@ txn_limbo_create(struct txn_limbo *limbo)
 	limbo->promote_greatest_term = 0;
 	latch_create(&limbo->promote_latch);
 	limbo->confirmed_lsn = 0;
+	limbo->confirmed_node = &limbo->queue;
+	limbo->confirmed_node_valid = false;
 	limbo->rollback_count = 0;
 	limbo->is_in_rollback = false;
 	limbo->svp_confirmed_lsn = -1;
@@ -168,7 +170,6 @@ txn_limbo_append(struct txn_limbo *limbo, uint32_t id, struct txn *txn)
 	}
 	e->txn = txn;
 	e->lsn = -1;
-	e->ack_count = 0;
 	e->is_commit = false;
 	e->is_rollback = false;
 	e->insertion_time = fiber_clock();
@@ -246,13 +247,7 @@ txn_limbo_assign_local_lsn(struct txn_limbo *limbo,
 	 * replicas. Update the ACK counter to take them into
 	 * account.
 	 */
-	struct vclock_iterator iter;
-	vclock_iterator_init(&iter, &limbo->vclock);
-	int ack_count = 0;
-	vclock_foreach(&iter, vc)
-		ack_count += vc.lsn >= lsn;
-	assert(ack_count >= entry->ack_count);
-	entry->ack_count = ack_count;
+	limbo->confirmed_node_valid = false;
 }
 
 void
@@ -268,6 +263,20 @@ txn_limbo_assign_lsn(struct txn_limbo *limbo, struct txn_limbo_entry *entry,
 static void
 txn_limbo_write_rollback(struct txn_limbo *limbo, int64_t lsn);
 
+void
+txn_limbo_on_rollback(struct txn_limbo *limbo, struct txn_limbo_entry *last_rollback)
+{
+	assert(txn_has_flag(last_rollback->txn, TXN_WAIT_ACK));
+	if (limbo->confirmed_node == &limbo->queue)
+		return;
+	int64_t confirmed_lsn = rlist_entry(limbo->confirmed_node,
+					    struct txn_limbo_entry, in_queue)->lsn;
+	if (last_rollback->lsn <= confirmed_lsn) {
+		txn_limbo.confirmed_node = rlist_prev(&last_rollback->in_queue);
+		limbo->confirmed_node_valid = false;
+	}
+}
+
 int
 txn_limbo_wait_complete(struct txn_limbo *limbo, struct txn_limbo_entry *entry)
 {
@@ -320,6 +329,8 @@ txn_limbo_wait_complete(struct txn_limbo *limbo, struct txn_limbo_entry *entry)
 	}
 
 	txn_limbo_write_rollback(limbo, entry->lsn);
+	txn_limbo_on_rollback(limbo, entry);
+
 	struct txn_limbo_entry *e, *tmp;
 	rlist_foreach_entry_safe_reverse(e, &limbo->queue,
 					 in_queue, tmp) {
@@ -444,14 +455,31 @@ txn_limbo_write_confirm(struct txn_limbo *limbo, int64_t lsn)
 	txn_limbo_write_synchro(limbo, IPROTO_RAFT_CONFIRM, lsn, 0, NULL);
 }
 
+void
+txn_limbo_on_confirm(struct txn_limbo *limbo, struct txn_limbo_entry *last_confirm)
+{
+	if (limbo->confirmed_node == &limbo->queue)
+		return;
+	int64_t confirmed_lsn =
+		rlist_entry(limbo->confirmed_node,
+			    struct txn_limbo_entry, in_queue)->lsn;
+	assert(confirmed_lsn != -1);
+	if (last_confirm->lsn >= confirmed_lsn) {
+		limbo->confirmed_node = &limbo->queue;
+		limbo->confirmed_node_valid = false;
+	}
+}
+
 /** Confirm all the entries <= @a lsn. */
 static void
 txn_limbo_read_confirm(struct txn_limbo *limbo, int64_t lsn)
 {
 	assert(limbo->owner_id != REPLICA_ID_NIL || txn_limbo_is_empty(limbo));
 	assert(limbo == &txn_limbo);
-	struct txn_limbo_entry *e, *tmp;
-	rlist_foreach_entry_safe(e, &limbo->queue, in_queue, tmp) {
+	assert(lsn != -1);
+
+	struct txn_limbo_entry *last_confirm = NULL, *e;
+	rlist_foreach_entry(e, &limbo->queue, in_queue) {
 		/*
 		 * Check if it is an async transaction last in the queue. When
 		 * it is last, it does not depend on a not finished sync
@@ -468,7 +496,15 @@ txn_limbo_read_confirm(struct txn_limbo *limbo, int64_t lsn)
 			 */
 			if (e->lsn == -1)
 				break;
-		} else if (e->txn->signature == TXN_SIGNATURE_UNKNOWN) {
+		}
+		last_confirm = e;
+	}
+	if (last_confirm == NULL)
+		return;
+	txn_limbo_on_confirm(limbo, last_confirm);
+	struct txn_limbo_entry *tmp;
+	rlist_foreach_entry_safe(e, &limbo->queue, in_queue, tmp) {
+		if (e->txn->signature == TXN_SIGNATURE_UNKNOWN) {
 			/*
 			 * A transaction might be covered by the CONFIRM even if
 			 * it is not written to WAL yet when it is an async
@@ -509,6 +545,8 @@ txn_limbo_read_confirm(struct txn_limbo *limbo, int64_t lsn)
 		 */
 		assert(e->txn->signature >= 0);
 		txn_limbo_complete(e->txn, true);
+		if (e == last_confirm)
+			break;
 	}
 	/*
 	 * Track CONFIRM lsn on replica in order to detect split-brain by
@@ -553,6 +591,7 @@ txn_limbo_read_rollback(struct txn_limbo *limbo, int64_t lsn)
 	}
 	if (last_rollback == NULL)
 		return;
+	txn_limbo_on_rollback(limbo, last_rollback);
 	rlist_foreach_entry_safe_reverse(e, &limbo->queue, in_queue, tmp) {
 		txn_limbo_abort(limbo, e);
 		txn_clear_flags(e->txn, TXN_WAIT_ACK);
@@ -649,6 +688,49 @@ txn_limbo_read_demote(struct txn_limbo *limbo, int64_t lsn)
 	return txn_limbo_read_promote(limbo, REPLICA_ID_NIL, lsn);
 }
 
+bool
+txn_limbo_confirmed_node_make_valid(struct txn_limbo *limbo)
+{
+	struct txn_limbo_entry *e =
+		rlist_entry(rlist_next(limbo->confirmed_node),
+			    struct txn_limbo_entry, in_queue);
+	if (rlist_entry_is_head(e, &limbo->queue, in_queue))
+		return false;
+	if (!limbo->confirmed_node_valid) {
+		for (; !rlist_entry_is_head(e, &limbo->queue, in_queue);
+		       e = rlist_next_entry(e, in_queue))
+			if (txn_has_flag(e->txn, TXN_WAIT_ACK))
+				break;
+		limbo->confirmed_node = rlist_prev(&e->in_queue);
+		if (rlist_entry_is_head(e, &limbo->queue, in_queue) || e->lsn == -1)
+			return false;
+		limbo->confirmed_node_valid = true;
+		limbo->ack_count = vclock_count_ge(&limbo->vclock, e->lsn);
+	}
+	return true;
+}
+
+int64_t
+txn_limbo_update_confirmed_node(struct txn_limbo *limbo)
+{
+	assert(limbo->confirmed_node_valid);
+	if (limbo->ack_count < replication_synchro_quorum)
+		return -1;
+	struct txn_limbo_entry *e =
+		rlist_entry(rlist_next(limbo->confirmed_node),
+			    struct txn_limbo_entry, in_queue);
+	int32_t k = (int32_t)vclock_size(&limbo->vclock) - replication_synchro_quorum;
+	int64_t confirm_lsn = (k < 0) ? 0 : vclock_kth_stat(&limbo->vclock, k);
+	assert(confirm_lsn != -1);
+	for (; !rlist_entry_is_head(e, &limbo->queue, in_queue);
+	       e = rlist_next_entry(e, in_queue))
+		if (e->lsn == -1 || e->lsn > confirm_lsn)
+			break;
+	limbo->confirmed_node = rlist_prev(&e->in_queue);
+	limbo->confirmed_node_valid = false;
+	return confirm_lsn;
+}
+
 void
 txn_limbo_ack(struct txn_limbo *limbo, uint32_t replica_id, int64_t lsn)
 {
@@ -671,6 +753,8 @@ txn_limbo_ack(struct txn_limbo *limbo, uint32_t replica_id, int64_t lsn)
 		return;
 	assert(limbo->owner_id != REPLICA_ID_NIL);
 	int64_t prev_lsn = vclock_get(&limbo->vclock, replica_id);
+
+	assert(lsn >= prev_lsn);
 	/*
 	 * One of the reasons why can happen - the remote instance is not
 	 * read-only and wrote something under its own insance_id. For qsync
@@ -681,29 +765,26 @@ txn_limbo_ack(struct txn_limbo *limbo, uint32_t replica_id, int64_t lsn)
 	if (lsn == prev_lsn)
 		return;
 	vclock_follow(&limbo->vclock, replica_id, lsn);
-	struct txn_limbo_entry *e;
-	int64_t confirm_lsn = -1;
-	rlist_foreach_entry(e, &limbo->queue, in_queue) {
-		assert(e->ack_count <= VCLOCK_MAX);
-		if (e->lsn > lsn)
-			break;
-		/*
-		 * Sync transactions need to collect acks. Async
-		 * transactions are automatically committed right
-		 * after all the previous sync transactions are.
-		 */
-		if (!txn_has_flag(e->txn, TXN_WAIT_ACK)) {
-			continue;
-		} else if (e->lsn <= prev_lsn) {
-			continue;
-		} else if (++e->ack_count < replication_synchro_quorum) {
-			continue;
-		} else {
-			confirm_lsn = e->lsn;
-		}
+
+	struct txn_limbo_entry *e = rlist_entry(rlist_next(limbo->confirmed_node),
+						struct txn_limbo_entry, in_queue);
+	if (!limbo->confirmed_node_valid) {
+		if (!txn_limbo_confirmed_node_make_valid(limbo))
+			return;
+		e = rlist_entry(rlist_next(limbo->confirmed_node),
+				struct txn_limbo_entry, in_queue);
+	} else {
+		assert(txn_has_flag(e->txn, TXN_WAIT_ACK));
+		if (e->lsn == -1 || e->lsn <= prev_lsn || lsn < e->lsn)
+			return;
+		++limbo->ack_count;
 	}
-	if (confirm_lsn == -1 || confirm_lsn <= limbo->confirmed_lsn)
+
+	int64_t confirm_lsn = txn_limbo_update_confirmed_node(limbo);
+	if (confirm_lsn == -1)
 		return;
+	assert(confirm_lsn > limbo->confirmed_lsn);
+
 	txn_limbo_write_confirm(limbo, confirm_lsn);
 	txn_limbo_read_confirm(limbo, confirm_lsn);
 }
@@ -1245,23 +1326,19 @@ txn_limbo_on_parameters_change(struct txn_limbo *limbo)
 {
 	if (rlist_empty(&limbo->queue) || txn_limbo_is_frozen(limbo))
 		return;
-	struct txn_limbo_entry *e;
-	int64_t confirm_lsn = -1;
-	rlist_foreach_entry(e, &limbo->queue, in_queue) {
-		assert(e->ack_count <= VCLOCK_MAX);
-		if (!txn_has_flag(e->txn, TXN_WAIT_ACK)) {
-			continue;
-		} else if (e->ack_count < replication_synchro_quorum) {
-			continue;
-		} else {
-			confirm_lsn = e->lsn;
-			assert(confirm_lsn > 0);
-		}
-	}
-	if (confirm_lsn > limbo->confirmed_lsn && !limbo->is_in_rollback) {
+
+	if (!limbo->confirmed_node_valid && !txn_limbo_confirmed_node_make_valid(limbo))
+		goto broadcast;
+
+	int64_t confirm_lsn = txn_limbo_update_confirmed_node(limbo);
+	if (confirm_lsn != -1 && confirm_lsn > limbo->confirmed_lsn &&
+		!limbo->is_in_rollback)
+	{
 		txn_limbo_write_confirm(limbo, confirm_lsn);
 		txn_limbo_read_confirm(limbo, confirm_lsn);
 	}
+
+	broadcast:
 	/*
 	 * Wakeup all the others - timed out will rollback. Also
 	 * there can be non-transactional waiters, such as CONFIRM
diff --git a/src/box/txn_limbo.h b/src/box/txn_limbo.h
index 36d6a7f69..419734987 100644
--- a/src/box/txn_limbo.h
+++ b/src/box/txn_limbo.h
@@ -57,11 +57,6 @@ struct txn_limbo_entry {
 	 * written to WAL yet.
 	 */
 	int64_t lsn;
-	/**
-	 * Number of ACKs. Or in other words - how many replicas
-	 * confirmed receipt of the transaction.
-	 */
-	int ack_count;
 	/**
 	 * Result flags. Only one of them can be true. But both
 	 * can be false if the transaction is still waiting for
@@ -168,6 +163,12 @@ struct txn_limbo {
 	 * illegal.
 	 */
 	int64_t confirmed_lsn;
+
+	struct rlist *confirmed_node;
+
+	bool confirmed_node_valid;
+	
+	int ack_count;
 	/**
 	 * Total number of performed rollbacks. It used as a guard
 	 * to do some actions assuming all limbo transactions will
@@ -296,6 +297,9 @@ txn_limbo_replica_confirmed_lsn(const struct txn_limbo *limbo,
 struct txn_limbo_entry *
 txn_limbo_last_synchro_entry(struct txn_limbo *limbo);
 
+void
+txn_limbo_on_rollback(struct txn_limbo *limbo, struct txn_limbo_entry *entry);
+
 /**
  * Allocate, create, and append a new transaction to the limbo.
  * The limbo entry is allocated on the transaction's region.
diff --git a/src/lib/vclock/vclock.c b/src/lib/vclock/vclock.c
index 81d3eeb50..d465ccbf4 100644
--- a/src/lib/vclock/vclock.c
+++ b/src/lib/vclock/vclock.c
@@ -175,4 +175,36 @@ vclockset_node_compare(const struct vclock *a, const struct vclock *b)
 	return res;
 }
 
+int64_t
+vclock_kth_stat(const struct vclock *vclock, uint32_t k)
+{
+	if (k >= vclock_size(vclock))
+		return -1;
+	struct vclock_iterator it1, it2;
+
+	vclock_iterator_init(&it1, vclock);
+	vclock_foreach(&it1, vc1) {
+		uint32_t le = 0, leq = 0;
+		vclock_iterator_init(&it2, vclock);
+		vclock_foreach(&it2, vc2) {
+			le += vc2.lsn < vc1.lsn;
+			leq += vc2.lsn <= vc1.lsn;
+		}
+		if (le <= k && k < leq)
+			return vc1.lsn;
+	}
+	unreachable();
+}
+
+int
+vclock_count_ge(const struct vclock *vclock, int64_t lsn)
+{
+	int count = 0;
+	struct vclock_iterator it;
+	vclock_iterator_init(&it, vclock);
+	vclock_foreach(&it, vc1)
+		count += vc1.lsn >= lsn;
+	return count;
+}
+
 rb_gen(, vclockset_, vclockset_t, struct vclock, link, vclockset_node_compare);
diff --git a/src/lib/vclock/vclock.h b/src/lib/vclock/vclock.h
index 85aa011be..3cdb9b0d7 100644
--- a/src/lib/vclock/vclock.h
+++ b/src/lib/vclock/vclock.h
@@ -481,6 +481,12 @@ vclockset_match(vclockset_t *set, const struct vclock *key)
 	return vclockset_first(set);
 }
 
+int64_t
+vclock_kth_stat(const struct vclock *vclock, uint32_t k);
+
+int
+vclock_count_ge(const struct vclock *vclock, int64_t lsn);
+
 #define vclockset_foreach(set, vclock) \
 	for ((vclock) = vclockset_first(set); \
 	     (vclock) != NULL; \

The results are approximately the same as in the previous approach:
nodes = 1:

a-kuzdnets@a-kuzdnets:~/dev/tarantool$ ./build/src/tarantool perf/lua/1mops_write.lua --nodes=1 --fibers=6000 --ops=1000000 --transaction=1 --warmup=10 --sync
# making 1000000 REPLACE operations,
# 1 operations per txn,
# using 6000 fibers,
# in a replicaset of 1 nodes,
# using HASH index type
# with WAL mode write
# 
# promoting
# done
# Warmup... done, lsn: 	101123
# master done 1890885 ops in time: 7.367708, cpu: 11.334037
# master average speed	256644	ops/sec
# master peak speed	320193	ops/sec
1mops_master_rps	320193

nodes = 2:

a-kuzdnets@a-kuzdnets:~/dev/tarantool$ ./build/src/tarantool perf/lua/1mops_write.lua --nodes=2 --fibers=6000 --ops=1000000 --transaction=1 --warmup=10 --sync
# making 1000000 REPLACE operations,
# 1 operations per txn,
# using 6000 fibers,
# in a replicaset of 2 nodes,
# using HASH index type
# with WAL mode write
# 
# starting 1 replicas
# replication	1
# promoting
# done
# Warmup... done, lsn: 	101999
# master done 919303 ops in time: 10.811803, cpu: 24.635652
# master average speed	85027	ops/sec
# master peak speed	167633	ops/sec
1mops_master_rps	167633
# replicas done 919303 ops in time: 10.811855, cpu: 24.635701
1mops_replica_rps	85027

But in this test transactions are confirmed one at a time, so no speedup compared to the previous results was expected here. We need to test the case where, for example, one fiber writes many transactions, so that transactions are confirmed in large batches:
previous approach (slightly modified to traverse the list of at most 32 rlist pointers exactly once per confirmed or rolled back block):

a-kuzdnets@a-kuzdnets:~/dev/tarantool$ ./build/src/tarantool perf/lua/1mops_write.lua --nodes=31 --fibers=1 --ops=1200000 --transaction=100000 --warmup=10 --sync
# making 1200000 REPLACE operations,
# 100000 operations per txn,
# using 1 fibers,
# in a replicaset of 31 nodes,
# using HASH index type
# with WAL mode write
# 
# starting 30 replicas
# replication	1
# replication	1
# replication	3
# replication	18
# replication	29
# promoting
# done
# Warmup... # master done 100039 ops in time: 5.473254, cpu: 20.492166
# master average speed	18277	ops/sec
# master peak speed	991812	ops/sec
1mops_master_rps	991812
# replicas done 100039 ops in time: 7.093179, cpu: 22.649534
1mops_replica_rps	14103

current approach:

a-kuzdnets@a-kuzdnets:~/dev/tarantool$ ./build/src/tarantool perf/lua/1mops_write.lua --nodes=31 --fibers=1 --ops=1200000 --transaction=100000 --warmup=10 --sync
# making 1200000 REPLACE operations,
# 100000 operations per txn,
# using 1 fibers,
# in a replicaset of 31 nodes,
# using HASH index type
# with WAL mode write
# 
# starting 30 replicas
# replication	1
# replication	1
# replication	1
# replication	22
# promoting
# done
# Warmup... # master done 100039 ops in time: 5.625423, cpu: 18.874609
# master average speed	17783	ops/sec
# master peak speed	1005546	ops/sec
1mops_master_rps	1005546
# replicas done 100039 ops in time: 5.856255, cpu: 19.025775
1mops_replica_rps	17082

We did not get any speedup, apparently because even in the previous approach the confirmation-collecting machinery took up only a few percent in the perf report. The test is the same as above, with transactions confirmed in large blocks:
previous approach (slightly modified to traverse the list of at most 32 rlist pointers exactly once per confirmed or rolled back block):

  Children      Self  Command    Shared Object  Symbol
+    1,51%     0,00%  tarantool  tarantool      [.] txn_limbo_read_confirm
+    0,77%     0,01%  tarantool  tarantool      [.] txn_complete_fail
+    0,73%     0,00%  tarantool  tarantool      [.] txn_limbo_read_rollback
     0,03%     0,00%  tarantool  tarantool      [.] txn_limbo_ack

current approach:

  Children      Self  Command    Shared Object  Symbol
+    1,58%     0,00%  tarantool  tarantool      [.] txn_limbo_read_confirm
+    0,55%     0,00%  tarantool  tarantool      [.] txn_complete_fail
+    0,51%     0,00%  tarantool  tarantool      [.] txn_limbo_read_rollback
     0,03%     0,00%  tarantool  tarantool      [.] txn_limbo_ack

P.S. All results given for this and the previous approaches were obtained in Debug build mode.

@Gerold103
Collaborator

Thanks for checking and for working on this 💪🔥!

Because confirmed entries are not deleted immediately after confirmation

Are you sure? txn_limbo_read_confirm() in the loop calls txn_limbo_remove() for each confirmed entry before actually finishing it. Same with the rollback - txn_limbo_read_rollback() calls txn_limbo_abort() which removes the entry.

The code still feels more complex than it should be, tbh. Confirmed entries do not remain in the limbo after confirmation. There are not even any yields, because the on-commit/rollback triggers can't yield. Or am I missing something?

By "previous" you mean the variant where we store multiple rlist pointers, right? And by "current" the one where you store only one?

All results given for this and the previous approaches were obtained in Debug build mode.

Oh. That might be problematic. Debug vs Release can give quite different results. Debug might only make sense for checking the flamegraphs; for max RPS, a Release build is essential.

@Astronomax
Contributor

txn_limbo_read_confirm() in the loop calls txn_limbo_remove()

This is true, but first txn_limbo_ack calls txn_limbo_write_confirm -> txn_limbo_write_synchro -> synchro_request_write -> wal_write.

tarantool/src/box/wal.c, lines 1397 to 1405 in 319357d:

	if (wal_write_async(journal, entry) != 0)
		return -1;
	assert(!entry->is_complete);
	do {
		fiber_yield();
	} while (!entry->is_complete);
	return 0;

In wal_write_async we hand the task to the "wal" thread and then wait in a loop, calling fiber_yield until the confirmation entry is written.
During this time we may receive an ACK for some other entry from a replica, right?
In addition, the mechanism that enforces the wal_queue_max_size limit also transfers control to other fibers (journal_write -> journal_queue_flush).

As "previous" you refer to the way where we store multiple rlist pointers, right? And "current" where you store only one?

Yes.

@Gerold103
Collaborator

Ok, I see now. During CONFIRM WAL write the confirmed LSN might move even further forward, and we would have to go beyond the first TXN's ack count to check that.

It feels like we are missing some huge simplification here. We shouldn't even need to store ack_count in each synchro entry if we increment just one of them at a time. Perhaps we could store ack_count in the limbo itself and reset it on each bump of confirmed_lsn?
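
A very rough sketch of that simplification, with hypothetical field and function names (not the final patch): a single ack counter lives in the limbo and tracks only the first unconfirmed synchronous transaction.

#include <stdint.h>
#include <stdbool.h>

struct limbo_sketch {
	int64_t first_sync_lsn;	/* First unconfirmed sync txn, -1 if none. */
	int64_t confirmed_lsn;
	int ack_count;		/* Acks collected by that first txn. */
	int quorum;
};

/* Handle one ACK; returns true when the first txn just gathered the quorum. */
static bool
limbo_sketch_ack(struct limbo_sketch *l, int64_t prev_lsn, int64_t lsn)
{
	if (l->first_sync_lsn == -1)
		return false;
	/* Count the replica only if this ACK newly covers the first txn. */
	if (prev_lsn >= l->first_sync_lsn || lsn < l->first_sync_lsn)
		return false;
	return ++l->ack_count >= l->quorum;
}

/* Called once confirmed_lsn advances: start counting for the next head txn.
 * In a real implementation the counter would be recomputed from the vclock
 * (cf. vclock_count_ge in the patch above), since the new head may already
 * have been acked by some replicas. */
static void
limbo_sketch_on_confirm(struct limbo_sketch *l, int64_t new_confirmed_lsn,
			int64_t next_sync_lsn)
{
	l->confirmed_lsn = new_confirmed_lsn;
	l->first_sync_lsn = next_sync_lsn;
	l->ack_count = 0;
}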

Also, you said:

We did not get any speedup, apparently because even in the previous approach, the mechanism for collecting confirmations took up several percent in the perf report

But I see in your reports this:
Previous approach:

# replicas done 100039 ops in time: 7.093179, cpu: 22.649534
1mops_replica_rps	14103

New approach:

# replicas done 100039 ops in time: 5.856255, cpu: 19.025775
1mops_replica_rps	17082

That is quite a noticeable speedup, no? 14103 -> 17082. Anyway, we need to re-measure in Release to be more realistic.

@Astronomax
Contributor
Astronomax commented Jul 2, 2024

I think this can be done. In the implementation of the new approach I already store ack_count only inside txn_limbo.
There is one small problem, though: we need to know the LSN of the first unconfirmed synchronous transaction in order to understand whether we need to increment ack_count.

+		if (e->lsn == -1 || e->lsn <= prev_lsn || lsn < e->lsn)
+			return;
+		++limbo->ack_count;

But we can rely on the fact that confirmations, and accordingly ACKs, are sent only for the LSN of a synchronous transaction (AFAICS this is really true now).
Then we could actually replace this check with something like:

+		if (limbo->confirmed_lsn < lsn)
+			return;
+		++limbo->ack_count;

But this approach bothers me precisely because we rely on an additional property.
We will also most likely have problems with rollbacks. Say we have two synchronous transactions in the queue (LSNs 1 and 2).
Initially limbo->ack_count = 0. Some replica sends an ACK for lsn = 2, and limbo->ack_count becomes 1. Now two situations are possible: a rollback to lsn = 1 or a rollback to lsn = 2. In the first case limbo->ack_count should become 0, in the second it should remain 1. And it seems we will not be able to distinguish these two situations unless we know the LSN of the first synchronous transaction.

@Gerold103
Collaborator

And now two situations are possible: rollback to lsn = 1 and rollback to lsn = 2. In the first case, limbo->ack_count should become 0, in the second it should remain 1.

Nope. If you got ACK for LSN 2, you actually got ack for all LSNs <= 2, including 1. So if rollback happens to 1, then ack_count remains 1.

Astronomax added a commit to Astronomax/tarantool that referenced this issue Jul 6, 2024
This patch optimizes the process of collecting ACKs from replicas for
synchronous transactions. Before this patch, this was not done
optimally, resulting in a possible situation where it was necessary to
go through the entire queue again each time when receiving the next ACK
from the replica. This was especially noticeable in the case of a large
number of parallel synchronous requests. In this case, the performance was
about 8000 RPS on average.

Closes tarantool#9917

NO_DOC=performance improvement
NO_TEST=performance improvement
Astronomax added a commit to Astronomax/tarantool that referenced this issue Jul 16, 2024
This patch optimizes the process of collecting ACKs from replicas for
synchronous transactions. Before this patch, collecting confirmations
was slow in some cases. There was a possible situation where it was
necessary to go through the entire limbo again every time the next ACK
was received from the replica. This was especially noticeable in the
case of a large number of parallel synchronous requests.
For example, in the 1mops_write bench with parameters --fibers=6000
--ops=1000000 --transaction=1, performance increases by 13-18 times on
small clusters of 2-4 nodes and 2 times on large clusters of 31 nodes.

Closes tarantool#9917

NO_DOC=performance improvement
NO_TEST=performance improvement
Astronomax added a commit to Astronomax/tarantool that referenced this issue Sep 11, 2024
Two new vclock methods have been added: `vclock_nth_element` and
`vclock_count_ge`.
* `vclock_nth_element` takes n and returns the element that would occupy the
nth position if the vclock components were sorted. This is very useful for
synchronous replication: the lsn of the last confirmed transaction is simply
the result of calling this method with the argument
{vclock_size - replication_synchro_quorum} (provided that
vclock_size >= replication_synchro_quorum; otherwise no transaction has been
confirmed yet).
* `vclock_count_ge` takes an lsn and returns the number of components whose
value is greater than or equal to that lsn. This is useful for determining
how many replicas have already received the transaction with the given lsn.

Part of tarantool#9917

NO_CHANGELOG=Will be added in another commit
NO_DOC=internal
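
As a rough illustration of how these two helpers let the limbo compute the confirm lsn directly from the acked lsns instead of walking the whole transaction queue, here is a minimal self-contained C sketch. Everything in it (`struct toy_vclock`, `TOY_VCLOCK_MAX`, the `toy_*` functions) is made up for the example and is not the real Tarantool `struct vclock` or the exact signatures added by the commit:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Toy limit on the number of replicas; purely for this sketch. */
#define TOY_VCLOCK_MAX 32

struct toy_vclock {
	/* One acked lsn per replica. */
	int64_t lsn[TOY_VCLOCK_MAX];
	int size;
};

static int
cmp_lsn(const void *a, const void *b)
{
	int64_t la = *(const int64_t *)a;
	int64_t lb = *(const int64_t *)b;
	return (la > lb) - (la < lb);
}

/* The lsn that would stand at position n if the components were sorted. */
static int64_t
toy_vclock_nth_element(const struct toy_vclock *vc, int n)
{
	int64_t sorted[TOY_VCLOCK_MAX];
	memcpy(sorted, vc->lsn, vc->size * sizeof(*sorted));
	qsort(sorted, vc->size, sizeof(*sorted), cmp_lsn);
	return sorted[n];
}

/* Number of replicas whose acked lsn is >= the given lsn. */
static int
toy_vclock_count_ge(const struct toy_vclock *vc, int64_t lsn)
{
	int count = 0;
	for (int i = 0; i < vc->size; i++)
		count += vc->lsn[i] >= lsn;
	return count;
}

/*
 * Highest lsn acked by at least `quorum` replicas, i.e. the new confirm
 * lsn, or -1 when the quorum cannot be reached yet.
 */
static int64_t
toy_confirm_lsn(const struct toy_vclock *vc, int quorum)
{
	if (vc->size < quorum)
		return -1;
	return toy_vclock_nth_element(vc, vc->size - quorum);
}
```

For instance, with acked lsns {5, 7, 9} and a quorum of 2, `toy_confirm_lsn()` returns 7: two replicas have acked at least lsn 7, so everything up to 7 can be confirmed. The cost of each ACK is then bounded by the vclock size rather than by the length of the limbo queue.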
@sergepetrenko sergepetrenko added the 2.11 Target is 2.11 and all newer release/master branches label Oct 4, 2024