Speed up synchronous transaction queue processing #9917
I don't understand how an array of LSNs of acks will help. If you mean
It doesn't seem like we need a deque; AFAICS, all we need is a ring buffer. The idea was the following. We want to simplify this loop, which updates the per-entry ack counters (lines 686 to 704 in 9b63ced).
Let's store a ring buffer of LSNs present in the limbo along with their ack counts. We always remove confirmed entries from the beginning of the limbo (line 184 in 9b63ced).
So once we have
@CuriousGeorgiy a ring buffer looks like a good solution in this case. However, the implementation that already exists in tarantool (tarantool/src/lib/salad/fifo.h, line 35 in 9b63ced) has several problems.
There is also code there that moves the entire used block to the beginning if more than 256 pops have been produced since the last memmove on a push operation (tarantool/src/lib/salad/fifo.h, lines 76 to 85 in 9b63ced).
Let's imagine the following pattern of accessing fifo: N push operations, then 256 pops, 1 push, 256 pops, 1 push, 256 pops, 1 push, and so on. It is clear that in this case this will work in O(N^2) asymptotically, not in O(N) as we would like. Not to mention that the existing API does not provide pop from the end. One more problem: you suggest storing lsn + ack_count as an element, while the current implementation in fifo.h only stores void* as an element. Therefore, you would have to allocate memory for each element and store the pointer in the fifo. That is very expensive; I would rather allocate memory in small blocks for several elements at once, but this requires reworking the existing API. Basically, I think it would be a good solution to implement a ring buffer along these lines:
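A minimal sketch of what such a ring buffer could look like, assuming a power-of-two circular array of {lsn, ack_count} pairs with amortized O(1) push and pop at both ends and no per-element allocations. All names below are illustrative, not taken from any actual patch; the struct is expected to be zero-initialized before first use.

#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* One limbo entry as seen by the ack-collection machinery. */
struct ack_ring_entry {
        int64_t lsn;
        int ack_count;
};

struct ack_ring {
        struct ack_ring_entry *buf;
        size_t capacity;        /* Always zero or a power of two. */
        size_t begin;           /* Index of the oldest used slot. */
        size_t count;           /* Number of used slots. */
};

static inline size_t
ack_ring_mask(const struct ack_ring *r, size_t i)
{
        return i & (r->capacity - 1);
}

/* Double the array, unrolling the wrapped region - amortized O(1). */
static int
ack_ring_reserve(struct ack_ring *r)
{
        if (r->count < r->capacity)
                return 0;
        size_t new_capacity = r->capacity == 0 ? 16 : r->capacity * 2;
        struct ack_ring_entry *new_buf =
                malloc(new_capacity * sizeof(*new_buf));
        if (new_buf == NULL)
                return -1;
        for (size_t i = 0; i < r->count; i++)
                new_buf[i] = r->buf[ack_ring_mask(r, r->begin + i)];
        free(r->buf);
        r->buf = new_buf;
        r->capacity = new_capacity;
        r->begin = 0;
        return 0;
}

/* A new limbo entry is appended at the back, no per-element malloc. */
static int
ack_ring_push_back(struct ack_ring *r, int64_t lsn)
{
        if (ack_ring_reserve(r) != 0)
                return -1;
        struct ack_ring_entry *e =
                &r->buf[ack_ring_mask(r, r->begin + r->count)];
        e->lsn = lsn;
        e->ack_count = 0;
        r->count++;
        return 0;
}

/* Confirmed entries leave from the front... */
static void
ack_ring_pop_front(struct ack_ring *r)
{
        assert(r->count > 0);
        r->begin = ack_ring_mask(r, r->begin + 1);
        r->count--;
}

/* ...and rolled back entries leave from the back. */
static void
ack_ring_pop_back(struct ack_ring *r)
{
        assert(r->count > 0);
        r->count--;
}

Unlike fifo.h, nothing is ever memmoved here, so the pop-heavy access pattern described above stays amortized O(1) per operation.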
Asymptotically, this should work no worse than
What about this solution? Let's ack only the first txn. When it gathers quorum, let's calculate the actual max confirmed LSN from the acked LSNs. In total: one increment per ack + max
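For illustration with made-up numbers: suppose five replicas whose acked LSNs are {10, 9, 8, 7, 3}, and replication_synchro_quorum = 3. The third largest acked LSN is 8, so every queued entry with lsn <= 8 has gathered a quorum, and that whole prefix can be confirmed in a single walk.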
@Gerold103 mostly you're right. Let [...]. Binsearch to confirm [...]. Therefore, in practice, any of these approaches may turn out to be better for non-asymptotic reasons (for example, because of how each approach works with the cache).
How about writing some adapter for
I'm sorry, I just forgot that in the binsearch, the internal check will still run through
This is exactly the idea. For any prefix length we walk it strictly once, when it is actually confirmed. Another idea I just had: we could store the replicas' latest LSNs sorted in an rb tree, max R items in it. Then on each ack we bump the node corresponding to that replica (increase its LSN and update the node's position in the tree) and walk Q nodes from max to min to see which LSN has the quorum. On each ack that would be O(log R) for the bump plus O(Q) for the walk. Could also measure in benchmarks. Must be easy to compare.
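As a rough illustration of this idea, not a real implementation: since the number of replicas is capped at VCLOCK_MAX, a small sorted array can stand in for the rb tree in a sketch. "Bumping" a replica's node becomes re-inserting its new LSN, and the quorum LSN is the Q-th largest element. All names here are made up.

#include <assert.h>
#include <stdint.h>

enum { MAX_REPLICAS = 32 };        /* Mirrors the VCLOCK_MAX cap. */

struct ack_tracker {
        int64_t sorted[MAX_REPLICAS];  /* Acked LSNs, ascending. */
        int size;                      /* Number of replicas tracked. */
};

/* An ACK moves one replica's LSN forward: O(R) here, O(log R) in a tree. */
static void
ack_tracker_bump(struct ack_tracker *t, int64_t old_lsn, int64_t new_lsn)
{
        assert(new_lsn > old_lsn);
        int i = 0;
        while (i < t->size && t->sorted[i] != old_lsn)
                i++;
        assert(i < t->size);
        /* Shift smaller neighbours down while the new value walks up. */
        while (i + 1 < t->size && t->sorted[i + 1] < new_lsn) {
                t->sorted[i] = t->sorted[i + 1];
                i++;
        }
        t->sorted[i] = new_lsn;
}

/* The LSN confirmed by a quorum of size q is the q-th largest acked LSN. */
static int64_t
ack_tracker_quorum_lsn(const struct ack_tracker *t, int q)
{
        if (q > t->size)
                return -1;
        return t->sorted[t->size - q];
}

With R capped at 32 the linear shift is a handful of cache-friendly moves, which is exactly the kind of non-asymptotic effect mentioned above.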
I thought about this too. And I also thought that it is unnecessary, because the cluster size is limited to a rather small number.
Yes, I think we need to try both approaches.
This can be fixed with the following simple patch:

diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
index dd5aeec1b..fb383e8e9 100644
--- a/src/box/txn_limbo.c
+++ b/src/box/txn_limbo.c
@@ -53,6 +53,8 @@ txn_limbo_create(struct txn_limbo *limbo)
limbo->promote_greatest_term = 0;
latch_create(&limbo->promote_latch);
limbo->confirmed_lsn = 0;
+ for (int i = 0; i < VCLOCK_MAX; i++)
+ limbo->confirmed[i] = &limbo->queue;
limbo->rollback_count = 0;
limbo->is_in_rollback = false;
limbo->svp_confirmed_lsn = -1;
@@ -182,6 +184,12 @@ txn_limbo_remove(struct txn_limbo *limbo, struct txn_limbo_entry *entry)
{
assert(!rlist_empty(&entry->in_queue));
assert(txn_limbo_first_entry(limbo) == entry);
+ for (int i = 0; i < VCLOCK_MAX; i++) {
+ if (limbo->confirmed[i] == &entry->in_queue) {
+ assert(limbo->confirmed[i]->prev == &limbo->queue);
+ limbo->confirmed[i] = limbo->confirmed[i]->prev;
+ }
+ }
rlist_del_entry(entry, in_queue);
limbo->len--;
}
@@ -193,6 +201,9 @@ txn_limbo_pop(struct txn_limbo *limbo, struct txn_limbo_entry *entry)
assert(txn_limbo_last_entry(limbo) == entry);
assert(entry->is_rollback);
+ for (int i = 0; i < VCLOCK_MAX; i++)
+ if (limbo->confirmed[i] == &entry->in_queue)
+ limbo->confirmed[i] = limbo->confirmed[i]->prev;
rlist_del_entry(entry, in_queue);
limbo->len--;
++limbo->rollback_count;
@@ -681,11 +692,15 @@ txn_limbo_ack(struct txn_limbo *limbo, uint32_t replica_id, int64_t lsn)
if (lsn == prev_lsn)
return;
vclock_follow(&limbo->vclock, replica_id, lsn);
- struct txn_limbo_entry *e;
+ if (limbo->confirmed[replica_id]->next == &limbo->queue)
+ return;
+ struct txn_limbo_entry *e =
+ rlist_entry(limbo->confirmed[replica_id]->next,
+ struct txn_limbo_entry, in_queue);
int64_t confirm_lsn = -1;
- rlist_foreach_entry(e, &limbo->queue, in_queue) {
+ for (; !rlist_entry_is_head(e, &limbo->queue, in_queue); e = rlist_next_entry(e, in_queue)) {
assert(e->ack_count <= VCLOCK_MAX);
- if (e->lsn > lsn)
+ if (e->lsn > lsn || e->lsn == -1)
break;
/*
* Sync transactions need to collect acks. Async
@@ -702,6 +717,7 @@ txn_limbo_ack(struct txn_limbo *limbo, uint32_t replica_id, int64_t lsn)
confirm_lsn = e->lsn;
}
}
+ limbo->confirmed[replica_id] = e->in_queue.prev;
if (confirm_lsn == -1 || confirm_lsn <= limbo->confirmed_lsn)
return;
txn_limbo_write_confirm(limbo, confirm_lsn);
diff --git a/src/box/txn_limbo.h b/src/box/txn_limbo.h
index 36d6a7f69..654d5b14c 100644
--- a/src/box/txn_limbo.h
+++ b/src/box/txn_limbo.h
@@ -168,6 +168,11 @@ struct txn_limbo {
* illegal.
*/
int64_t confirmed_lsn;
+ /**
+ * A pointer to the list node corresponding to the last confirmed entry
+ * for each replica.
+ */
+ struct rlist *confirmed[VCLOCK_MAX];
/**
* Total number of performed rollbacks. It used as a guard
* to do some actions assuming all limbo transactions will

Now the asymptotics of confirming N transactions is O(MAXR * N). This can be optimized to O(Q * N), and then further, as we have already discussed.

a-kuzdnets@a-kuzdnets:~/dev/tarantool$ ./build/src/tarantool perf/lua/1mops_write.lua --nodes=1 --fibers=6000 --ops=1000000 --transaction=1 --warmup=10 --sync
# making 1000000 REPLACE operations,
# 1 operations per txn,
# using 6000 fibers,
# in a replicaset of 1 nodes,
# using HASH index type
# with WAL mode write
#
# promoting
# done
# Warmup... done, lsn: 101123
# master done 1890885 ops in time: 7.198743, cpu: 11.153583
# master average speed 262668 ops/sec
# master peak speed 322021 ops/sec
1mops_master_rps 322021

In the case of 2 nodes, before the patch:

a-kuzdnets@a-kuzdnets:~/dev/tarantool$ perf/lua/1mops_write.lua --nodes=2 --fibers=6000 --ops=1000000 --transaction=1 --warmup=10 --sync
# making 1000000 REPLACE operations,
# 1 operations per txn,
# using 6000 fibers,
# in a replicaset of 2 nodes,
# using HASH index type
# with WAL mode write
#
# starting 1 replicas
# replication 1
# promoting
# done
# Warmup... done, lsn: 101950
# master done 894395 ops in time: 79.955543, cpu: 85.692745
# master average speed 11186 ops/sec
# master peak speed 11719 ops/sec
1mops_master_rps 11719
# replicas done 894395 ops in time: 79.955577, cpu: 85.692777
1mops_replica_rps 11186

After:

a-kuzdnets@a-kuzdnets:~/dev/tarantool$ ./build/src/tarantool perf/lua/1mops_write.lua --nodes=2 --fibers=6000 --ops=1000000 --transaction=1 --warmup=10 --sync
# making 1000000 REPLACE operations,
# 1 operations per txn,
# using 6000 fibers,
# in a replicaset of 2 nodes,
# using HASH index type
# with WAL mode write
#
# starting 1 replicas
# replication 1
# promoting
# done
# Warmup... done, lsn: 104824
# master done 911494 ops in time: 10.871526, cpu: 24.327279
# master average speed 83842 ops/sec
# master peak speed 211651 ops/sec
1mops_master_rps 211651
# replicas done 911494 ops in time: 10.871579, cpu: 24.327330
1mops_replica_rps 83841

Perf report before the patch:

Samples: 82K of event 'cycles:P', Event count (approx.): 64080043609
Children Self Command Shared Object Symbol
+ 66,98% 0,15% tarantool tarantool [.] lj_BC_FUNCC
+ 52,71% 0,00% tarantool tarantool [.] lbox_commit
+ 52,71% 0,01% tarantool tarantool [.] box_txn_commit
+ 52,68% 0,03% tarantool tarantool [.] txn_commit
+ 51,59% 22,45% tarantool tarantool [.] txn_limbo_ack
+ 30,20% 29,17% tarantool tarantool [.] txn_has_flag
+ 17,01% 6,60% tarantool libc.so.6 [.] __memset_evex_unaligned_erms
+ 13,76% 0,00% tarantool [unknown] [.] 0x00000000413883b8
+ 13,63% 0,10% tarantool [kernel.kallsyms] [k] asm_exc_page_fault
+ 11,29% 0,00% tarantool tarantool [.] lbox_fiber_create
+ 11,28% 0,00% tarantool tarantool [.] fiber_create And now it is at the bottom. + 6,93% 0,21% wal tarantool [.] xlog_tx_write_plain
+ 6,92% 0,00% tarantool tarantool [.] slab_put
+ 6,70% 0,16% tarantool [kernel.kallsyms] [k] do_anonymous_page
+ 6,56% 0,26% tarantool tarantool [.] txn_limbo_ack
+ 6,48% 0,11% wal tarantool [.] fio_writevn
+ 6,08% 0,07% wal tarantool [.] fio_batch_write
+ 6,04% 0,12% wal libc.so.6 [.] __GI___writev
+ 1,00% 0,08% tarantool [kernel.kallsyms] [k] __wake_up_common
+ 1,00% 0,29% wal tarantool [.] xrow_header_encode
+ 1,00% 0,10% tarantool [kernel.kallsyms] [k] __mem_cgroup_charge
+ 0,99% 0,13% tarantool tarantool [.] txn_limbo_ack
+ 0,98% 0,01% tarantool tarantool [.] txn_run_wal_write_triggers
+ 0,98% 0,10% tarantool [kernel.kallsyms] [k] ep_poll_callback
+ 0,97% 0,01% tarantool [kernel.kallsyms] [k] schedule_hrtimeout_range
Nice, great findings 🔥! The only drawback that I can foresee is the manipulation of raw rlist pointers: that requires carefully updating the pointers in this array whenever a node is deleted. Also, if you get an ACK for, let's say, N txns and the quorum is Q, you still do N * Q increments and then N times you walk the list of max 32 rlist pointers. While with the approach of incrementing only the first, you would, for a batch of size N, do only N increments and walk the list of max 32 rlist pointers just once, no? Have you tried this?
diff --git a/src/box/txn.c b/src/box/txn.c
index 1257f0305..2eeb85fd5 100644
--- a/src/box/txn.c
+++ b/src/box/txn.c
@@ -733,6 +733,7 @@ txn_complete_fail(struct txn *txn)
assert(in_txn() == txn);
if (txn->limbo_entry != NULL) {
assert(txn_has_flag(txn, TXN_WAIT_SYNC));
+ txn_limbo_on_rollback(&txn_limbo, txn->limbo_entry);
txn_limbo_abort(&txn_limbo, txn->limbo_entry);
txn->limbo_entry = NULL;
}
diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
index dd5aeec1b..123ddc16f 100644
--- a/src/box/txn_limbo.c
+++ b/src/box/txn_limbo.c
@@ -53,6 +53,8 @@ txn_limbo_create(struct txn_limbo *limbo)
limbo->promote_greatest_term = 0;
latch_create(&limbo->promote_latch);
limbo->confirmed_lsn = 0;
+ limbo->confirmed_node = &limbo->queue;
+ limbo->confirmed_node_valid = false;
limbo->rollback_count = 0;
limbo->is_in_rollback = false;
limbo->svp_confirmed_lsn = -1;
@@ -168,7 +170,6 @@ txn_limbo_append(struct txn_limbo *limbo, uint32_t id, struct txn *txn)
}
e->txn = txn;
e->lsn = -1;
- e->ack_count = 0;
e->is_commit = false;
e->is_rollback = false;
e->insertion_time = fiber_clock();
@@ -246,13 +247,7 @@ txn_limbo_assign_local_lsn(struct txn_limbo *limbo,
* replicas. Update the ACK counter to take them into
* account.
*/
- struct vclock_iterator iter;
- vclock_iterator_init(&iter, &limbo->vclock);
- int ack_count = 0;
- vclock_foreach(&iter, vc)
- ack_count += vc.lsn >= lsn;
- assert(ack_count >= entry->ack_count);
- entry->ack_count = ack_count;
+ limbo->confirmed_node_valid = false;
}
void
@@ -268,6 +263,20 @@ txn_limbo_assign_lsn(struct txn_limbo *limbo, struct txn_limbo_entry *entry,
static void
txn_limbo_write_rollback(struct txn_limbo *limbo, int64_t lsn);
+void
+txn_limbo_on_rollback(struct txn_limbo *limbo, struct txn_limbo_entry *last_rollback)
+{
+ assert(txn_has_flag(last_rollback->txn, TXN_WAIT_ACK));
+ if (limbo->confirmed_node == &limbo->queue)
+ return;
+ int64_t confirmed_lsn = rlist_entry(limbo->confirmed_node,
+ struct txn_limbo_entry, in_queue)->lsn;
+ if (last_rollback->lsn <= confirmed_lsn) {
+ limbo->confirmed_node = rlist_prev(&last_rollback->in_queue);
+ limbo->confirmed_node_valid = false;
+ }
+}
+
int
txn_limbo_wait_complete(struct txn_limbo *limbo, struct txn_limbo_entry *entry)
{
@@ -320,6 +329,8 @@ txn_limbo_wait_complete(struct txn_limbo *limbo, struct txn_limbo_entry *entry)
}
txn_limbo_write_rollback(limbo, entry->lsn);
+ txn_limbo_on_rollback(limbo, entry);
+
struct txn_limbo_entry *e, *tmp;
rlist_foreach_entry_safe_reverse(e, &limbo->queue,
in_queue, tmp) {
@@ -444,14 +455,31 @@ txn_limbo_write_confirm(struct txn_limbo *limbo, int64_t lsn)
txn_limbo_write_synchro(limbo, IPROTO_RAFT_CONFIRM, lsn, 0, NULL);
}
+void
+txn_limbo_on_confirm(struct txn_limbo *limbo, struct txn_limbo_entry *last_confirm)
+{
+ if (limbo->confirmed_node == &limbo->queue)
+ return;
+ int64_t confirmed_lsn =
+ rlist_entry(limbo->confirmed_node,
+ struct txn_limbo_entry, in_queue)->lsn;
+ assert(confirmed_lsn != -1);
+ if (last_confirm->lsn >= confirmed_lsn) {
+ limbo->confirmed_node = &limbo->queue;
+ limbo->confirmed_node_valid = false;
+ }
+}
+
/** Confirm all the entries <= @a lsn. */
static void
txn_limbo_read_confirm(struct txn_limbo *limbo, int64_t lsn)
{
assert(limbo->owner_id != REPLICA_ID_NIL || txn_limbo_is_empty(limbo));
assert(limbo == &txn_limbo);
- struct txn_limbo_entry *e, *tmp;
- rlist_foreach_entry_safe(e, &limbo->queue, in_queue, tmp) {
+ assert(lsn != -1);
+
+ struct txn_limbo_entry *last_confirm = NULL, *e;
+ rlist_foreach_entry(e, &limbo->queue, in_queue) {
/*
* Check if it is an async transaction last in the queue. When
* it is last, it does not depend on a not finished sync
@@ -468,7 +496,15 @@ txn_limbo_read_confirm(struct txn_limbo *limbo, int64_t lsn)
*/
if (e->lsn == -1)
break;
- } else if (e->txn->signature == TXN_SIGNATURE_UNKNOWN) {
+ }
+ last_confirm = e;
+ }
+ if (last_confirm == NULL)
+ return;
+ txn_limbo_on_confirm(limbo, last_confirm);
+ struct txn_limbo_entry *tmp;
+ rlist_foreach_entry_safe(e, &limbo->queue, in_queue, tmp) {
+ if (e->txn->signature == TXN_SIGNATURE_UNKNOWN) {
/*
* A transaction might be covered by the CONFIRM even if
* it is not written to WAL yet when it is an async
@@ -509,6 +545,8 @@ txn_limbo_read_confirm(struct txn_limbo *limbo, int64_t lsn)
*/
assert(e->txn->signature >= 0);
txn_limbo_complete(e->txn, true);
+ if (e == last_confirm)
+ break;
}
/*
* Track CONFIRM lsn on replica in order to detect split-brain by
@@ -553,6 +591,7 @@ txn_limbo_read_rollback(struct txn_limbo *limbo, int64_t lsn)
}
if (last_rollback == NULL)
return;
+ txn_limbo_on_rollback(limbo, last_rollback);
rlist_foreach_entry_safe_reverse(e, &limbo->queue, in_queue, tmp) {
txn_limbo_abort(limbo, e);
txn_clear_flags(e->txn, TXN_WAIT_ACK);
@@ -649,6 +688,49 @@ txn_limbo_read_demote(struct txn_limbo *limbo, int64_t lsn)
return txn_limbo_read_promote(limbo, REPLICA_ID_NIL, lsn);
}
+bool
+txn_limbo_confirmed_node_make_valid(struct txn_limbo *limbo)
+{
+ struct txn_limbo_entry *e =
+ rlist_entry(rlist_next(limbo->confirmed_node),
+ struct txn_limbo_entry, in_queue);
+ if (rlist_entry_is_head(e, &limbo->queue, in_queue))
+ return false;
+ if (!limbo->confirmed_node_valid) {
+ for (; !rlist_entry_is_head(e, &limbo->queue, in_queue);
+ e = rlist_next_entry(e, in_queue))
+ if (txn_has_flag(e->txn, TXN_WAIT_ACK))
+ break;
+ limbo->confirmed_node = rlist_prev(&e->in_queue);
+ if (rlist_entry_is_head(e, &limbo->queue, in_queue) || e->lsn == -1)
+ return false;
+ limbo->confirmed_node_valid = true;
+ limbo->ack_count = vclock_count_ge(&limbo->vclock, e->lsn);
+ }
+ return true;
+}
+
+int64_t
+txn_limbo_update_confirmed_node(struct txn_limbo *limbo)
+{
+ assert(limbo->confirmed_node_valid);
+ if (limbo->ack_count < replication_synchro_quorum)
+ return -1;
+ struct txn_limbo_entry *e =
+ rlist_entry(rlist_next(limbo->confirmed_node),
+ struct txn_limbo_entry, in_queue);
+ int32_t k = (int32_t)vclock_size(&limbo->vclock) - replication_synchro_quorum;
+ int64_t confirm_lsn = (k < 0) ? 0 : vclock_kth_stat(&limbo->vclock, k);
+ assert(confirm_lsn != -1);
+ for (; !rlist_entry_is_head(e, &limbo->queue, in_queue);
+ e = rlist_next_entry(e, in_queue))
+ if (e->lsn == -1 || e->lsn > confirm_lsn)
+ break;
+ limbo->confirmed_node = rlist_prev(&e->in_queue);
+ limbo->confirmed_node_valid = false;
+ return confirm_lsn;
+}
+
void
txn_limbo_ack(struct txn_limbo *limbo, uint32_t replica_id, int64_t lsn)
{
@@ -671,6 +753,8 @@ txn_limbo_ack(struct txn_limbo *limbo, uint32_t replica_id, int64_t lsn)
return;
assert(limbo->owner_id != REPLICA_ID_NIL);
int64_t prev_lsn = vclock_get(&limbo->vclock, replica_id);
+
+ assert(lsn >= prev_lsn);
/*
* One of the reasons why can happen - the remote instance is not
* read-only and wrote something under its own insance_id. For qsync
@@ -681,29 +765,26 @@ txn_limbo_ack(struct txn_limbo *limbo, uint32_t replica_id, int64_t lsn)
if (lsn == prev_lsn)
return;
vclock_follow(&limbo->vclock, replica_id, lsn);
- struct txn_limbo_entry *e;
- int64_t confirm_lsn = -1;
- rlist_foreach_entry(e, &limbo->queue, in_queue) {
- assert(e->ack_count <= VCLOCK_MAX);
- if (e->lsn > lsn)
- break;
- /*
- * Sync transactions need to collect acks. Async
- * transactions are automatically committed right
- * after all the previous sync transactions are.
- */
- if (!txn_has_flag(e->txn, TXN_WAIT_ACK)) {
- continue;
- } else if (e->lsn <= prev_lsn) {
- continue;
- } else if (++e->ack_count < replication_synchro_quorum) {
- continue;
- } else {
- confirm_lsn = e->lsn;
- }
+
+ struct txn_limbo_entry *e = rlist_entry(rlist_next(limbo->confirmed_node),
+ struct txn_limbo_entry, in_queue);
+ if (!limbo->confirmed_node_valid) {
+ if (!txn_limbo_confirmed_node_make_valid(limbo))
+ return;
+ e = rlist_entry(rlist_next(limbo->confirmed_node),
+ struct txn_limbo_entry, in_queue);
+ } else {
+ assert(txn_has_flag(e->txn, TXN_WAIT_ACK));
+ if (e->lsn == -1 || e->lsn <= prev_lsn || lsn < e->lsn)
+ return;
+ ++limbo->ack_count;
}
- if (confirm_lsn == -1 || confirm_lsn <= limbo->confirmed_lsn)
+
+ int64_t confirm_lsn = txn_limbo_update_confirmed_node(limbo);
+ if (confirm_lsn == -1)
return;
+ assert(confirm_lsn > limbo->confirmed_lsn);
+
txn_limbo_write_confirm(limbo, confirm_lsn);
txn_limbo_read_confirm(limbo, confirm_lsn);
}
@@ -1245,23 +1326,19 @@ txn_limbo_on_parameters_change(struct txn_limbo *limbo)
{
if (rlist_empty(&limbo->queue) || txn_limbo_is_frozen(limbo))
return;
- struct txn_limbo_entry *e;
- int64_t confirm_lsn = -1;
- rlist_foreach_entry(e, &limbo->queue, in_queue) {
- assert(e->ack_count <= VCLOCK_MAX);
- if (!txn_has_flag(e->txn, TXN_WAIT_ACK)) {
- continue;
- } else if (e->ack_count < replication_synchro_quorum) {
- continue;
- } else {
- confirm_lsn = e->lsn;
- assert(confirm_lsn > 0);
- }
- }
- if (confirm_lsn > limbo->confirmed_lsn && !limbo->is_in_rollback) {
+
+ if (!limbo->confirmed_node_valid && !txn_limbo_confirmed_node_make_valid(limbo))
+ goto broadcast;
+
+ int64_t confirm_lsn = txn_limbo_update_confirmed_node(limbo);
+ if (confirm_lsn != -1 && confirm_lsn > limbo->confirmed_lsn &&
+ !limbo->is_in_rollback)
+ {
txn_limbo_write_confirm(limbo, confirm_lsn);
txn_limbo_read_confirm(limbo, confirm_lsn);
}
+
+ broadcast:
/*
* Wakeup all the others - timed out will rollback. Also
* there can be non-transactional waiters, such as CONFIRM
diff --git a/src/box/txn_limbo.h b/src/box/txn_limbo.h
index 36d6a7f69..419734987 100644
--- a/src/box/txn_limbo.h
+++ b/src/box/txn_limbo.h
@@ -57,11 +57,6 @@ struct txn_limbo_entry {
* written to WAL yet.
*/
int64_t lsn;
- /**
- * Number of ACKs. Or in other words - how many replicas
- * confirmed receipt of the transaction.
- */
- int ack_count;
/**
* Result flags. Only one of them can be true. But both
* can be false if the transaction is still waiting for
@@ -168,6 +163,12 @@ struct txn_limbo {
* illegal.
*/
int64_t confirmed_lsn;
+
+ struct rlist *confirmed_node;
+
+ bool confirmed_node_valid;
+
+ int ack_count;
/**
* Total number of performed rollbacks. It used as a guard
* to do some actions assuming all limbo transactions will
@@ -296,6 +297,9 @@ txn_limbo_replica_confirmed_lsn(const struct txn_limbo *limbo,
struct txn_limbo_entry *
txn_limbo_last_synchro_entry(struct txn_limbo *limbo);
+void
+txn_limbo_on_rollback(struct txn_limbo *limbo, struct txn_limbo_entry *entry);
+
/**
* Allocate, create, and append a new transaction to the limbo.
* The limbo entry is allocated on the transaction's region.
diff --git a/src/lib/vclock/vclock.c b/src/lib/vclock/vclock.c
index 81d3eeb50..d465ccbf4 100644
--- a/src/lib/vclock/vclock.c
+++ b/src/lib/vclock/vclock.c
@@ -175,4 +175,36 @@ vclockset_node_compare(const struct vclock *a, const struct vclock *b)
return res;
}
+int64_t
+vclock_kth_stat(const struct vclock *vclock, uint32_t k)
+{
+ if (k >= vclock_size(vclock))
+ return -1;
+ struct vclock_iterator it1, it2;
+
+ vclock_iterator_init(&it1, vclock);
+ vclock_foreach(&it1, vc1) {
+ uint32_t le = 0, leq = 0;
+ vclock_iterator_init(&it2, vclock);
+ vclock_foreach(&it2, vc2) {
+ le += vc2.lsn < vc1.lsn;
+ leq += vc2.lsn <= vc1.lsn;
+ }
+ if (le <= k && k < leq)
+ return vc1.lsn;
+ }
+ unreachable();
+}
+
+int
+vclock_count_ge(const struct vclock *vclock, int64_t lsn)
+{
+ int count = 0;
+ struct vclock_iterator it;
+ vclock_iterator_init(&it, vclock);
+ vclock_foreach(&it, vc1)
+ count += vc1.lsn >= lsn;
+ return count;
+}
+
rb_gen(, vclockset_, vclockset_t, struct vclock, link, vclockset_node_compare);
diff --git a/src/lib/vclock/vclock.h b/src/lib/vclock/vclock.h
index 85aa011be..3cdb9b0d7 100644
--- a/src/lib/vclock/vclock.h
+++ b/src/lib/vclock/vclock.h
@@ -481,6 +481,12 @@ vclockset_match(vclockset_t *set, const struct vclock *key)
return vclockset_first(set);
}
+int64_t
+vclock_kth_stat(const struct vclock *vclock, uint32_t k);
+
+int
+vclock_count_ge(const struct vclock *vclock, int64_t lsn);
+
#define vclockset_foreach(set, vclock) \
for ((vclock) = vclockset_first(set); \
(vclock) != NULL; \

The results are approximately the same as in the previous approach:

a-kuzdnets@a-kuzdnets:~/dev/tarantool$ ./build/src/tarantool perf/lua/1mops_write.lua --nodes=1 --fibers=6000 --ops=1000000 --transaction=1 --warmup=10 --sync
# making 1000000 REPLACE operations,
# 1 operations per txn,
# using 6000 fibers,
# in a replicaset of 1 nodes,
# using HASH index type
# with WAL mode write
#
# promoting
# done
# Warmup... done, lsn: 101123
# master done 1890885 ops in time: 7.367708, cpu: 11.334037
# master average speed 256644 ops/sec
# master peak speed 320193 ops/sec
1mops_master_rps 320193
a-kuzdnets@a-kuzdnets:~/dev/tarantool$ ./build/src/tarantool perf/lua/1mops_write.lua --nodes=2 --fibers=6000 --ops=1000000 --transaction=1 --warmup=10 --sync
# making 1000000 REPLACE operations,
# 1 operations per txn,
# using 6000 fibers,
# in a replicaset of 2 nodes,
# using HASH index type
# with WAL mode write
#
# starting 1 replicas
# replication 1
# promoting
# done
# Warmup... done, lsn: 101999
# master done 919303 ops in time: 10.811803, cpu: 24.635652
# master average speed 85027 ops/sec
# master peak speed 167633 ops/sec
1mops_master_rps 167633
# replicas done 919303 ops in time: 10.811855, cpu: 24.635701
1mops_replica_rps 85027

But in this test, transactions are confirmed one at a time, so no speedup was expected here compared to the previous results. We need to test a case where one fiber writes many transactions, for example, so that transactions are confirmed in large batches:

a-kuzdnets@a-kuzdnets:~/dev/tarantool$ ./build/src/tarantool perf/lua/1mops_write.lua --nodes=31 --fibers=1 --ops=1200000 --transaction=100000 --warmup=10 --sync
# making 1200000 REPLACE operations,
# 100000 operations per txn,
# using 1 fibers,
# in a replicaset of 31 nodes,
# using HASH index type
# with WAL mode write
#
# starting 30 replicas
# replication 1
# replication 1
# replication 3
# replication 18
# replication 29
# promoting
# done
# Warmup... # master done 100039 ops in time: 5.473254, cpu: 20.492166
# master average speed 18277 ops/sec
# master peak speed 991812 ops/sec
1mops_master_rps 991812
# replicas done 100039 ops in time: 7.093179, cpu: 22.649534
1mops_replica_rps 14103

Current approach:

a-kuzdnets@a-kuzdnets:~/dev/tarantool$ ./build/src/tarantool perf/lua/1mops_write.lua --nodes=31 --fibers=1 --ops=1200000 --transaction=100000 --warmup=10 --sync
# making 1200000 REPLACE operations,
# 100000 operations per txn,
# using 1 fibers,
# in a replicaset of 31 nodes,
# using HASH index type
# with WAL mode write
#
# starting 30 replicas
# replication 1
# replication 1
# replication 1
# replication 22
# promoting
# done
# Warmup... # master done 100039 ops in time: 5.625423, cpu: 18.874609
# master average speed 17783 ops/sec
# master peak speed 1005546 ops/sec
1mops_master_rps 1005546
# replicas done 100039 ops in time: 5.856255, cpu: 19.025775
1mops_replica_rps 17082

We did not get any speedup, apparently because even in the previous approach the mechanism for collecting confirmations took up only a few percent in the perf report. The test is the same as before, transactions are confirmed in large batches:

Children Self Command Shared Object Symbol
+ 1,51% 0,00% tarantool tarantool [.] txn_limbo_read_confirm
+ 0,77% 0,01% tarantool tarantool [.] txn_complete_fail
+ 0,73% 0,00% tarantool tarantool [.] txn_limbo_read_rollback
0,03% 0,00% tarantool tarantool [.] txn_limbo_ack

Current approach:

Children Self Command Shared Object Symbol
+ 1,58% 0,00% tarantool tarantool [.] txn_limbo_read_confirm
+ 0,55% 0,00% tarantool tarantool [.] txn_complete_fail
+ 0,51% 0,00% tarantool tarantool [.] txn_limbo_read_rollback
0,03% 0,00% tarantool tarantool [.] txn_limbo_ack

P.S. All results given for this and the previous approach were obtained in Debug build mode.
Thanks for checking and for working on this 💪🔥!
Are you sure? The code still feels more complex than it should be, tbh. Confirmed entries do not remain in the limbo after confirmation. There aren't even any yields, because the on-commit/rollback triggers can't yield. Or am I missing something? As "previous" you refer to the approach where we store multiple rlist pointers, right? And "current" to the one where you store only one?
Oh. That might be problematic. Debug vs Release can give quite different results. Debug might only make sense for checking the flamegraphs, but for max RPS a Release build is essential.
This is true, but first see lines 1397 to 1405 in 319357d:
In wal_write_async we hand the task over to the "wal" thread and then wait in a loop in which we call fiber_yield until the confirmation entry is written. At this time, we may receive an ACK for some other entry from a replica. In addition, in the mechanism that enforces the wal_queue_max_size limit, there is a transfer of control to other fibers (journal_write -> journal_queue_flush).
Yes.
Ok, I see now. During [...]. It feels like we miss some huge simplification here. Like we shouldn't even need to store [...]. Also, you said:
But I see in your reports this:
New approach:
That is quite a noticeable speed-up, no?
I think this can be done. In implementing the new approach, I already store the ack count and check:

+ if (e->lsn == -1 || e->lsn <= prev_lsn || lsn < e->lsn)
+ return;
+ ++limbo->ack_count;

But we can rely on the fact that confirmations and, accordingly, ACKs are sent only for the lsn of a synchronous transaction (AFAICS it's really true now):

+ if (limbo->confirmed_lsn < lsn)
+ return;
+ ++limbo->ack_count;

But this approach confused me precisely because we rely on some additional property.
Nope. If you got an ACK for LSN 2, you actually got an ack for all LSNs <= 2, including 1. So if a rollback happens to 1, then
This patch optimizes the process of collecting ACKs from replicas for synchronous transactions. Before this patch, collecting confirmations was slow in some cases. There was a possible situation where it was necessary to go through the entire limbo again every time the next ACK was received from the replica. This was especially noticeable in the case of a large number of parallel synchronous requests. For example, in the 1mops_write bench with parameters --fibers=6000 --ops=1000000 --transaction=1, performance increases by 13-18 times on small clusters of 2-4 nodes and 2 times on large clusters of 31 nodes.

Closes #9917

NO_DOC=performance improvement
NO_TEST=performance improvement

Two new vclock methods have been added: `vclock_nth_element` and `vclock_count_ge`.

* `vclock_nth_element` takes n and returns whatever element would occur in the nth position if the vclock were sorted. This method is very useful for synchronous replication because it can be used to find out the lsn of the last confirmed transaction - it's simply the result of calling this method with the argument {vclock_size - replication_synchro_quorum} (provided that vclock_size >= replication_synchro_quorum; otherwise it is obvious that no transaction has been confirmed yet).
* `vclock_count_ge` takes lsn and returns the number of components whose value is greater than or equal to lsn. This can be useful to understand how many replicas have already received a transaction with a given lsn.

Part of #9917

NO_CHANGELOG=Will be added in another commit
NO_DOC=internal
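To make the helper semantics concrete, here is a short hedged sketch of how they are meant to be combined, mirroring txn_limbo_update_confirmed_node from the diff above (vclock_nth_element is the final name of what the diff calls vclock_kth_stat; entry_lsn stands for the LSN of the first unconfirmed entry):

/* Sketch: deriving the confirmable LSN from the limbo vclock. */
int32_t k = (int32_t)vclock_size(&limbo->vclock) -
        replication_synchro_quorum;
/* Fewer vclock components than the quorum: nothing can be confirmed. */
int64_t confirm_lsn = k < 0 ? 0 : vclock_nth_element(&limbo->vclock, k);
/* Number of replicas that already received the entry. */
int ack_count = vclock_count_ge(&limbo->vclock, entry_lsn);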
Performance of synchronous replication degrades too much with the increase of parallel synchronous requests.
For example, on my machine with 6000 requests in parallel (box.info.synchro.queue.len reads 6000) the master was only able to pull about 8000 RPS, and messages like this were frequent in the log:

The issue manifested itself even with replication_synchro_quorum = 1, meaning it's not related to network delay; the size of the quorum didn't influence the results much either. It seems the problem lies in the way synchronous transactions are processed in the queue. Besides, when trying the same 6000 concurrent requests to replace something in an async space, the RPS was as high as 300k, meaning the issue isn't related to batch finalization of transactions.

Most likely the cause of the degradation is the way our txn_limbo_ack traverses the whole transaction list. In the example above txn_limbo_ack is always called with the LSN of the last of the 6000 transactions, but it still traverses the whole list and assigns ack_count separately to each transaction. We might improve this: persist an array of LSNs of acks, and once the ack LSN is increased, find the point up to which everything should be committed via binary search, for example.