relay: buffer sends to save on syscalls · Issue #10161 · tarantool/tarantool · GitHub

relay: buffer sends to save on syscalls #10161

Closed
sergepetrenko opened this issue Jun 24, 2024 · 0 comments · Fixed by #9922
Assignees: sergepetrenko
Labels: feature (A new functionality)

Comments

@sergepetrenko
Collaborator

Currently there is one write() per row in relay. This needs to be optimized to save on syscalls.

This is a reincarnation of an old ticket created by @kostja.
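
For context, the difference boils down to replacing one `write()` per row with a single vectored send per batch. A hypothetical sketch (not the actual relay code):

```c
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical illustration of the syscall cost being discussed. */

/* One syscall per row: n rows cost n write() calls. */
static void
send_rows_one_by_one(int fd, const struct iovec *rows, int n)
{
        for (int i = 0; i < n; i++)
                (void)write(fd, rows[i].iov_base, rows[i].iov_len);
}

/* One syscall per batch: n rows cost a single writev() (n <= IOV_MAX). */
static void
send_rows_batched(int fd, const struct iovec *rows, int n)
{
        (void)writev(fd, rows, n);
}
```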

sergepetrenko added the feature (A new functionality) label Jun 24, 2024
sergepetrenko self-assigned this Jun 24, 2024
sergepetrenko added a commit to sergepetrenko/small that referenced this issue Jun 24, 2024
lsregion is a good candidate for an output buffer, maybe even better
than obuf:

Obuf may be freed / flushed only completely, meaning you need two obufs
to be able to simultaneously flush one of them and write to the other.

At the same time, lsregion may be gradually freed up to the last flushed
position without blocking the incoming writes.

In order to use lsregion as an output buffer, introduce an
lsregion_to_iovec method, which simply enumerates all the allocated
slabs and their sizes and saves their contents to the provided iovec
array.

Since the last flushed slab may still be used by further allocations
(which may happen while the flushed data is being written), introduce an
lsregion_svp helper struct, which tracks the last flushed position.

The last flushed slab is determined by its flush_id: the maximum id used
for allocations in this slab at the moment of the flush.

Compared to obuf (which stores data in a set of at most 32 progressively
growing iovecs), it's theoretically possible to overflow IOV_MAX (1024)
with lsregion and thus require multiple writev() calls to flush it.
However, given that lsregion is usually created with the runtime slab
arena, which allocates 4-megabyte slabs to store the data, this is highly
unlikely: it would require allocating 4 gigabytes of data per flush.

Needed for tarantool/tarantool#10161
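
To make the intended flush path concrete, here is a minimal sketch of draining an lsregion-backed output buffer; the `lsregion_to_iovec()` prototype and the `lsregion_svp` usage below are assumptions for illustration, not the actual small API:

```c
#include <limits.h>
#include <sys/uio.h>

struct lsregion;
struct lsregion_svp;

/* Assumed prototype: fill iov[] with the slabs allocated since the last
 * flush (tracked by *svp) and return how many entries were filled. */
int
lsregion_to_iovec(const struct lsregion *lsr, struct iovec *iov, int max_iov,
                  struct lsregion_svp *svp);

/* Flush everything buffered since the previous flush with one writev(),
 * while new allocations may keep landing in the last (still open) slab. */
static int
flush_lsregion(int fd, struct lsregion *lsr, struct lsregion_svp *svp)
{
        struct iovec iov[IOV_MAX];
        int cnt = lsregion_to_iovec(lsr, iov, IOV_MAX, svp);
        if (cnt == 0)
                return 0;
        return writev(fd, iov, cnt) < 0 ? -1 : 0;
}
```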
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jun 24, 2024
Relay sends rows one by one, issuing one `write()` system call per sent
row. This hurts performance quite a lot: in the 1mops_write test [1] on
my machine the master node is able to reach ~1.9 million RPS, while
replication performance peaks at ~0.5 million RPS.

As shown by a PoC patch, batching the rows removes the replication
bottleneck almost completely. Even sending 2 rows per write() call
already doubles the performance, and sending 64 rows per write() call
allows the replica to catch up with the master's speed [2].

This patch makes the relay buffer rows until the total buffer size
reaches a certain threshold, and then flush everything with a single
writev() call.

The flush threshold is configurable via a new tweak,
xrow_buf_flush_size, which has a default value of 16 KB.

A size threshold is chosen over a simple row count because it adapts
better to different row sizes: with multi-kilobyte tuples, batching
hundreds of them might overflow the socket's send buffer, forcing an
extra wait until the send buffer becomes free again. That time could
have been spent enqueuing more rows, and the problem only gets worse
for hundred-kilobyte tuples and so on.

At the same time, the user might mix workload types, one with large
tuples and another with relatively small ones, in which case batching
by size rather than by row count gives more predictable performance.

[1] https://github.com/tarantool/tarantool/blob/master/perf/lua/1mops_write.lua
[2] https://www.notion.so/tarantool/relay-248de4a9600f4c4d8d83e81cf1104926

Closes tarantool#10161

NO_DOC=not a documentable feature
NO_TEST=covered by existing tests
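
A simplified sketch of the flush-by-size batching described above; the names, the fixed iovec array, and the flow are illustrative only (the real patch accumulates encoded xrows in a buffer and flushes via `writev()`):

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Illustrative sketch of flush-by-size batching; not the real relay code. */
enum {
        XROW_BUF_FLUSH_SIZE = 16 * 1024, /* default flush threshold, 16 KB */
        RELAY_BUF_IOV_MAX = 64,
};

struct relay_buf {
        struct iovec iov[RELAY_BUF_IOV_MAX];
        int iov_cnt;
        size_t size;
};

static int
relay_buf_flush(int fd, struct relay_buf *buf)
{
        if (buf->iov_cnt == 0)
                return 0;
        /* One syscall per batch instead of one per row. */
        ssize_t rc = writev(fd, buf->iov, buf->iov_cnt);
        buf->iov_cnt = 0;
        buf->size = 0;
        return rc < 0 ? -1 : 0;
}

static int
relay_buf_send_row(int fd, struct relay_buf *buf, void *row, size_t len)
{
        buf->iov[buf->iov_cnt] = (struct iovec){ .iov_base = row, .iov_len = len };
        buf->iov_cnt++;
        buf->size += len;
        /* Flush once the accumulated size crosses the threshold: a plain row
         * count would behave badly for very large or very small tuples. */
        if (buf->size >= XROW_BUF_FLUSH_SIZE || buf->iov_cnt == RELAY_BUF_IOV_MAX)
                return relay_buf_flush(fd, buf);
        return 0;
}
```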
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jun 24, 2024
sergepetrenko added a commit to sergepetrenko/small that referenced this issue Jun 25, 2024
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jun 25, 2024
sergepetrenko added a commit to sergepetrenko/small that referenced this issue Jun 28, 2024
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jun 28, 2024
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 2, 2024
sergepetrenko added a commit to sergepetrenko/small that referenced this issue Jul 2, 2024
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 2, 2024
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 2, 2024
Co-authored-by: Elena Shebunyaeva <elena.shebunyaeva@gmail.com>
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 2, 2024
`xrow_header_encode` performs multiple tasks at once:
 * allocates memory for the row header (but not the body)
 * encodes the header contents into the allocated buffer
 * copies the body iovecs (if any) to the out iovec

Let's rename it to `xrow_encode`, and do the same for
`xrow_header_decode` for consistency.

Factor out the header encoding code into an `xrow_header_encode` function,
which only encodes the header into the given buffer and is not concerned
with the other tasks.

Introduce `xrow_header_sizeof`, which returns the real size this header
will occupy once encoded.

In-scope-of tarantool#10161

NO_CHANGELOG=refactoring
NO_TEST=refactoring
NO_DOC=refactoring
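
Roughly, the split separates "how big is the encoded header" and "encode the header into this buffer" from the old all-in-one helper. The prototypes and the caller below are assumptions for illustration and do not reproduce the real Tarantool signatures:

```c
#include <stddef.h>

struct xrow_header;

/* Assumed prototypes illustrating the separation of concerns. */
size_t
xrow_header_sizeof(const struct xrow_header *row);   /* encoded header size */
char *
xrow_header_encode(const struct xrow_header *row, char *pos); /* header only */

/* With the split, a caller that manages its own output buffer can reserve
 * exactly the needed space and encode in place, instead of having the old
 * all-in-one function allocate for it. */
static char *
encode_header_into(const struct xrow_header *row, char *pos, char *end)
{
        if (xrow_header_sizeof(row) > (size_t)(end - pos))
                return NULL; /* not enough room: let the caller flush and retry */
        return xrow_header_encode(row, pos);
}
```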
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 2, 2024
sergepetrenko added a commit to tarantool/small that referenced this issue Jul 5, 2024
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 5, 2024
New commits:

* rlist: make its methods accept const arguments
* lsregion: introduce lsregion_to_iovec method

Prerequisite tarantool#10161

NO_CHANGELOG=submodule bump
NO_TEST=submodule bump
NO_DOC=submodule bump
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 5, 2024
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 5, 2024
In scope of tarantool#10161

NO_CHANGELOG=internal
NO_TEST=internal
NO_DOC=internal
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 5, 2024
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 9, 2024
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 9, 2024
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 9, 2024
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 9, 2024
sergepetrenko added a commit to sergepetrenko/small that referenced this issue Jul 9, 2024
lsregion_reserve() was never used separately from lsregion_alloc() in
our code base, so a separate lsregion_reserve() implementation for the
ASAN build wasn't needed.

This is going to change, so let's implement the missing method.

Also re-enable the basic reserve tests for the ASAN build.

Needed for tarantool/tarantool#10161
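
For reference, the reserve-then-alloc pattern that now also has to work under ASAN looks roughly like this; the prototypes follow the lsregion API as described above, but treat the exact signatures as assumptions:

```c
#include <stdint.h>
#include <string.h>

struct lsregion;

/* Assumed prototypes of the lsregion allocator. */
void *lsregion_reserve(struct lsregion *lsr, size_t size);
void *lsregion_alloc(struct lsregion *lsr, size_t size, int64_t id);

/* Reserve a worst-case amount, write into it, then commit only the bytes
 * actually used.  Until now this pattern never appeared on its own, so the
 * ASAN build had no standalone lsregion_reserve(). */
static void *
append_with_reserve(struct lsregion *lsr, const void *data, size_t max_size,
                    size_t used_size, int64_t id)
{
        void *p = lsregion_reserve(lsr, max_size);
        if (p == NULL)
                return NULL;
        memcpy(p, data, used_size);
        return lsregion_alloc(lsr, used_size, id);
}
```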
sergepetrenko added a commit to sergepetrenko/small that referenced this issue Jul 9, 2024
sergepetrenko added a commit to sergepetrenko/small that referenced this issue Jul 18, 2024
sergepetrenko added a commit to sergepetrenko/small that referenced this issue Jul 19, 2024
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 19, 2024
New commits:

* lsregion: implement lsregion_reserve for asan build

Prerequisite tarantool#10161

NO_CHANGELOG=submodule bump
NO_TEST=submodule bump
NO_DOC=submodule bump
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 19, 2024
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 19, 2024
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 19, 2024
sergepetrenko added a commit to sergepetrenko/small that referenced this issue Jul 23, 2024
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 23, 2024
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 23, 2024
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 23, 2024
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 23, 2024
sergepetrenko added a commit to tarantool/small that referenced this issue Jul 23, 2024
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 23, 2024
New commits:
* lsregion: implement lsregion_reserve for asan build
* matras: introduce `matras_needs_touch` and `matras_touch_no_check`
* region: fix memleak in ASAN version
* test: fix memory leaks reported by LSAN

Prerequisite tarantool#10161

NO_CHANGELOG=submodule bump
NO_TEST=submodule bump
NO_DOC=submodule bump
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 23, 2024
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 23, 2024
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 23, 2024
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 23, 2024
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 23, 2024
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 23, 2024
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 23, 2024
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 24, 2024
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 24, 2024
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 24, 2024
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 24, 2024
sergepetrenko added a commit that referenced this issue Jul 24, 2024
sergepetrenko added a commit that referenced this issue Jul 24, 2024
sergepetrenko added a commit that referenced this issue Jul 24, 2024
sergepetrenko added a commit that referenced this issue Jul 24, 2024
locker pushed a commit that referenced this issue Sep 9, 2024
New commits:
* test: fix memory leaks reported by LSAN
* region: fix memleak in ASAN version
* matras: introduce `matras_needs_touch` and `matras_touch_no_check`
* lsregion: implement lsregion_reserve for asan build

Prerequisite #10161

NO_CHANGELOG=submodule bump
NO_TEST=submodule bump
NO_DOC=submodule bump

(cherry picked from commit c191a1b)
CuriousGeorgiy added a commit to CuriousGeorgiy/tarantool that referenced this issue Sep 22, 2024
To overcome the throughput limitation of synchronous transactions, let's
allow committing them using asynchronous wait modes (2c66624). Such
transactions will have the same consistency guarantees with the
`read-confirmed` and `linearizable` isolation levels. Changes made this
way are observable with the `read-committed` isolation level.

Currently, we return an error when the synchronous queue is full for the
only supported synchronous transaction commit mode, `complete`. However,
following the journal queue waiting logic, let's return an error only for
the `none` commit mode. Otherwise, we will wait for space in the
synchronous queue.

Closes tarantool#10583

@TarantoolBot document
Title: Asynchronous commit modes for synchronous transactions
Product: Tarantool
Since: 3.3
Root document: https://www.tarantool.io/en/doc/latest/reference/reference_lua/box_txn_management/commit/
and https://www.tarantool.io/ru/doc/latest/reference/reference_lua/box_txn_management/atomic/
and https://www.tarantool.io/en/doc/latest/platform/replication/repl_sync/

lsregion: implement lsregion_reserve for asan build

lsregion_reserve() was never used separately from lsregion_alloc in
our code base, so a separate lsregion_reserve() implementation for asan
wasn't needed.

This is going to change, so let's implement the missing method.

Also reenable basic reserve tests for asan build.

Needed for tarantool#10161