relay: buffer sends to save on syscalls · Issue #10161 · tarantool/tarantool · GitHub

relay: buffer sends to save on syscalls #10161

Closed
sergepetrenko opened this issue Jun 24, 2024 · 0 comments · Fixed by #9922
Assignees: sergepetrenko
Labels: feature (A new functionality)

Comments

@sergepetrenko
Collaborator

Currently there is one write() per row in relay. This needs to be optimized to save on syscalls.

This is a reincarnation of an old ticket created by @kostja.
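
For context, the difference boils down to replacing one `write()` per row with a single vectored send per batch. A hypothetical sketch (not the actual relay code):

```c
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical illustration of the syscall cost being discussed. */

/* One syscall per row: n rows cost n write() calls. */
static void
send_rows_one_by_one(int fd, const struct iovec *rows, int n)
{
        for (int i = 0; i < n; i++)
                (void)write(fd, rows[i].iov_base, rows[i].iov_len);
}

/* One syscall per batch: n rows cost a single writev() (n <= IOV_MAX). */
static void
send_rows_batched(int fd, const struct iovec *rows, int n)
{
        (void)writev(fd, rows, n);
}
```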

sergepetrenko added the feature (A new functionality) label Jun 24, 2024
sergepetrenko self-assigned this Jun 24, 2024
sergepetrenko added a commit to sergepetrenko/small that referenced this issue Jun 24, 2024
lsregion is a good candidate for an output buffer, maybe even better
than obuf:

Obuf may be freed / flushed only completely, meaning you need two obufs
to be able to simultaneously flush one of them and write to the other.

At the same time, lsregion may be gradually freed up to the last flushed
position without blocking the incoming writes.

In order to use lsregion as an output buffer, introduce an
lsregion_to_iovec method, which simply enumerates all the allocated
slabs and their sizes and saves their contents to the provided iovec
array.

Since the last flushed slab may still be used by further allocations
(which may happen while the flushed data is being written), introduce an
lsregion_svp helper struct, which tracks the last flushed position.

The last flushed slab is determined by its flush_id: the maximum id used
for allocations in this slab at the moment of the flush.

Compared to obuf (which stores data in a set of at most 32 progressively
growing iovecs), it's theoretically possible to overflow IOV_MAX (1024)
with lsregion and thus require multiple writev() calls to flush it.
However, given that lsregion is usually created with the runtime slab
arena, which allocates 4-megabyte slabs to store the data, this is highly
unlikely: it would require allocating 4 gigabytes of data per flush.

Needed for tarantool/tarantool#10161
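
To make the intended flush path concrete, here is a minimal sketch of draining an lsregion-backed output buffer; the `lsregion_to_iovec()` prototype and the `lsregion_svp` usage below are assumptions for illustration, not the actual small API:

```c
#include <limits.h>
#include <sys/uio.h>

struct lsregion;
struct lsregion_svp;

/* Assumed prototype: fill iov[] with the slabs allocated since the last
 * flush (tracked by *svp) and return how many entries were filled. */
int
lsregion_to_iovec(const struct lsregion *lsr, struct iovec *iov, int max_iov,
                  struct lsregion_svp *svp);

/* Flush everything buffered since the previous flush with one writev(),
 * while new allocations may keep landing in the last (still open) slab. */
static int
flush_lsregion(int fd, struct lsregion *lsr, struct lsregion_svp *svp)
{
        struct iovec iov[IOV_MAX];
        int cnt = lsregion_to_iovec(lsr, iov, IOV_MAX, svp);
        if (cnt == 0)
                return 0;
        return writev(fd, iov, cnt) < 0 ? -1 : 0;
}
```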
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jun 24, 2024
Relay sends rows one by one, issuing one `write()` system call per sent
row. This hurts performance quite a lot: in the 1mops_write test [1] on
my machine the master node is able to reach ~1.9 million RPS, while
replication performance peaks at ~0.5 million RPS.

As shown by a PoC patch, batching the rows removes the replication
bottleneck almost completely. Even sending 2 rows per write() call
already doubles the performance, and sending 64 rows per write() call
allows the replica to catch up with the master's speed [2].

This patch makes the relay buffer rows until the total buffer size
reaches a certain threshold, and then flush everything with a single
writev() call.

The flush threshold is configurable via a new tweak,
xrow_buf_flush_size, which has a default value of 16 KB.

A size threshold is chosen over a simple row count because it adapts
better to different row sizes: with multi-kilobyte tuples, batching
hundreds of them might overflow the socket's send buffer, forcing an
extra wait until the send buffer becomes free again. That time could
have been spent enqueuing more rows, and the problem only gets worse
for hundred-kilobyte tuples and so on.

At the same time, the user might mix workload types, one with large
tuples and another with relatively small ones, in which case batching
by size rather than by row count gives more predictable performance.

[1] https://github.com/tarantool/tarantool/blob/master/perf/lua/1mops_write.lua
[2] https://www.notion.so/tarantool/relay-248de4a9600f4c4d8d83e81cf1104926

Closes tarantool#10161

NO_DOC=not a documentable feature
NO_TEST=covered by existing tests
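
A simplified sketch of the flush-by-size batching described above; the names, the fixed iovec array, and the flow are illustrative only (the real patch accumulates encoded xrows in a buffer and flushes via `writev()`):

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Illustrative sketch of flush-by-size batching; not the real relay code. */
enum {
        XROW_BUF_FLUSH_SIZE = 16 * 1024, /* default flush threshold, 16 KB */
        RELAY_BUF_IOV_MAX = 64,
};

struct relay_buf {
        struct iovec iov[RELAY_BUF_IOV_MAX];
        int iov_cnt;
        size_t size;
};

static int
relay_buf_flush(int fd, struct relay_buf *buf)
{
        if (buf->iov_cnt == 0)
                return 0;
        /* One syscall per batch instead of one per row. */
        ssize_t rc = writev(fd, buf->iov, buf->iov_cnt);
        buf->iov_cnt = 0;
        buf->size = 0;
        return rc < 0 ? -1 : 0;
}

static int
relay_buf_send_row(int fd, struct relay_buf *buf, void *row, size_t len)
{
        buf->iov[buf->iov_cnt] = (struct iovec){ .iov_base = row, .iov_len = len };
        buf->iov_cnt++;
        buf->size += len;
        /* Flush once the accumulated size crosses the threshold: a plain row
         * count would behave badly for very large or very small tuples. */
        if (buf->size >= XROW_BUF_FLUSH_SIZE || buf->iov_cnt == RELAY_BUF_IOV_MAX)
                return relay_buf_flush(fd, buf);
        return 0;
}
```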
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jun 24, 2024
sergepetrenko added a commit to sergepetrenko/small that referenced this issue Jun 25, 2024
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jun 25, 2024
sergepetrenko added a commit to sergepetrenko/small that referenced this issue Jun 28, 2024
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jun 28, 2024
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 2, 2024
sergepetrenko added a commit to sergepetrenko/small that referenced this issue Jul 2, 2024
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 2, 2024
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 2, 2024
Co-authored-by: Elena Shebunyaeva <elena.shebunyaeva@gmail.com>
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 2, 2024
`xrow_header_encode` performs multiple tasks at once:
 * allocates memory for the row header (but not the body)
 * encodes the header contents into the allocated buffer
 * copies the body iovecs (if any) to the out iovec

Let's rename it to `xrow_encode`, and do the same for
`xrow_header_decode` for consistency.

Factor out the header encoding code into an `xrow_header_encode` function,
which only encodes the header into the given buffer and is not concerned
with the other tasks.

Introduce `xrow_header_sizeof`, which returns the real size this header
will occupy once encoded.

In-scope-of tarantool#10161

NO_CHANGELOG=refactoring
NO_TEST=refactoring
NO_DOC=refactoring
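
Roughly, the split separates "how big is the encoded header" and "encode the header into this buffer" from the old all-in-one helper. The prototypes and the caller below are assumptions for illustration and do not reproduce the real Tarantool signatures:

```c
#include <stddef.h>

struct xrow_header;

/* Assumed prototypes illustrating the separation of concerns. */
size_t
xrow_header_sizeof(const struct xrow_header *row);   /* encoded header size */
char *
xrow_header_encode(const struct xrow_header *row, char *pos); /* header only */

/* With the split, a caller that manages its own output buffer can reserve
 * exactly the needed space and encode in place, instead of having the old
 * all-in-one function allocate for it. */
static char *
encode_header_into(const struct xrow_header *row, char *pos, char *end)
{
        if (xrow_header_sizeof(row) > (size_t)(end - pos))
                return NULL; /* not enough room: let the caller flush and retry */
        return xrow_header_encode(row, pos);
}
```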
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 2, 2024
sergepetrenko added a commit to tarantool/small that referenced this issue Jul 5, 2024
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 5, 2024
New commits:

* rlist: make its methods accept const arguments
* lsregion: introduce lsregion_to_iovec method

Prerequisite tarantool#10161

NO_CHANGELOG=submodule bump
NO_TEST=submodule bump
NO_DOC=submodule bump
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 5, 2024
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 5, 2024
In scope of tarantool#10161

NO_CHANGELOG=internal
NO_TEST=internal
NO_DOC=internal
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 5, 2024
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 9, 2024
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 9, 2024
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 9, 2024
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 9, 2024
sergepetrenko added a commit to sergepetrenko/small that referenced this issue Jul 9, 2024
lsregion_reserve() was never used separately from lsregion_alloc() in
our code base, so a separate lsregion_reserve() implementation for the
ASAN build wasn't needed.

This is going to change, so let's implement the missing method.

Also re-enable the basic reserve tests for the ASAN build.

Needed for tarantool/tarantool#10161
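
For reference, the reserve-then-alloc pattern that now also has to work under ASAN looks roughly like this; the prototypes follow the lsregion API as described above, but treat the exact signatures as assumptions:

```c
#include <stdint.h>
#include <string.h>

struct lsregion;

/* Assumed prototypes of the lsregion allocator. */
void *lsregion_reserve(struct lsregion *lsr, size_t size);
void *lsregion_alloc(struct lsregion *lsr, size_t size, int64_t id);

/* Reserve a worst-case amount, write into it, then commit only the bytes
 * actually used.  Until now this pattern never appeared on its own, so the
 * ASAN build had no standalone lsregion_reserve(). */
static void *
append_with_reserve(struct lsregion *lsr, const void *data, size_t max_size,
                    size_t used_size, int64_t id)
{
        void *p = lsregion_reserve(lsr, max_size);
        if (p == NULL)
                return NULL;
        memcpy(p, data, used_size);
        return lsregion_alloc(lsr, used_size, id);
}
```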
sergepetrenko added a commit to sergepetrenko/small that referenced this issue Jul 9, 2024
sergepetrenko added a commit to sergepetrenko/small that referenced this issue Jul 18, 2024
sergepetrenko added a commit to sergepetrenko/small that referenced this issue Jul 19, 2024
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 19, 2024
New commits:

* lsregion: implement lsregion_reserve for asan build

Prerequisite tarantool#10161

NO_CHANGELOG=submodule bump
NO_TEST=submodule bump
NO_DOC=submodule bump
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 19, 2024
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 19, 2024
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 19, 2024
sergepetrenko added a commit to sergepetrenko/small that referenced this issue Jul 23, 2024
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 23, 2024
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 23, 2024
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 23, 2024
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 23, 2024
sergepetrenko added a commit to tarantool/small that referenced this issue Jul 23, 2024
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 23, 2024
New commits:
* lsregion: implement lsregion_reserve for asan build
* matras: introduce `matras_needs_touch` and `matras_touch_no_check`
* region: fix memleak in ASAN version
* test: fix memory leaks reported by LSAN

Prerequisite tarantool#10161

NO_CHANGELOG=submodule bump
NO_TEST=submodule bump
NO_DOC=submodule bump
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 23, 2024
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 23, 2024
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 23, 2024
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 23, 2024
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 23, 2024
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 23, 2024
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 23, 2024
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 24, 2024
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 24, 2024
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 24, 2024
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 24, 2024
sergepetrenko added a commit that referenced this issue Jul 24, 2024
sergepetrenko added a commit that referenced this issue Jul 24, 2024
sergepetrenko added a commit that referenced this issue Jul 24, 2024
sergepetrenko added a commit that referenced this issue Jul 24, 2024
locker pushed a commit that referenced this issue Sep 9, 2024
New commits:
* test: fix memory leaks reported by LSAN
* region: fix memleak in ASAN version
* matras: introduce `matras_needs_touch` and `matras_touch_no_check`
* lsregion: implement lsregion_reserve for asan build

Prerequisite #10161

NO_CHANGELOG=submodule bump
NO_TEST=submodule bump
NO_DOC=submodule bump

(cherry picked from commit c191a1b)
CuriousGeorgiy added a commit to CuriousGeorgiy/tarantool that referenced this issue Sep 22, 2024
To overcome the throughput limitation of synchronous transactions, let's
allow committing them using asynchronous wait modes (2c66624). Such
transactions will have the same consistency guarantees with the
`read-confirmed` and `linearizable` isolation levels. Changes made this
way are observable with the `read-committed` isolation level.

Currently, we return an error when the synchronous queue is full for the
only supported synchronous transaction commit mode, `complete`. However,
following the journal queue waiting logic, let's return an error only for
the `none` commit mode. Otherwise, we will wait for space in the
synchronous queue.

Closes tarantool#10583

@TarantoolBot document
Title: Asynchronous commit modes for synchronous transactions
Product: Tarantool
Since: 3.3
Root document: https://www.tarantool.io/en/doc/latest/reference/reference_lua/box_txn_management/commit/
and https://www.tarantool.io/ru/doc/latest/reference/reference_lua/box_txn_management/atomic/
and https://www.tarantool.io/en/doc/latest/platform/replication/repl_sync/

lsregion: implement lsregion_reserve for asan build

lsregion_reserve() was never used separately from lsregion_alloc in
our code base, so a separate lsregion_reserve() implementation for asan
wasn't needed.

This is going to change, so let's implement the missing method.

Also reenable basic reserve tests for asan build.

Needed for tarantool#10161