relay: buffer sends to save on syscalls #10161
Labels: feature (A new functionality)
Comments
sergepetrenko
added a commit
to sergepetrenko/small
that referenced
this issue
Jun 24, 2024
lsregion is a good candidate for an output buffer, maybe even better than obuf: obuf may be freed / flushed only completely, meaning you need two obufs to be able to simultaneously flush one of them and write to the other. At the same time, lsregion may be gradually freed up to the last flushed position without blocking the incoming writes.

In order to use lsregion as an output buffer, introduce the lsregion_to_iovec method, which simply enumerates all the allocated slabs and their sizes and saves their contents to the provided iovec array. Since the last flushed slab may still be used by further allocations (which may happen while the flushed data is being written), introduce an lsregion_svp helper struct, which tracks the last flushed position. The last flushed slab is determined by flush_id, the max_id used for allocations in this slab at the moment of flush.

Compared to obuf (which stores data in a set of <= 32 progressively growing iovecs), it's theoretically possible to overflow IOV_MAX (1024) when using lsregion and thus require multiple writev calls to flush it, but given that lsregion is usually created with the runtime slab arena, which allocates 4-megabyte slabs to store the data, this is highly unlikely: it would require allocating 4 gigabytes of data per flush.

Needed for tarantool/tarantool#10161
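A rough sketch of how such a flush could look, assuming a hypothetical signature for lsregion_to_iovec() and a flush_id field in struct lsregion_svp (only lsregion_gc(), writev() and IOV_MAX are known APIs here; partial-write handling is omitted):

```c
#include <limits.h>          /* IOV_MAX */
#include <sys/uio.h>         /* struct iovec, writev() */
#include "small/lsregion.h"  /* struct lsregion, lsregion_gc() */

/* Flush everything accumulated since the last savepoint in one syscall. */
static int
out_buf_flush(int fd, struct lsregion *buf, struct lsregion_svp *svp)
{
	struct iovec iov[IOV_MAX];
	/* Enumerate the unflushed slabs into the iovec array (assumed API). */
	int iovcnt = lsregion_to_iovec(buf, iov, IOV_MAX, svp);
	if (iovcnt == 0)
		return 0;
	if (writev(fd, iov, iovcnt) < 0)
		return -1;
	/*
	 * Slabs fully covered by the flushed position may be freed while
	 * writers keep appending to newer slabs.
	 */
	lsregion_gc(buf, svp->flush_id);
	return 0;
}
```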
sergepetrenko
added a commit
to sergepetrenko/tarantool
that referenced
this issue
Jun 24, 2024
Relay sends rows one by one, issuing one `write()` system call per sent row. This hurts performance quite a lot: in the 1mops_write test [1] on my machine the master node is able to reach ~1.9 million RPS while replication performance peaks at ~0.5 million RPS. As shown by a PoC patch, batching the rows removes the replication bottleneck almost completely: even sending 2 rows per write() call already doubles the performance, and sending 64 rows per write() call allows the replica to catch up with the master's speed [2].

This patch makes relay buffer rows until the total buffer size reaches a certain threshold, and then flush everything with a single writev() call. The flush threshold is configurable via a new tweak, xrow_buf_flush_size, which has a default value of 16 KB.

A size threshold is chosen over a simple row count because it's more flexible for different row sizes: if the user has multi-kilobyte tuples, batching hundreds of such tuples might overflow the socket's send buffer, resulting in an extra wait until the send buffer becomes free again. This extra time might've been used for enqueuing more rows. The problem would become even worse for hundred-kilobyte tuples and so on. At the same time, the user might have different workload types, one with large tuples and another with relatively small ones, in which case batching by size rather than by row count gives more predictable performance.

[1] https://github.com/tarantool/tarantool/blob/master/perf/lua/1mops_write.lua
[2] https://www.notion.so/tarantool/relay-248de4a9600f4c4d8d83e81cf1104926

Closes tarantool#10161

NO_DOC=not a documentable feature
NO_TEST=covered by existing tests
Merged
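In outline, the buffering described above amounts to the following sketch; struct relay and struct xrow_header are the real types, but the relay_buf_*() helpers are hypothetical stand-ins for the actual relay internals, and only the 16 KB constant mirrors the xrow_buf_flush_size tweak:

```c
#include <stddef.h>

struct relay;
struct xrow_header;

/* Hypothetical helpers standing in for the real relay code. */
int relay_buf_encode(struct relay *relay, const struct xrow_header *row);
size_t relay_buf_size(const struct relay *relay);
int relay_buf_flush(struct relay *relay);

/* Default flush threshold, mirroring the xrow_buf_flush_size tweak. */
enum { XROW_BUF_FLUSH_SIZE = 16 * 1024 };

static int
relay_send_row_buffered(struct relay *relay, const struct xrow_header *row)
{
	/* Encode into the output buffer instead of issuing a write(). */
	if (relay_buf_encode(relay, row) != 0)
		return -1;
	/* One writev() then replaces potentially hundreds of write() calls. */
	if (relay_buf_size(relay) >= XROW_BUF_FLUSH_SIZE)
		return relay_buf_flush(relay);
	return 0;
}
```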
sergepetrenko
added a commit
to sergepetrenko/tarantool
that referenced
this issue
Jun 24, 2024
Relay sends rows one by one, issuing one `write()` system call per each sent row. This hurts performance quite a lot: in the 1mops_write test [1] on my machine master node is able to reach ~ 1.9 million RPS while replication performance peaks at ~ 0.5 million RPS. As shown by a PoC patch, batching the rows allows to remove the replication bottleneck almost completely. Even sending 2 rows per one write() call already doubles the performance, and sending 64 rows per one write() call allows the replica to catch up with master's speed [2]. This patch makes relay buffer rows until the total buffer size reaches a certain threshold, and then flush everything with a single writev() call. The flush threshold is configurable via a new tweak - xrow_buf_flush_size, which has a default value of 16 kb. A size threshold is chosen over simple row count, because it's more flexible for different row sizes: if the user has multi-kilobyte tuples, batching hundreds of such tuples might overflow the socket's send buffer size, resulting in extra wait until the send buffer becomes free again. This extra time might've been used for enqueuing extra rows. The problem would become even worse for hundred-kilobyte tuples and so on. At the same time the user might have different workload types. One with large tuples and other with relatively small ones, in which case batching by size rather than by row count is more predictable performance wise. [1] https://github.com/tarantool/tarantool/blob/master/perf/lua/1mops_write.lua [2] https://www.notion.so/tarantool/relay-248de4a9600f4c4d8d83e81cf1104926 Closes tarantool#10161 NO_DOC=not a documentable feature NO_TEST=covered by existing tests
sergepetrenko
added a commit
to sergepetrenko/small
that referenced
this issue
Jun 25, 2024
lsregion is a good candidate for an output buffer, maybe even better than obuf: obuf may be freed / flushed only completely, meaning you need two obufs to be able to simultaneously flush one of them and write to the other. At the same time, lsregion may be gradually freed up to the last flushed position without blocking the incoming writes.

In order to use lsregion as an output buffer, introduce the lsregion_to_iovec method, which simply enumerates all the allocated slabs and their sizes and saves their contents to the provided iovec array. Since the last flushed slab may still be used by further allocations (which may happen while the flushed data is being written), introduce an lsregion_svp helper struct, which tracks the last flushed position. The last flushed slab is determined by flush_id, the max_id used for allocations in this slab at the moment of flush.

Compared to obuf (which stores data in a set of <= 32 progressively growing iovecs), it's theoretically possible to overflow IOV_MAX (1024) when using lsregion and thus require multiple writev calls to flush it, but given that lsregion is usually created with the runtime slab arena, which allocates 4-megabyte slabs to store the data, this is highly unlikely: it would require allocating 4 gigabytes of data per flush.

Needed for tarantool/tarantool#10161
sergepetrenko
added a commit
to sergepetrenko/tarantool
that referenced
this issue
Jun 25, 2024
Relay sends rows one by one, issuing one `write()` system call per each sent row. This hurts performance quite a lot: in the 1mops_write test [1] on my machine master node is able to reach ~ 1.9 million RPS while replication performance peaks at ~ 0.5 million RPS. As shown by a PoC patch, batching the rows allows to remove the replication bottleneck almost completely. Even sending 2 rows per one write() call already doubles the performance, and sending 64 rows per one write() call allows the replica to catch up with master's speed [2]. This patch makes relay buffer rows until the total buffer size reaches a certain threshold, and then flush everything with a single writev() call. The flush threshold is configurable via a new tweak - xrow_buf_flush_size, which has a default value of 16 kb. A size threshold is chosen over simple row count, because it's more flexible for different row sizes: if the user has multi-kilobyte tuples, batching hundreds of such tuples might overflow the socket's send buffer size, resulting in extra wait until the send buffer becomes free again. This extra time might've been used for enqueuing extra rows. The problem would become even worse for hundred-kilobyte tuples and so on. At the same time the user might have different workload types. One with large tuples and other with relatively small ones, in which case batching by size rather than by row count is more predictable performance wise. [1] https://github.com/tarantool/tarantool/blob/master/perf/lua/1mops_write.lua [2] https://www.notion.so/tarantool/relay-248de4a9600f4c4d8d83e81cf1104926 Closes tarantool#10161 NO_DOC=not a documentable feature NO_TEST=covered by existing tests
sergepetrenko
added a commit
to sergepetrenko/small
that referenced
this issue
Jun 28, 2024
lsregion is a good candidate for an output buffer, maybe even better than obuf: obuf may be freed / flushed only completely, meaning you need two obufs to be able to simultaneously flush one of them and write to the other. At the same time, lsregion may be gradually freed up to the last flushed position without blocking the incoming writes.

In order to use lsregion as an output buffer, introduce the lsregion_to_iovec method, which simply enumerates all the allocated slabs and their sizes and saves their contents to the provided iovec array. Since the last flushed slab may still be used by further allocations (which may happen while the flushed data is being written), introduce an lsregion_svp helper struct, which tracks the last flushed position. The last flushed slab is determined by flush_id, the max_id used for allocations in this slab at the moment of flush.

Compared to obuf (which stores data in a set of <= 32 progressively growing iovecs), it's theoretically possible to overflow IOV_MAX (1024) when using lsregion and thus require multiple writev calls to flush it, but given that lsregion is usually created with the runtime slab arena, which allocates 4-megabyte slabs to store the data, this is highly unlikely: it would require allocating 4 gigabytes of data per flush.

Needed for tarantool/tarantool#10161
sergepetrenko
added a commit
to sergepetrenko/tarantool
that referenced
this issue
Jun 28, 2024
Relay sends rows one by one, issuing one `write()` system call per each sent row. This hurts performance quite a lot: in the 1mops_write test [1] on my machine master node is able to reach ~ 1.9 million RPS while replication performance peaks at ~ 0.5 million RPS. As shown by a PoC patch, batching the rows allows to remove the replication bottleneck almost completely. Even sending 2 rows per one write() call already doubles the performance, and sending 64 rows per one write() call allows the replica to catch up with master's speed [2]. This patch makes relay buffer rows until the total buffer size reaches a certain threshold, and then flush everything with a single writev() call. The flush threshold is configurable via a new tweak - xrow_buf_flush_size, which has a default value of 16 kb. A size threshold is chosen over simple row count, because it's more flexible for different row sizes: if the user has multi-kilobyte tuples, batching hundreds of such tuples might overflow the socket's send buffer size, resulting in extra wait until the send buffer becomes free again. This extra time might've been used for enqueuing extra rows. The problem would become even worse for hundred-kilobyte tuples and so on. At the same time the user might have different workload types. One with large tuples and other with relatively small ones, in which case batching by size rather than by row count is more predictable performance wise. [1] https://github.com/tarantool/tarantool/blob/master/perf/lua/1mops_write.lua [2] https://www.notion.so/tarantool/relay-248de4a9600f4c4d8d83e81cf1104926 Closes tarantool#10161 NO_DOC=not a documentable feature NO_TEST=covered by existing tests
sergepetrenko
added a commit
to sergepetrenko/tarantool
that referenced
this issue
Jul 2, 2024
Relay sends rows one by one, issuing one `write()` system call per each sent row. This hurts performance quite a lot: in the 1mops_write test [1] on my machine master node is able to reach ~ 1.9 million RPS while replication performance peaks at ~ 0.5 million RPS. As shown by a PoC patch, batching the rows allows to remove the replication bottleneck almost completely. Even sending 2 rows per one write() call already doubles the performance, and sending 64 rows per one write() call allows the replica to catch up with master's speed [2]. This patch makes relay buffer rows until the total buffer size reaches a certain threshold, and then flush everything with a single writev() call. The flush threshold is configurable via a new tweak - xrow_buf_flush_size, which has a default value of 16 kb. A size threshold is chosen over simple row count, because it's more flexible for different row sizes: if the user has multi-kilobyte tuples, batching hundreds of such tuples might overflow the socket's send buffer size, resulting in extra wait until the send buffer becomes free again. This extra time might've been used for enqueuing extra rows. The problem would become even worse for hundred-kilobyte tuples and so on. At the same time the user might have different workload types. One with large tuples and other with relatively small ones, in which case batching by size rather than by row count is more predictable performance wise. [1] https://github.com/tarantool/tarantool/blob/master/perf/lua/1mops_write.lua [2] https://www.notion.so/tarantool/relay-248de4a9600f4c4d8d83e81cf1104926 Closes tarantool#10161 NO_DOC=not a documentable feature NO_TEST=covered by existing tests
sergepetrenko
added a commit
to sergepetrenko/small
that referenced
this issue
Jul 2, 2024
lsregion is a good candidate for an output buffer, maybe even better than obuf: obuf may be freed / flushed only completely, meaning you need two obufs to be able to simultaneously flush one of them and write to the other. At the same time, lsregion may be gradually freed up to the last flushed position without blocking the incoming writes.

In order to use lsregion as an output buffer, introduce the lsregion_to_iovec method, which simply enumerates all the allocated slabs and their sizes and saves their contents to the provided iovec array. Since the last flushed slab may still be used by further allocations (which may happen while the flushed data is being written), introduce an lsregion_svp helper struct, which tracks the last flushed position. The last flushed slab is determined by flush_id, the max_id used for allocations in this slab at the moment of flush.

Compared to obuf (which stores data in a set of <= 32 progressively growing iovecs), it's theoretically possible to overflow IOV_MAX (1024) when using lsregion and thus require multiple writev calls to flush it, but given that lsregion is usually created with the runtime slab arena, which allocates 4-megabyte slabs to store the data, this is highly unlikely: it would require allocating 4 gigabytes of data per flush.

Needed for tarantool/tarantool#10161
sergepetrenko
added a commit
to sergepetrenko/tarantool
that referenced
this issue
Jul 2, 2024
Relay sends rows one by one, issuing one `write()` system call per each sent row. This hurts performance quite a lot: in the 1mops_write test [1] on my machine master node is able to reach ~ 1.9 million RPS while replication performance peaks at ~ 0.5 million RPS. As shown by a PoC patch, batching the rows allows to remove the replication bottleneck almost completely. Even sending 2 rows per one write() call already doubles the performance, and sending 64 rows per one write() call allows the replica to catch up with master's speed [2]. This patch makes relay buffer rows until the total buffer size reaches a certain threshold, and then flush everything with a single writev() call. The flush threshold is configurable via a new tweak - xrow_buf_flush_size, which has a default value of 16 kb. A size threshold is chosen over simple row count, because it's more flexible for different row sizes: if the user has multi-kilobyte tuples, batching hundreds of such tuples might overflow the socket's send buffer size, resulting in extra wait until the send buffer becomes free again. This extra time might've been used for enqueuing extra rows. The problem would become even worse for hundred-kilobyte tuples and so on. At the same time the user might have different workload types. One with large tuples and other with relatively small ones, in which case batching by size rather than by row count is more predictable performance wise. [1] https://github.com/tarantool/tarantool/blob/master/perf/lua/1mops_write.lua [2] https://www.notion.so/tarantool/relay-248de4a9600f4c4d8d83e81cf1104926 Closes tarantool#10161 NO_DOC=not a documentable feature NO_TEST=covered by existing tests
sergepetrenko
added a commit
to sergepetrenko/tarantool
that referenced
this issue
Jul 2, 2024
Co-authored-by: Elena Shebunyaeva <elena.shebunyaeva@gmail.com>
sergepetrenko
added a commit
to sergepetrenko/tarantool
that referenced
this issue
Jul 2, 2024
`xrow_header_encode` performs multiple tasks at once: * allocates memory for the row header (but not the body) * encodes the header contents into the allocated buffer * copies the body iovecs (if any) to the out iovec Let's rename it to `xrow_encode`, and do the same for `xrow_header_decode` for consistency. Factor out the header encoding code into `xrow_header_encode` function, which will only encode the header to the given buffer, but won't be concerned with other tasks. Introduce `xrow_header_sizeof`, which returns the real size this header will occupy once encoded. In-scope-of tarantool#10161 NO_CHANGELOG=refactoring NO_TEST=refactoring NO_DOC=refactoring
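The intended split of responsibilities can be sketched as follows; the exact prototypes in xrow.h are assumptions here, not the actual declarations:

```c
#include <stddef.h>
#include <sys/uio.h>

struct xrow_header;

/* Size the header will occupy once encoded, so callers can reserve space. */
size_t
xrow_header_sizeof(const struct xrow_header *header);

/*
 * Encode only the header into a caller-provided buffer and return the
 * position right past the encoded data.
 */
char *
xrow_header_encode(const struct xrow_header *header, char *pos);

/*
 * Former xrow_header_encode(): allocate memory for the header, encode it,
 * and copy the body iovecs (if any) into `out`.
 */
int
xrow_encode(const struct xrow_header *header, struct iovec *out);
```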
sergepetrenko
added a commit
to sergepetrenko/tarantool
that referenced
this issue
Jul 2, 2024
Relay sends rows one by one, issuing one `write()` system call per each sent row. This hurts performance quite a lot: in the 1mops_write test [1] on my machine master node is able to reach ~ 1.9 million RPS while replication performance peaks at ~ 0.5 million RPS. As shown by a PoC patch, batching the rows allows to remove the replication bottleneck almost completely. Even sending 2 rows per one write() call already doubles the performance, and sending 64 rows per one write() call allows the replica to catch up with master's speed [2]. This patch makes relay buffer rows until the total buffer size reaches a certain threshold, and then flush everything with a single writev() call. The flush threshold is configurable via a new tweak - xrow_buf_flush_size, which has a default value of 16 kb. A size threshold is chosen over simple row count, because it's more flexible for different row sizes: if the user has multi-kilobyte tuples, batching hundreds of such tuples might overflow the socket's send buffer size, resulting in extra wait until the send buffer becomes free again. This extra time might've been used for enqueuing extra rows. The problem would become even worse for hundred-kilobyte tuples and so on. At the same time the user might have different workload types. One with large tuples and other with relatively small ones, in which case batching by size rather than by row count is more predictable performance wise. [1] https://github.com/tarantool/tarantool/blob/master/perf/lua/1mops_write.lua [2] https://www.notion.so/tarantool/relay-248de4a9600f4c4d8d83e81cf1104926 Closes tarantool#10161 NO_DOC=not a documentable feature NO_TEST=covered by existing tests
sergepetrenko
added a commit
to tarantool/small
that referenced
this issue
Jul 5, 2024
lsregion is a good candidate for an output buffer, maybe even better than obuf: obuf may be freed / flushed only completely, meaning you need two obufs to be able to simultaneously flush one of them and write to the other. At the same time, lsregion may be gradually freed up to the last flushed position without blocking the incoming writes.

In order to use lsregion as an output buffer, introduce the lsregion_to_iovec method, which simply enumerates all the allocated slabs and their sizes and saves their contents to the provided iovec array. Since the last flushed slab may still be used by further allocations (which may happen while the flushed data is being written), introduce an lsregion_svp helper struct, which tracks the last flushed position. The last flushed slab is determined by flush_id, the max_id used for allocations in this slab at the moment of flush.

Compared to obuf (which stores data in a set of <= 32 progressively growing iovecs), it's theoretically possible to overflow IOV_MAX (1024) when using lsregion and thus require multiple writev calls to flush it, but given that lsregion is usually created with the runtime slab arena, which allocates 4-megabyte slabs to store the data, this is highly unlikely: it would require allocating 4 gigabytes of data per flush.

Needed for tarantool/tarantool#10161
sergepetrenko
added a commit
to sergepetrenko/tarantool
that referenced
this issue
Jul 5, 2024
New commits: * rlist: make its methods accept const arguments * lsregion: introduce lsregion_to_iovec method Prerequisite tarantool#10161 NO_CHANGELOG=submodule bump NO_TEST=submodule bump NO_DOC=submodule bump
sergepetrenko
added a commit
to sergepetrenko/tarantool
that referenced
this issue
Jul 5, 2024
`xrow_header_encode` performs multiple tasks at once: * allocates memory for the row header (but not the body) * encodes the header contents into the allocated buffer * copies the body iovecs (if any) to the out iovec Let's rename it to `xrow_encode`, and do the same for `xrow_header_decode` for consistency. Factor out the header encoding code into `xrow_header_encode` function, which will only encode the header to the given buffer, but won't be concerned with other tasks. In-scope-of tarantool#10161 NO_CHANGELOG=refactoring NO_TEST=refactoring NO_DOC=refactoring
sergepetrenko
added a commit
to sergepetrenko/tarantool
that referenced
this issue
Jul 5, 2024
In scope of tarantool#10161 NO_CHANGELOG=internal NO_TEST=internal NO_DOC=internal
sergepetrenko
added a commit
to sergepetrenko/tarantool
that referenced
this issue
Jul 5, 2024
Relay sends rows one by one, issuing one `write()` system call per each sent row. This hurts performance quite a lot: in the 1mops_write test [1] on my machine master node is able to reach ~ 1.9 million RPS while replication performance peaks at ~ 0.5 million RPS. As shown by a PoC patch, batching the rows allows to remove the replication bottleneck almost completely. Even sending 2 rows per one write() call already doubles the performance, and sending 64 rows per one write() call allows the replica to catch up with master's speed [2]. This patch makes relay buffer rows until the total buffer size reaches a certain threshold, and then flush everything with a single writev() call. The flush threshold is configurable via a new tweak - xrow_buf_flush_size, which has a default value of 16 kb. A size threshold is chosen over simple row count, because it's more flexible for different row sizes: if the user has multi-kilobyte tuples, batching hundreds of such tuples might overflow the socket's send buffer size, resulting in extra wait until the send buffer becomes free again. This extra time might've been used for enqueuing extra rows. The problem would become even worse for hundred-kilobyte tuples and so on. At the same time the user might have different workload types. One with large tuples and other with relatively small ones, in which case batching by size rather than by row count is more predictable performance wise. [1] https://github.com/tarantool/tarantool/blob/master/perf/lua/1mops_write.lua [2] https://www.notion.so/tarantool/relay-248de4a9600f4c4d8d83e81cf1104926 Closes tarantool#10161 NO_DOC=not a documentable feature NO_TEST=covered by existing tests
sergepetrenko
added a commit
to sergepetrenko/tarantool
that referenced
this issue
Jul 9, 2024
New commits: * rlist: make its methods accept const arguments * lsregion: introduce lsregion_to_iovec method Prerequisite tarantool#10161 NO_CHANGELOG=submodule bump NO_TEST=submodule bump NO_DOC=submodule bump
sergepetrenko
added a commit
to sergepetrenko/tarantool
that referenced
this issue
Jul 9, 2024
`xrow_header_encode` performs multiple tasks at once: * allocates memory for the row header (but not the body) * encodes the header contents into the allocated buffer * copies the body iovecs (if any) to the out iovec Let's rename it to `xrow_encode`, and do the same for `xrow_header_decode` for consistency. Factor out the header encoding code into `xrow_header_encode` function, which will only encode the header to the given buffer, but won't be concerned with other tasks. In-scope-of tarantool#10161 NO_CHANGELOG=refactoring NO_TEST=refactoring NO_DOC=refactoring
sergepetrenko
added a commit
to sergepetrenko/tarantool
that referenced
this issue
Jul 9, 2024
In scope of tarantool#10161 NO_CHANGELOG=internal NO_TEST=internal NO_DOC=internal
sergepetrenko
added a commit
to sergepetrenko/tarantool
that referenced
this issue
Jul 9, 2024
Relay sends rows one by one, issuing one `write()` system call per each sent row. This hurts performance quite a lot: in the 1mops_write test [1] on my machine master node is able to reach ~ 1.9 million RPS while replication performance peaks at ~ 0.5 million RPS. As shown by a PoC patch, batching the rows allows to remove the replication bottleneck almost completely. Even sending 2 rows per one write() call already doubles the performance, and sending 64 rows per one write() call allows the replica to catch up with master's speed [2]. This patch makes relay buffer rows until the total buffer size reaches a certain threshold, and then flush everything with a single writev() call. The flush threshold is configurable via a new tweak - xrow_buf_flush_size, which has a default value of 16 kb. A size threshold is chosen over simple row count, because it's more flexible for different row sizes: if the user has multi-kilobyte tuples, batching hundreds of such tuples might overflow the socket's send buffer size, resulting in extra wait until the send buffer becomes free again. This extra time might've been used for enqueuing extra rows. The problem would become even worse for hundred-kilobyte tuples and so on. At the same time the user might have different workload types. One with large tuples and other with relatively small ones, in which case batching by size rather than by row count is more predictable performance wise. [1] https://github.com/tarantool/tarantool/blob/master/perf/lua/1mops_write.lua [2] https://www.notion.so/tarantool/relay-248de4a9600f4c4d8d83e81cf1104926 Closes tarantool#10161 NO_DOC=not a documentable feature NO_TEST=covered by existing tests
sergepetrenko
added a commit
to sergepetrenko/small
that referenced
this issue
Jul 9, 2024
lsregion_reserve() was never used separately from lsregion_alloc in our code base, so a separate lsregion_reserve() implementation for asan wasn't needed. This is going to change, so let's implement the missing method. Also reenable basic reserve tests for asan build. Needed for tarantool/tarantool#10161
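For context, the reserve-then-alloc pattern the message refers to looks roughly like this; encode_row() and struct row are hypothetical, and the commit semantics of lsregion_alloc() after lsregion_reserve() are assumed from the usual small allocator contract:

```c
#include <stddef.h>
#include <stdint.h>
#include "small/lsregion.h"

struct row;
/* Hypothetical encoder: writes a row into `pos`, returns bytes used. */
size_t encode_row(char *pos, size_t max_size, const struct row *row);

static char *
buf_append_row(struct lsregion *buf, const struct row *row, size_t max_size,
	       int64_t id)
{
	/* Reserve an upper bound first... */
	char *pos = lsregion_reserve(buf, max_size);
	if (pos == NULL)
		return NULL;
	size_t used = encode_row(pos, max_size, row);
	/* ...then commit only the bytes actually used, under allocation id. */
	lsregion_alloc(buf, used, id);
	return pos;
}
```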
sergepetrenko
added a commit
to sergepetrenko/small
that referenced
this issue
Jul 9, 2024
lsregion_reserve() was never used separately from lsregion_alloc in our code base, so a separate lsregion_reserve() implementation for asan wasn't needed. This is going to change, so let's implement the missing method. Also reenable basic reserve tests for asan build. Needed for tarantool/tarantool#10161
sergepetrenko
added a commit
to sergepetrenko/small
that referenced
this issue
Jul 18, 2024
lsregion_reserve() was never used separately from lsregion_alloc in our code base, so a separate lsregion_reserve() implementation for asan wasn't needed. This is going to change, so let's implement the missing method. Also reenable basic reserve tests for asan build. Needed for tarantool/tarantool#10161
sergepetrenko
added a commit
to sergepetrenko/small
that referenced
this issue
Jul 19, 2024
lsregion_reserve() was never used separately from lsregion_alloc in our code base, so a separate lsregion_reserve() implementation for asan wasn't needed. This is going to change, so let's implement the missing method. Also reenable basic reserve tests for asan build. Needed for tarantool/tarantool#10161
sergepetrenko
added a commit
to sergepetrenko/tarantool
that referenced
this issue
Jul 19, 2024
New commits: * lsregion: implement lsregion_reserve for asan build Prerequisite tarantool#10161 NO_CHANGELOG=submodule bump NO_TEST=submodule bump NO_DOC=submodule bump
sergepetrenko
added a commit
to sergepetrenko/tarantool
that referenced
this issue
Jul 19, 2024
`xrow_header_encode` performs multiple tasks at once: * allocates memory for the row header (but not the body) * encodes the header contents into the allocated buffer * copies the body iovecs (if any) to the out iovec Let's rename it to `xrow_encode`, and do the same for `xrow_header_decode` for consistency. Factor out the header encoding code into `xrow_header_encode` function, which will only encode the header to the given buffer, but won't be concerned with other tasks. In-scope-of tarantool#10161 NO_CHANGELOG=refactoring NO_TEST=refactoring NO_DOC=refactoring
sergepetrenko
added a commit
to sergepetrenko/tarantool
that referenced
this issue
Jul 19, 2024
In scope of tarantool#10161 NO_CHANGELOG=internal NO_TEST=internal NO_DOC=internal
sergepetrenko
added a commit
to sergepetrenko/tarantool
that referenced
this issue
Jul 19, 2024
Relay sends rows one by one, issuing one `write()` system call per each sent row. This hurts performance quite a lot: in the 1mops_write test [1] on my machine master node is able to reach ~ 1.9 million RPS while replication performance peaks at ~ 0.5 million RPS. As shown by a PoC patch, batching the rows allows to remove the replication bottleneck almost completely. Even sending 2 rows per one write() call already doubles the performance, and sending 64 rows per one write() call allows the replica to catch up with master's speed [2]. This patch makes relay buffer rows until the total buffer size reaches a certain threshold, and then flush everything with a single writev() call. The flush threshold is configurable via a new tweak - xrow_buf_flush_size, which has a default value of 16 kb. A size threshold is chosen over simple row count, because it's more flexible for different row sizes: if the user has multi-kilobyte tuples, batching hundreds of such tuples might overflow the socket's send buffer size, resulting in extra wait until the send buffer becomes free again. This extra time might've been used for enqueuing extra rows. The problem would become even worse for hundred-kilobyte tuples and so on. At the same time the user might have different workload types. One with large tuples and other with relatively small ones, in which case batching by size rather than by row count is more predictable performance wise. [1] https://github.com/tarantool/tarantool/blob/master/perf/lua/1mops_write.lua [2] https://www.notion.so/tarantool/relay-248de4a9600f4c4d8d83e81cf1104926 Closes tarantool#10161 NO_DOC=not a documentable feature NO_TEST=covered by existing tests
sergepetrenko
added a commit
to sergepetrenko/small
that referenced
this issue
Jul 23, 2024
lsregion_reserve() was never used separately from lsregion_alloc in our code base, so a separate lsregion_reserve() implementation for asan wasn't needed. This is going to change, so let's implement the missing method. Also reenable basic reserve tests for asan build. Needed for tarantool/tarantool#10161
sergepetrenko
added a commit
to sergepetrenko/tarantool
that referenced
this issue
Jul 23, 2024
New commits: * lsregion: implement lsregion_reserve for asan build Prerequisite tarantool#10161 NO_CHANGELOG=submodule bump NO_TEST=submodule bump NO_DOC=submodule bump
sergepetrenko
added a commit
to sergepetrenko/tarantool
that referenced
this issue
Jul 23, 2024
`xrow_header_encode` performs multiple tasks at once: * allocates memory for the row header (but not the body) * encodes the header contents into the allocated buffer * copies the body iovecs (if any) to the out iovec Let's rename it to `xrow_encode`, and do the same for `xrow_header_decode` for consistency. Factor out the header encoding code into `xrow_header_encode` function, which will only encode the header to the given buffer, but won't be concerned with other tasks. In-scope-of tarantool#10161 NO_CHANGELOG=refactoring NO_TEST=refactoring NO_DOC=refactoring
sergepetrenko
added a commit
to sergepetrenko/tarantool
that referenced
this issue
Jul 23, 2024
In scope of tarantool#10161 NO_CHANGELOG=internal NO_TEST=internal NO_DOC=internal
sergepetrenko
added a commit
to sergepetrenko/tarantool
that referenced
this issue
Jul 23, 2024
Relay sends rows one by one, issuing one `write()` system call per each sent row. This hurts performance quite a lot: in the 1mops_write test [1] on my machine master node is able to reach ~ 1.9 million RPS while replication performance peaks at ~ 0.5 million RPS. As shown by a PoC patch, batching the rows allows to remove the replication bottleneck almost completely. Even sending 2 rows per one write() call already doubles the performance, and sending 64 rows per one write() call allows the replica to catch up with master's speed [2]. This patch makes relay buffer rows until the total buffer size reaches a certain threshold, and then flush everything with a single writev() call. The flush threshold is configurable via a new tweak - xrow_buf_flush_size, which has a default value of 16 kb. A size threshold is chosen over simple row count, because it's more flexible for different row sizes: if the user has multi-kilobyte tuples, batching hundreds of such tuples might overflow the socket's send buffer size, resulting in extra wait until the send buffer becomes free again. This extra time might've been used for enqueuing extra rows. The problem would become even worse for hundred-kilobyte tuples and so on. At the same time the user might have different workload types. One with large tuples and other with relatively small ones, in which case batching by size rather than by row count is more predictable performance wise. [1] https://github.com/tarantool/tarantool/blob/master/perf/lua/1mops_write.lua [2] https://www.notion.so/tarantool/relay-248de4a9600f4c4d8d83e81cf1104926 Closes tarantool#10161 NO_DOC=not a documentable feature NO_TEST=covered by existing tests
sergepetrenko
added a commit
to tarantool/small
that referenced
this issue
Jul 23, 2024
lsregion_reserve() was never used separately from lsregion_alloc in our code base, so a separate lsregion_reserve() implementation for asan wasn't needed. This is going to change, so let's implement the missing method. Also reenable basic reserve tests for asan build. Needed for tarantool/tarantool#10161
sergepetrenko
added a commit
to sergepetrenko/tarantool
that referenced
this issue
Jul 23, 2024
New commits: * lsregion: implement lsregion_reserve for asan build * matras: introduce `matras_needs_touch` and `matras_touch_no_check` * region: fix memleak in ASAN version * test: fix memory leaks reported by LSAN * lsregion: implement lsregion_reserve for asan build Prerequisite tarantool#10161 NO_CHANGELOG=submodule bump NO_TEST=submodule bump NO_DOC=submodule bump
sergepetrenko
added a commit
to sergepetrenko/tarantool
that referenced
this issue
Jul 23, 2024
`xrow_header_encode` performs multiple tasks at once: * allocates memory for the row header (but not the body) * encodes the header contents into the allocated buffer * copies the body iovecs (if any) to the out iovec Let's rename it to `xrow_encode`, and do the same for `xrow_header_decode` for consistency. Factor out the header encoding code into `xrow_header_encode` function, which will only encode the header to the given buffer, but won't be concerned with other tasks. In-scope-of tarantool#10161 NO_CHANGELOG=refactoring NO_TEST=refactoring NO_DOC=refactoring
sergepetrenko
added a commit
to sergepetrenko/tarantool
that referenced
this issue
Jul 23, 2024
In scope of tarantool#10161 NO_CHANGELOG=internal NO_TEST=internal NO_DOC=internal
sergepetrenko
added a commit
to sergepetrenko/tarantool
that referenced
this issue
Jul 23, 2024
Relay sends rows one by one, issuing one `write()` system call per each sent row. This hurts performance quite a lot: in the 1mops_write test [1] on my machine master node is able to reach ~ 1.9 million RPS while replication performance peaks at ~ 0.5 million RPS. As shown by a PoC patch, batching the rows allows to remove the replication bottleneck almost completely. Even sending 2 rows per one write() call already doubles the performance, and sending 64 rows per one write() call allows the replica to catch up with master's speed [2]. This patch makes relay buffer rows until the total buffer size reaches a certain threshold, and then flush everything with a single writev() call. The flush threshold is configurable via a new tweak - xrow_buf_flush_size, which has a default value of 16 kb. A size threshold is chosen over simple row count, because it's more flexible for different row sizes: if the user has multi-kilobyte tuples, batching hundreds of such tuples might overflow the socket's send buffer size, resulting in extra wait until the send buffer becomes free again. This extra time might've been used for enqueuing extra rows. The problem would become even worse for hundred-kilobyte tuples and so on. At the same time the user might have different workload types. One with large tuples and other with relatively small ones, in which case batching by size rather than by row count is more predictable performance wise. [1] https://github.com/tarantool/tarantool/blob/master/perf/lua/1mops_write.lua [2] https://www.notion.so/tarantool/relay-248de4a9600f4c4d8d83e81cf1104926 Closes tarantool#10161 NO_DOC=not a documentable feature NO_TEST=covered by existing tests
sergepetrenko
added a commit
to sergepetrenko/tarantool
that referenced
this issue
Jul 23, 2024
New commits: * test: fix memory leaks reported by LSAN * region: fix memleak in ASAN version * matras: introduce `matras_needs_touch` and `matras_touch_no_check` * lsregion: implement lsregion_reserve for asan build Prerequisite tarantool#10161 NO_CHANGELOG=submodule bump NO_TEST=submodule bump NO_DOC=submodule bump
sergepetrenko
added a commit
to sergepetrenko/tarantool
that referenced
this issue
Jul 23, 2024
`xrow_header_encode` performs multiple tasks at once: * allocates memory for the row header (but not the body) * encodes the header contents into the allocated buffer * copies the body iovecs (if any) to the out iovec Let's rename it to `xrow_encode`, and do the same for `xrow_header_decode` for consistency. Factor out the header encoding code into `xrow_header_encode` function, which will only encode the header to the given buffer, but won't be concerned with other tasks. In-scope-of tarantool#10161 NO_CHANGELOG=refactoring NO_TEST=refactoring NO_DOC=refactoring
sergepetrenko
added a commit
to sergepetrenko/tarantool
that referenced
this issue
Jul 23, 2024
In scope of tarantool#10161 NO_CHANGELOG=internal NO_TEST=internal NO_DOC=internal
sergepetrenko
added a commit
to sergepetrenko/tarantool
that referenced
this issue
Jul 23, 2024
Relay sends rows one by one, issuing one `write()` system call per each sent row. This hurts performance quite a lot: in the 1mops_write test [1] on my machine master node is able to reach ~ 1.9 million RPS while replication performance peaks at ~ 0.5 million RPS. As shown by a PoC patch, batching the rows allows to remove the replication bottleneck almost completely. Even sending 2 rows per one write() call already doubles the performance, and sending 64 rows per one write() call allows the replica to catch up with master's speed [2]. This patch makes relay buffer rows until the total buffer size reaches a certain threshold, and then flush everything with a single writev() call. The flush threshold is configurable via a new tweak - xrow_buf_flush_size, which has a default value of 16 kb. A size threshold is chosen over simple row count, because it's more flexible for different row sizes: if the user has multi-kilobyte tuples, batching hundreds of such tuples might overflow the socket's send buffer size, resulting in extra wait until the send buffer becomes free again. This extra time might've been used for enqueuing extra rows. The problem would become even worse for hundred-kilobyte tuples and so on. At the same time the user might have different workload types. One with large tuples and other with relatively small ones, in which case batching by size rather than by row count is more predictable performance wise. [1] https://github.com/tarantool/tarantool/blob/master/perf/lua/1mops_write.lua [2] https://www.notion.so/tarantool/relay-248de4a9600f4c4d8d83e81cf1104926 Closes tarantool#10161 NO_DOC=not a documentable feature NO_TEST=covered by existing tests
sergepetrenko
added a commit
to sergepetrenko/tarantool
that referenced
this issue
Jul 24, 2024
New commits: * test: fix memory leaks reported by LSAN * region: fix memleak in ASAN version * matras: introduce `matras_needs_touch` and `matras_touch_no_check` * lsregion: implement lsregion_reserve for asan build Prerequisite tarantool#10161 NO_CHANGELOG=submodule bump NO_TEST=submodule bump NO_DOC=submodule bump
sergepetrenko
added a commit
to sergepetrenko/tarantool
that referenced
this issue
Jul 24, 2024
`xrow_header_encode` performs multiple tasks at once: * allocates memory for the row header (but not the body) * encodes the header contents into the allocated buffer * copies the body iovecs (if any) to the out iovec Let's rename it to `xrow_encode`, and do the same for `xrow_header_decode` for consistency. Factor out the header encoding code into `xrow_header_encode` function, which will only encode the header to the given buffer, but won't be concerned with other tasks. In-scope-of tarantool#10161 NO_CHANGELOG=refactoring NO_TEST=refactoring NO_DOC=refactoring
sergepetrenko
added a commit
to sergepetrenko/tarantool
that referenced
this issue
Jul 24, 2024
In scope of tarantool#10161 NO_CHANGELOG=internal NO_TEST=internal NO_DOC=internal
sergepetrenko
added a commit
to sergepetrenko/tarantool
that referenced
this issue
Jul 24, 2024
Relay sends rows one by one, issuing one `write()` system call per each sent row. This hurts performance quite a lot: in the 1mops_write test [1] on my machine master node is able to reach ~ 1.9 million RPS while replication performance peaks at ~ 0.5 million RPS. As shown by a PoC patch, batching the rows allows to remove the replication bottleneck almost completely. Even sending 2 rows per one write() call already doubles the performance, and sending 64 rows per one write() call allows the replica to catch up with master's speed [2]. This patch makes relay buffer rows until the total buffer size reaches a certain threshold, and then flush everything with a single writev() call. The flush threshold is configurable via a new tweak - xrow_buf_flush_size, which has a default value of 16 kb. A size threshold is chosen over simple row count, because it's more flexible for different row sizes: if the user has multi-kilobyte tuples, batching hundreds of such tuples might overflow the socket's send buffer size, resulting in extra wait until the send buffer becomes free again. This extra time might've been used for enqueuing extra rows. The problem would become even worse for hundred-kilobyte tuples and so on. At the same time the user might have different workload types. One with large tuples and other with relatively small ones, in which case batching by size rather than by row count is more predictable performance wise. [1] https://github.com/tarantool/tarantool/blob/master/perf/lua/1mops_write.lua [2] https://www.notion.so/tarantool/relay-248de4a9600f4c4d8d83e81cf1104926 Closes tarantool#10161 NO_DOC=not a documentable feature NO_TEST=covered by existing tests
sergepetrenko
added a commit
that referenced
this issue
Jul 24, 2024
New commits: * test: fix memory leaks reported by LSAN * region: fix memleak in ASAN version * matras: introduce `matras_needs_touch` and `matras_touch_no_check` * lsregion: implement lsregion_reserve for asan build Prerequisite #10161 NO_CHANGELOG=submodule bump NO_TEST=submodule bump NO_DOC=submodule bump
sergepetrenko
added a commit
that referenced
this issue
Jul 24, 2024
`xrow_header_encode` performs multiple tasks at once: * allocates memory for the row header (but not the body) * encodes the header contents into the allocated buffer * copies the body iovecs (if any) to the out iovec Let's rename it to `xrow_encode`, and do the same for `xrow_header_decode` for consistency. Factor out the header encoding code into `xrow_header_encode` function, which will only encode the header to the given buffer, but won't be concerned with other tasks. In-scope-of #10161 NO_CHANGELOG=refactoring NO_TEST=refactoring NO_DOC=refactoring
sergepetrenko
added a commit
that referenced
this issue
Jul 24, 2024
In scope of #10161 NO_CHANGELOG=internal NO_TEST=internal NO_DOC=internal
sergepetrenko
added a commit
that referenced
this issue
Jul 24, 2024
Relay sends rows one by one, issuing one `write()` system call per each sent row. This hurts performance quite a lot: in the 1mops_write test [1] on my machine master node is able to reach ~ 1.9 million RPS while replication performance peaks at ~ 0.5 million RPS. As shown by a PoC patch, batching the rows allows to remove the replication bottleneck almost completely. Even sending 2 rows per one write() call already doubles the performance, and sending 64 rows per one write() call allows the replica to catch up with master's speed [2]. This patch makes relay buffer rows until the total buffer size reaches a certain threshold, and then flush everything with a single writev() call. The flush threshold is configurable via a new tweak - xrow_buf_flush_size, which has a default value of 16 kb. A size threshold is chosen over simple row count, because it's more flexible for different row sizes: if the user has multi-kilobyte tuples, batching hundreds of such tuples might overflow the socket's send buffer size, resulting in extra wait until the send buffer becomes free again. This extra time might've been used for enqueuing extra rows. The problem would become even worse for hundred-kilobyte tuples and so on. At the same time the user might have different workload types. One with large tuples and other with relatively small ones, in which case batching by size rather than by row count is more predictable performance wise. [1] https://github.com/tarantool/tarantool/blob/master/perf/lua/1mops_write.lua [2] https://www.notion.so/tarantool/relay-248de4a9600f4c4d8d83e81cf1104926 Closes #10161 NO_DOC=not a documentable feature NO_TEST=covered by existing tests
locker
pushed a commit
that referenced
this issue
Sep 9, 2024
New commits: * test: fix memory leaks reported by LSAN * region: fix memleak in ASAN version * matras: introduce `matras_needs_touch` and `matras_touch_no_check` * lsregion: implement lsregion_reserve for asan build Prerequisite #10161 NO_CHANGELOG=submodule bump NO_TEST=submodule bump NO_DOC=submodule bump (cherry picked from commit c191a1b)
CuriousGeorgiy
added a commit
to CuriousGeorgiy/tarantool
that referenced
this issue
Sep 22, 2024
To overcome the throughput limitation of synchronous transactions let's allow to commit them using asynchronous wait modes (2c66624). Such transactions will have the same consistency guarantees with `read-confirmed` and `linearizable` isolation levels. Changes made this way can be observable with the `read-committed` isolation level. Currently, we return an error when the synchronous queue is full for the only supported synchronous transaction commit mode, `complete`. However, following the journal queue waiting logic, let's return an error only for the `none` commit mode. Otherwise, we will wait for space in the synchronous queue. Closes tarantool#10583 @TarantoolBot document Title: Asynchronous commit modes for synchronous transactions Product: Tarantool Since: 3.3 Root document: https://www.tarantool.io/en/doc/latest/reference/reference_lua/box_txn_management/commit/ and https://www.tarantool.io/ru/doc/latest/reference/reference_lua/box_txn_management/atomic/ and https://www.tarantool.io/en/doc/latest/platform/replication/repl_sync/ lsregion: implement lsregion_reserve for asan build lsregion_reserve() was never used separately from lsregion_alloc in our code base, so a separate lsregion_reserve() implementation for asan wasn't needed. This is going to change, so let's implement the missing method. Also reenable basic reserve tests for asan build. Needed for tarantool#10161
Currently there is one write() per row in relay. This needs to be optimized to save on syscalls.
This is a reincarnation of an old ticket created by @kostja.