Configure larger BufReader for send and receive by rawler · Pull Request #263 · tiny-http/tiny-http · GitHub

Configure larger BufReader for send and receive #263


Closed
wants to merge 1 commit into from
Conversation

rawler (Collaborator) commented Jun 4, 2024

A larger BufReader can yield slightly faster transfers at slightly lower system overhead.

An example benchmark, sending 512MB (cached) from local disk over 127.0.0.1, 100 times using 4 concurrent transfers:

| BUFFER_SIZE | latency P50 | latency P95 | CPU USR | CPU SYS |
|-------------|-------------|-------------|---------|---------|
| 1KB         | 817ms       | 891ms       | 0.76s   | 19.89s  |
| 32KB        | 774ms       | 868ms       | 1.30s   | 18.50s  |
| 128KB       | 711ms       | 774ms       | 1.18s   | 16.77s  |

In more complex scenarios, the difference can be higher. The tradeoff is more memory used for each concurrent connection.

rawler (Collaborator, Author) commented Jun 4, 2024

This is to be seen as a basis for discussion. To be clear, I'm not sure increasing the send and receive buffers for connections to 32KB is acceptable for all uses of tiny-http. At least on Linux machines, the default limit on threads seems to be 1024. With the 1-thread-per-connection model of tiny-http, the potential memory use here would be 2 × 32 × 1024 KB, or 64MB.


The net effect we're after: tracing one of our apps in production, we discovered that the syscall sending packets works with really small buffers of just 8KB, leading to a huge number of syscalls when transferring a 22GB file over a 50Gbit network. This 8KB transfer buffer is the default behavior of Rust's std::io::copy, which looks at any known buffer sizes of the reader and writer but raises the copy buffer to at least 8KB.

Increasing the send buffer in our production environment significantly improved throughput and reduced system load, but since this environment also receives highly fluctuating production traffic, it's difficult to present an exact benchmark.


As an alternative to configuring the sizes of BufReader and BufWriter, one could consider writing a custom implementation of std::io::copy instead, using a local stack buffer of ~64KB. Doing so would ensure that the memory is only used while actually streaming the response, and increasing the use of the thread stack might inflate memory less than a fixed allocation per connection. The bigger downside is that std::io::copy does have a few specializations for the cases when R and W are raw file-descriptor-backed streams, using low-level kernel splicing for extremely efficient copying. Not sure if that's ever applicable to tiny-http?

Another option would be to let users configure buffer-size for copying, but that would require a more flexible way to construct the server than exists today.

rawler (Collaborator, Author) commented Jun 4, 2024

NVM. I just realized the variant we actually tested in prod (manual reimpl of std::io::copy) is way outperforming the BufWriter-size solution.

4x100 connections of 512MB with a 32KB copy-buffer here only took a P95 of 616ms, using 0.21 + 13.45s USR and SYS CPU. I'll try to produce an adapted PR tomorrow.

rawler closed this Jun 4, 2024