Automatic buffer selection for io_uring
As io_uring developer Jens Axboe describes in this patch set, the usual mode for programs that handle large numbers of file descriptors is to use poll() to find out which descriptors are ready for I/O, then make separate calls to actually perform that I/O. One could use io_uring in this mode, but doing so defeats one of the purposes of the whole exercise: avoiding system calls whenever possible. The io_uring way of doing things is to just queue an asynchronous operation on every file descriptor, then react to the resulting events whenever one of those operations is executed.
Working that way can indeed reduce system calls — all the way to zero if the request ring is kept full. But it also requires allocating a separate I/O buffer for each of those queued operations, even though many of them may not execute for an indefinite period of time. The poll() method, instead, allows an application to defer buffer allocation until a buffer is actually needed. Losing that flexibility can result in significantly higher memory use for applications that keep a large number of operations outstanding.
What is needed here is some sort of mechanism that allows buffers to be allocated to operations after they have been queued in the ring. The answer is, of course, obvious: add a hook for a BPF program that can perform buffer management in the kernel at the moment that any particular operation is able to go forward. Why even try anything else? Unfortunately, Axboe said, "I had a hard time imagining how life times of the buffer could be managed through [BPF]", so that idea went by the wayside.
Fortunately, there was another idea waiting in the wings: have the application provide one or more buffer pools to io_uring, which would then select a buffer from one of those pools whenever one is needed. That is what Axboe ended up implementing.
To use this mechanism, an application starts by queuing one or more IORING_OP_PROVIDE_BUFFERS operations to give a set of I/O buffers to the kernel. Each operation includes the base address of the buffer(s), a count of buffers, the size (the same for all buffers in this operation), a base buffer ID, and a group ID. If more than one buffer is included in the request, the buffer ID will be incremented by one for each after the first. There is no requirement that all buffers in a given group be the same size, but that seems to be the way that the mechanism is intended to be used.
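As a rough illustration of the parameters just listed, the sketch below fills in a submission queue entry by hand. It assumes the field layout of the interface as it was eventually merged (the same mapping that liburing's io_uring_prep_provide_buffers() helper uses) and kernel UAPI headers from 5.7 or later; the patch set under discussion may have differed in detail. The function name prep_provide_buffers() is made up for this example.

```c
#include <linux/io_uring.h>   /* kernel UAPI header; assumes 5.7+ headers */
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: describe `count` buffers of `size` bytes each,
 * laid out back to back starting at `base`. The buffers get IDs
 * first_id, first_id + 1, ... and are added to group `group_id`.
 * This only fills in the entry; it does not submit anything.
 */
void prep_provide_buffers(struct io_uring_sqe *sqe, void *base,
                          unsigned size, int count,
                          unsigned short group_id, unsigned short first_id)
{
    memset(sqe, 0, sizeof(*sqe));
    sqe->opcode = IORING_OP_PROVIDE_BUFFERS;
    sqe->fd = count;                        /* number of buffers */
    sqe->addr = (uint64_t)(uintptr_t)base;  /* base address */
    sqe->len = size;                        /* size of each buffer */
    sqe->off = first_id;                    /* ID of the first buffer */
    sqe->buf_group = group_id;              /* group to add them to */
}
```

An application using liburing would normally call the library helper rather than setting these fields itself; the point here is only to show where each value ends up.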
Subsequently, operations can be queued without providing buffers at submission time; instead, the IOSQE_BUFFER_SELECT flag is set in the queue entry. The new buf_group field in the queue entry should be set to the ID of the group from which a buffer should be obtained when needed. When an operation unblocks and can proceed, the kernel will grab a buffer from the indicated group and use it. The size of the buffer is not considered during the selection process so, if the buffer is too small, the operation will not be able to complete properly. The ID of the selected buffer is returned with the operation's completion status.
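The submission and completion sides of this might look roughly like the following. Again, this is a sketch against the interface as merged, assuming 5.7+ UAPI headers: the buffer ID is carried in the upper bits of the completion entry's flags field and marked with IORING_CQE_F_BUFFER. The two function names are invented for illustration.

```c
#include <linux/io_uring.h>   /* kernel UAPI header; assumes 5.7+ headers */
#include <string.h>

/* Hypothetical helper: a recv() that lets the kernel pick the buffer. */
void prep_recv_select(struct io_uring_sqe *sqe, int sockfd,
                      unsigned max_len, unsigned short group_id)
{
    memset(sqe, 0, sizeof(*sqe));
    sqe->opcode = IORING_OP_RECV;
    sqe->fd = sockfd;
    sqe->addr = 0;                      /* no buffer at submission time */
    sqe->len = max_len;                 /* most data we will accept */
    sqe->flags |= IOSQE_BUFFER_SELECT;  /* ask for automatic selection */
    sqe->buf_group = group_id;          /* group to take the buffer from */
}

/* Hypothetical helper: ID of the buffer the kernel chose, or -1 if none. */
int completion_buffer_id(const struct io_uring_cqe *cqe)
{
    if (!(cqe->flags & IORING_CQE_F_BUFFER))
        return -1;
    return (int)(cqe->flags >> IORING_CQE_BUFFER_SHIFT);
}
```

Since the kernel ignores buffer size when selecting, a caller would want max_len to be no larger than the smallest buffer in the group.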
If the requested buffer group is empty, the operation will fail with an ENOBUFS error. Once a buffer has been consumed by an operation, the kernel will not use it again until it has been given back with another IORING_OP_PROVIDE_BUFFERS request.
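The consume-and-return life cycle can be modeled in a few lines of ordinary user-space C. This is a toy model of the semantics described above, not kernel code: the structure, the function names, the fixed capacity, and the order in which buffers are handed out are all assumptions made for the example.

```c
#include <errno.h>
#include <stddef.h>

#define GROUP_CAPACITY 64   /* arbitrary limit for this toy model */

/* Toy stand-in for one buffer group: just the IDs currently available. */
struct buf_group {
    unsigned short ids[GROUP_CAPACITY];
    size_t count;
};

/* Models IORING_OP_PROVIDE_BUFFERS: add `nr` buffers with consecutive IDs. */
int group_provide(struct buf_group *g, unsigned short first_id, size_t nr)
{
    if (nr > GROUP_CAPACITY - g->count)
        return -ENOMEM;   /* the model's own limit, not a kernel rule */
    for (size_t i = 0; i < nr; i++)
        g->ids[g->count++] = (unsigned short)(first_id + i);
    return 0;
}

/* Models selection: hand out one buffer ID, or fail if the group is empty. */
int group_select(struct buf_group *g)
{
    if (g->count == 0)
        return -ENOBUFS;
    return g->ids[--g->count];   /* gone until it is provided again */
}
```

In this model, as in the description above, a selected buffer stays out of the group until the application explicitly provides it again, which is why a busy application would be expected to re-provide buffers as soon as it has finished with their contents.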
Only some operations support buffer selection in the current patch set; it is limited to read(), readv(), recv(), and recvmsg(). Earlier versions of the patch set supported write(), though your editor will freely admit to being baffled with regard to how that was supposed to actually work even after looking at the code; that support was removed in version 3.
This work has not yet found its way into linux-next, but there is still
some time before the 5.7 merge window opens. So there is a chance that the
buffer-selection feature could yet land in the next development cycle.
That will increase the flexibility of io_uring operations, and no BPF hooks
are required.
Index entries for this article | |
---|---|
Kernel | Asynchronous I/O |
Kernel | io_uring |
Posted Mar 20, 2020 15:27 UTC (Fri)
by axboe (subscriber, #904)
[Link]
> This work has not yet found its way into linux-next, but there is still some time before the 5.7 merge window opens.
It is actually queued up, since a few weeks ago.
Posted Mar 21, 2020 16:27 UTC (Sat)
by nilsmeyer (guest, #122604)
[Link] (5 responses)
Posted Mar 21, 2020 16:41 UTC (Sat)
by corbet (editor, #1)
[Link] (1 responses)
Posted Mar 21, 2020 17:23 UTC (Sat)
by nilsmeyer (guest, #122604)
[Link]
Posted Mar 22, 2020 6:27 UTC (Sun)
by me@jasonclinton.com (subscriber, #52701)
[Link]
Posted Mar 22, 2020 6:36 UTC (Sun)
by miquels (guest, #59247)
[Link]
Posted Mar 23, 2020 6:37 UTC (Mon)
by jezuch (subscriber, #52988)
[Link]
Posted Mar 23, 2020 19:50 UTC (Mon)
by lorddoskias (subscriber, #95746)
[Link] (1 responses)
> Each operation includes the base address of the buffer(s), a count of buffers, the size (the same for all buffers in this operation)
The very next sentence says: "There is no requirement that all buffers in a given group be the same size." I assume one IORING_OP_PROVIDE_BUFFERS call provides one group of N buffers. So which one is true?
Posted Mar 23, 2020 20:10 UTC (Mon)
by corbet (editor, #1)
[Link]
Posted Mar 26, 2020 4:24 UTC (Thu)
by NYKevin (subscriber, #129325)
[Link]
Was it perhaps intended to provide some sort of "reuse the buffer from the previous operation" semantic?
I would guess there's not much yet outside of company data centers. Some of the features we are seeing, including the buffer selection one, are being influenced by the needs of the in-progress PostgreSQL support work, though.
Software in the wild
The latest Samba has support for an io_uring VFS module.
Both are true. You can make multiple IORING_OP_PROVIDE_BUFFERS calls to add buffers to the same group; indeed, that will be the normal mode of operation assuming that buffers are used more than once. Different calls can provide different sizes.
Buffer sizes