
Automatic buffer selection for io_uring

By Jonathan Corbet
March 20, 2020
The io_uring subsystem has, in the last year, redefined how asynchronous I/O is done on Linux systems. As this subsystem grows in both capability and users, though, it starts to run into limitations in the types of operations that can be expressed. That is driving a number of changes in how operations are programmed for io_uring. One example is the set of mechanisms considered for carrying a file descriptor between operations, which were covered here in early March. Another has to do with how I/O buffers are chosen for operations.

As io_uring developer Jens Axboe describes in this patch set, the usual mode for programs that handle large numbers of file descriptors is to use poll() to find out which descriptors are ready for I/O, then to make separate calls to actually perform that I/O. One could use io_uring in this mode, but it defeats one of the purposes of the whole exercise: avoiding system calls whenever possible. The io_uring way of doing things is to just queue an asynchronous operation on every file descriptor, then react to the resulting events whenever one of those operations is executed.

Working that way can indeed reduce system calls — all the way to zero if the request ring is kept full. But it also requires allocating a separate I/O buffer for each of those queued operations, even though many of them may not execute for an indefinite period of time. The poll() method, instead, allows an application to defer buffer allocation until a buffer is actually needed. Losing that flexibility can result in significantly higher memory use for applications that keep a large number of operations outstanding.

What is needed here is some sort of mechanism that allows buffers to be allocated to operations after they have been queued in the ring. The answer is, of course, obvious: add a hook for a BPF program that can perform buffer management in the kernel at the moment that any particular operation is able to go forward. Why even try anything else? Unfortunately, Axboe said, "I had a hard time imagining how life times of the buffer could be managed through [BPF]", so that idea went by the wayside.

Fortunately, there was another idea waiting in the wings: have the application provide one or more buffer pools to io_uring, which would then select a buffer from one of those pools whenever one is needed. That is what Axboe ended up implementing.

To use this mechanism, an application starts by queuing one or more IORING_OP_PROVIDE_BUFFERS operations to give a set of I/O buffers to the kernel. Each operation includes the base address of the buffer(s), a count of buffers, the size (the same for all buffers in this operation), a base buffer ID, and a group ID. If more than one buffer is included in the request, the buffer ID will be incremented by one for each after the first. There is no requirement that all buffers in a given group be the same size, but that seems to be the way that the mechanism is intended to be used.
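The parameters listed above can be sketched in user space. What follows is a minimal model, not kernel code and not the io_uring API: the names struct provide_req, struct pool_buf, and expand_provide() are hypothetical, and the layout (buffer i starting i * size bytes past the base address) is an assumption about how a multi-buffer request is carved up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space mirror of the parameters carried by one
 * IORING_OP_PROVIDE_BUFFERS request; not a kernel structure. */
struct provide_req {
    unsigned char *base;    /* base address of the first buffer */
    unsigned int   count;   /* number of buffers in this request */
    unsigned int   size;    /* size of each buffer (same for all) */
    uint16_t       base_id; /* ID assigned to the first buffer */
    uint16_t       group;   /* group the buffers are added to */
};

/* One buffer as a pool would see it once the request is expanded. */
struct pool_buf {
    unsigned char *addr;
    unsigned int   size;
    uint16_t       id;
    uint16_t       group;
};

/* Expand a request into individual buffers. Buffer i is assumed to
 * start i * size bytes past the base and gets the ID base_id + i.
 * Returns the number of entries written, at most max. */
static unsigned int expand_provide(const struct provide_req *req,
                                   struct pool_buf *out, unsigned int max)
{
    unsigned int n = req->count < max ? req->count : max;

    for (unsigned int i = 0; i < n; i++) {
        out[i].addr  = req->base + (size_t)i * req->size;
        out[i].size  = req->size;
        out[i].id    = (uint16_t)(req->base_id + i);
        out[i].group = req->group;
    }
    return n;
}
```

A request for four 1024-byte buffers with a base ID of 10 would thus yield IDs 10 through 13, all in the same group.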

Subsequently, operations can be queued without providing buffers at submission time; instead, the IOSQE_BUFFER_SELECT flag is set in the queue entry. The new buf_group field in the queue entry should be set to the ID of the group from which a buffer should be obtained when needed. When an operation unblocks and can proceed, the kernel will grab a buffer from the indicated group and use it. The size of the buffer is not considered during the selection process so, if the buffer is too small, the operation will not be able to complete properly. The ID of the selected buffer is returned with the operation's completion status.
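As a rough illustration of both ends of that exchange, here is a sketch using stand-in structures rather than the real ones. The DEMO_ constants and the demo_sqe and demo_cqe types are local to this example; the values are meant to mirror what the mainline <linux/io_uring.h> header defines, with the buffer ID carried in the upper bits of the completion's flags field, but that encoding is not spelled out in the text above and should be checked against the header in use.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins for illustration only; the values are believed to
 * match mainline <linux/io_uring.h> but are an assumption here. */
#define DEMO_IOSQE_BUFFER_SELECT     (1U << 5)
#define DEMO_IORING_CQE_F_BUFFER     (1U << 0)
#define DEMO_IORING_CQE_BUFFER_SHIFT 16

/* Cut-down, hypothetical versions of the submission and completion
 * entries, holding only the fields this example touches. */
struct demo_sqe { uint8_t flags; uint64_t addr; uint16_t buf_group; };
struct demo_cqe { int32_t res; uint32_t flags; };

/* Ask for a buffer to be picked from `group` when the operation runs,
 * instead of supplying one at submission time. */
static void demo_request_buffer(struct demo_sqe *sqe, uint16_t group)
{
    sqe->addr = 0;                          /* no buffer supplied */
    sqe->flags |= DEMO_IOSQE_BUFFER_SELECT; /* select one later */
    sqe->buf_group = group;
}

/* Extract the ID of the selected buffer from a completion. Returns 1
 * and stores the ID if a buffer was selected, 0 otherwise. */
static int demo_selected_buffer(const struct demo_cqe *cqe, uint16_t *id)
{
    if (!(cqe->flags & DEMO_IORING_CQE_F_BUFFER))
        return 0;
    *id = (uint16_t)(cqe->flags >> DEMO_IORING_CQE_BUFFER_SHIFT);
    return 1;
}
```

The application needs that ID to locate the data that was read and, eventually, to hand the buffer back to the kernel.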

If the requested buffer group is empty, the operation will fail with an ENOBUFS error. Once a buffer has been consumed by an operation, the kernel will not use it again until it has been given back with another IORING_OP_PROVIDE_BUFFERS request.
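The life cycle of a group can be modeled in a few lines of user-space C. This is only a sketch of the semantics described here: struct buf_group, group_provide(), and group_select() are hypothetical names, the fixed capacity is arbitrary, and the order in which the model hands out buffers (last in, first out) is not a claim about what the kernel does.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define GROUP_CAP 8 /* arbitrary capacity for this model */

struct group_buf { unsigned int size; uint16_t id; };

/* Hypothetical model of one buffer group. */
struct buf_group {
    struct group_buf bufs[GROUP_CAP];
    unsigned int     count; /* buffers currently available */
};

/* Add one buffer to the group, as a provide-buffers request would.
 * Returns 0, or -ENOMEM if the model's fixed array is full. */
static int group_provide(struct buf_group *g, uint16_t id, unsigned int size)
{
    if (g->count == GROUP_CAP)
        return -ENOMEM;
    g->bufs[g->count].id = id;
    g->bufs[g->count].size = size;
    g->count++;
    return 0;
}

/* Take a buffer for an operation that is ready to run. The size is
 * deliberately not consulted. Returns 0 and fills *out, or -ENOBUFS
 * if the group is empty. */
static int group_select(struct buf_group *g, struct group_buf *out)
{
    if (g->count == 0)
        return -ENOBUFS;
    *out = g->bufs[--g->count];
    return 0;
}
```

Once group_select() has handed a buffer out, the model will not return it again until the application passes it back through group_provide(), which is the recycling step that a real application would perform after it has finished with the data.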

Only some operations support buffer selection in the current patch set; it is limited to read(), readv(), recv(), and recvmsg(). Earlier versions of the patch set supported write(), though your editor will freely admit to being baffled with regard to how that was supposed to actually work even after looking at the code; that support was removed in version 3.

This work has not yet found its way into linux-next, but there is still some time before the 5.7 merge window opens. So there is a chance that the buffer-selection feature could yet land in the next development cycle. That will increase the flexibility of io_uring operations, and no BPF hooks are required.


Automatic buffer selection for io_uring

Posted Mar 20, 2020 15:27 UTC (Fri) by axboe (subscriber, #904) [Link]

Thanks for covering this work! One minor correction:

> This work has not yet found its way into linux-next, but there is still some time before the 5.7 merge window opens.

It is actually queued up, since a few weeks ago.

Automatic buffer selection for io_uring

Posted Mar 21, 2020 16:27 UTC (Sat) by nilsmeyer (guest, #122604) [Link] (5 responses)

Is there any software in the wild already using io_uring?

Software in the wild

Posted Mar 21, 2020 16:41 UTC (Sat) by corbet (editor, #1) [Link] (1 responses)

I would guess there's not much yet outside of company data centers. Some of the features we are seeing, including the buffer selection one, are being influenced by the needs of the in-progress PostgreSQL support work, though.

Software in the wild

Posted Mar 21, 2020 17:23 UTC (Sat) by nilsmeyer (guest, #122604) [Link]

I suppose that, since most distributions use rather old (feature-wise) kernels, there isn't much of an installed base.

Automatic buffer selection for io_uring

Posted Mar 22, 2020 6:27 UTC (Sun) by me@jasonclinton.com (subscriber, #52701) [Link]

There's an embedded database called sled that was presented at FOSDEM this year: https://fosdem.org/2020/schedule/event/rust_techniques_sled/

Automatic buffer selection for io_uring

Posted Mar 22, 2020 6:36 UTC (Sun) by miquels (guest, #59247) [Link]

The latest Samba has support for an io_uring vfs module.

Automatic buffer selection for io_uring

Posted Mar 23, 2020 6:37 UTC (Mon) by jezuch (subscriber, #52988) [Link]

I had a vision that, for example, Rust's new async-await could have an I/O scheduler, in addition to the task scheduler, which would automagically use io_uring behind the scenes (and a compat one for older kernels). I have no idea if this is actually happening, but this is such a brilliant idea that they absolutely should do it ;)

Automatic buffer selection for io_uring

Posted Mar 23, 2020 19:50 UTC (Mon) by lorddoskias (subscriber, #95746) [Link] (1 responses)

I'm confused by the description regarding buffer size. Initially it's claimed that all buffers must have identical size, presumably because in the struct describing the request there is a single size_t:

> Each operation includes the base address of the buffer(s), a count of buffers, the size (the same for all buffers in this operation)

The very next sentence says: "There is no requirement that all buffers in a given group be the same size". I assume one IORING_OP_PROVIDE_BUFFERS call provides 1 group of N buffers. So which one is true?

Buffer sizes

Posted Mar 23, 2020 20:10 UTC (Mon) by corbet (editor, #1) [Link]

Both are true. You can make multiple IORING_OP_PROVIDE_BUFFERS calls to add buffers to the same group; indeed, that will be the normal mode of operation assuming that buffers are used more than once. Different calls can provide different sizes.

Automatic buffer selection for io_uring

Posted Mar 26, 2020 4:24 UTC (Thu) by NYKevin (subscriber, #129325) [Link]

> Earlier versions of the patch set supported write(), though your editor will freely admit to being baffled with regard to how that was supposed to actually work even after looking at the code; that support was removed in version 3.

Was it perhaps intended to provide some sort of "reuse the buffer from the previous operation" semantic?


Copyright © 2020, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds