ioctl() for io_uring
The ioctl() name comes from "I/O control"; this system call was added as a way of performing operations on peripheral devices that went beyond reading and writing data. It could be used to rewind a tape drive, set the baud rate of a serial port, or eject a removable disk, for example. Over the years, uses of ioctl() have grown far beyond such simple applications, with some APIs (media, for example) providing hundreds of operations.
The criticism of ioctl() comes from its multiplexed and device-dependent nature; almost anything that can be represented by a file descriptor supports ioctl(), but the actual operations supported vary from one to the next. While system calls are (in theory, at least) closely scrutinized before being added to the kernel, ioctl() commands often receive close to no review at all. So nobody really knows everything that can be done with ioctl(). For added fun, there is some overlap in the command space, meaning that an ioctl() call made to the wrong file descriptor could have unexpected and highly unpleasant results. Attempts have been made to avoid this problem, but they have not been completely successful.
After dealing with these problems for years, some developers would like to see ioctl() disappear completely, but nobody has ever come up with a replacement that looks materially better. Adding a new system call for every function that might be implemented with ioctl() is a non-starter; having device drivers interpret command streams sent with write() is even worse. There probably is no better way to, for example, tell a camera sensor which color space to use.
It is natural to want to support ioctl() in io_uring; it is not uncommon to mix ioctl() calls with regular I/O, and it would be useful to be able to do everything asynchronously. But every ioctl() call is different, and none of them were designed for asynchronous execution, so an ioctl() implementation within io_uring would have no choice but to execute every call in a separate thread. That might be better than nothing, but it is not anywhere near as efficient as it could be, especially for calls that can be executed right away. Doing ioctl() right for io_uring essentially calls for reinventing the ioctl() interface.
Operations in io_uring are communicated from user space to the kernel via a ring buffer; each is represented as an instance of the somewhat complex io_uring_sqe structure. The new command mechanism is invoked by setting opcode in that structure to IORING_OP_URING_CMD; the fd field must, as usual, contain the file descriptor to operate on. The rest of the structure, though (starting with the off field) is overlaid with something completely different:
    struct io_uring_pdu {
	__u64 data[4];	/* available for free use */
	__u64 reserved;	/* can't be used by application! */
	__u64 data2;	/* available for free use */
    };
The reserved field overlays user_data in the original structure, which is needed for other purposes; thus, no data relevant to the command can be stored there. Applications are unlikely to see this structure, though; it will be overlaid yet again with a structure specific to the command to be executed. For block-subsystem commands, for example, this structure becomes:
    struct block_uring_cmd {
	__u16 op;
	__u16 pad;
	union {
	    __u32 size;
	    __u32 ioctl_cmd;
	};
	__u64 addr;
	__u64 unused[2];
	__u64 reserved;	/* can never be used */
	__u64 unused2;
    };
Deep down within this structure is ioctl_cmd, which the application should set to the ioctl() command code of interest; the op field should be BLOCK_URING_OP_IOCTL (for now; in the future there could be operations that are not tied to an ioctl() call). In the patch set, the only supported command is BLKBSZGET, which returns the block size of the underlying block device — something that can clearly be done without performing actual I/O or sleeping. The patch set also implements a couple of networking commands using a different structure.
Within the kernel, any subsystem that wants to support io_uring operations must add yet another field to the forever-growing file_operations structure:
    int (*uring_cmd)(struct io_uring_cmd *, enum io_uring_cmd_flags);
The io_uring_cmd structure passed to this function looks like:
    struct io_uring_cmd {
	struct file *file;
	struct io_uring_pdu pdu;
	void (*done)(struct io_uring_cmd *, ssize_t);
    };
Needless to say, any handlers for io_uring IORING_OP_URING_CMD operations should not block. Instead, they can complete the operation immediately, return an error indicating that the operation would block, or run the operation asynchronously and signal completion by calling the given done() function.
This is an initial posting of a change that could have long-term implications, so it would not be surprising to see significant changes before it makes it into the mainline. Indeed, in response to a comment from Darrick Wong, Axboe tweaked the interface to provide eight more bytes of space in struct io_uring_pdu — something that Wong said would be highly useful for submitting the "millions upon millions of ioctl calls" created by the xfs_scrub utility.
Whether the addition of an ioctl()-like interface to io_uring — which is rapidly evolving into a sort of shadow, asynchronous system-call interface for Linux — will generate controversy remains to be seen; there has been none in response to the initial posting. Axboe expressed hope that the new commands will be "a lot more sane and useful" than the existing ioctl() commands, but there doesn't seem to be any way to enforce that. As with ioctl(), the addition of new io_uring commands will happen entirely within other subsystems, and the level of scrutiny those additions receive will vary. But io_uring needs this sort of "miscellaneous command" capability in the same way that the system as a whole needs ioctl(), so it would be surprising if this feature were not eventually merged in some form.
Index entries for this article
Kernel: ioctl()
Kernel: io_uring
Posted Feb 4, 2021 18:11 UTC (Thu)
by johill (subscriber, #25196)
[Link] (6 responses)
I'm not even completely joking - think about doing a sort of 'grep-like' thing completely in kernel and only pushing out stuff you cared about.
Posted Feb 5, 2021 0:35 UTC (Fri)
by jpsamaroo (guest, #129727)
[Link] (1 responses)
Posted Feb 5, 2021 8:32 UTC (Fri)
by johill (subscriber, #25196)
[Link]
The bigger issue I expect is the root vs. non-root discussion, though in some use cases I suppose it might be sufficient to allow root to use BPF programs in io_uring.
And, tbh, the fact that it's not clear what a BPF program could even _do_. I mean, io_uring would've copied out data to userspace, and then BPF would have to copy it back in? That doesn't make much sense. So you'd have to have kernel buffers here? Unless it gets restricted to *managing* the operations.
Posted Feb 5, 2021 14:12 UTC (Fri)
by rhdxmr (guest, #44404)
[Link] (3 responses)
Posted Feb 6, 2021 20:14 UTC (Sat)
by matthias (subscriber, #94967)
[Link] (2 responses)
Posted Feb 8, 2021 8:05 UTC (Mon)
by johill (subscriber, #25196)
[Link] (1 responses)
Posted Feb 8, 2021 9:31 UTC (Mon)
by Sesse (subscriber, #53779)
[Link]
io_uring, on the other hand, I can perfectly well see lots of uses for (and I have used it myself).
Posted Feb 4, 2021 23:57 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link] (8 responses)
Posted Feb 5, 2021 0:50 UTC (Fri)
by wahern (subscriber, #37304)
[Link] (7 responses)
Posted Feb 5, 2021 1:37 UTC (Fri)
by roc (subscriber, #30627)
[Link] (2 responses)
Posted Feb 6, 2021 1:21 UTC (Sat)
by wahern (subscriber, #37304)
[Link] (1 responses)
But of course, if there's a TOCTTOU exploit then I would expect the code (including the *existing* kernel code) to copy locally first. It's not like there can't be TOCTTOU races with a flat structure. TOCTTOU races are why you can't use BPF to filter raw path names before permitting an open: the process could overwrite the pathname after the BPF filter runs but before it's passed to the internal open routines. To do it correctly, the pathname would need to be copied to a local, in-kernel buffer, but for various reasons that's not a solution most people like.
Posted Feb 6, 2021 2:13 UTC (Sat)
by Cyberax (✭ supporter ✭, #52523)
[Link]
> It's not like there can't be TOCTTOU races with a flat structure.
Posted Feb 5, 2021 4:08 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link]
Moreover, this will make it much easier for BPF hooks to inspect the content or for the audit subsystem to log it. A lot of stuff will become possible.
The Windows NT kernel has just such a design, and it really makes some things easier.
Posted Feb 6, 2021 10:20 UTC (Sat)
by ibukanov (guest, #3942)
[Link] (2 responses)
In languages that allow more abstraction and overloading of short names, that problem does not exist, and one can have a rather convenient way to access the data even when it is not aligned.
Posted Feb 6, 2021 13:28 UTC (Sat)
by sbaugh (guest, #103291)
[Link] (1 responses)
Posted Feb 6, 2021 15:43 UTC (Sat)
by ibukanov (guest, #3942)
[Link]
But again, programming with far/near pointers in Borland C++ in 1992 was too painful. The company where I got my first programming job abandoned that and used 32-bit pointers for everything, even if it harmed performance and led to fatter binaries. But then again, C++ was a new thing at the time, and the company was totally sold on the idea of objects leading to tight coupling of data and code, whereas segments require things like data-oriented programming.
Posted Feb 5, 2021 1:37 UTC (Fri)
by roc (subscriber, #30627)
[Link] (6 responses)
Why?
Posted Feb 5, 2021 9:50 UTC (Fri)
by sur5r (subscriber, #61490)
[Link] (5 responses)
Posted Feb 5, 2021 10:12 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link] (2 responses)
And moreover, quite often devices already require some kind of structured data for their IO streams.
Posted Feb 8, 2021 11:16 UTC (Mon)
by metan (subscriber, #74107)
[Link] (1 responses)
Posted Feb 8, 2021 15:19 UTC (Mon)
by abatters (✭ supporter ✭, #6932)
[Link]
They already exist; see preadv2() and pwritev2().
Posted Feb 5, 2021 10:59 UTC (Fri)
by roc (subscriber, #30627)
[Link]
Posted Feb 11, 2021 3:16 UTC (Thu)
by cozzyd (guest, #110972)
[Link]
Posted Feb 5, 2021 12:26 UTC (Fri)
by itsmycpu (guest, #139639)
[Link] (2 responses)
Any info based on actual measurements appreciated!
Posted Feb 5, 2021 22:47 UTC (Fri)
by dougg (guest, #1894)
[Link]
I do similar measurements with scsi_debug devices in place of nullblk devices.
Posted Feb 18, 2021 7:46 UTC (Thu)
by ksandstr (guest, #60862)
[Link]
It would seem that for ordinary "control" syscalls io_uring gains mostly from reducing the number of trips through the syscall interface when asynchronous jobs are submitted and their results retrieved in bulk. The downside is that CPUs are rather good at syscalls, and even better at dealing with data already in the cache. So the bar to beating a good old syscall is rather high even post-Meltdown.
Without even microbenchmarks it's hard to say for sure, but it seems that io_uring would be strictly worse for either throughput or latency when not applied to a sufficiently concurrent bulk use case[0]. Furthermore, example cases where io_uring use would be a net gain aren't readily apparent, in particular because it necessarily returns data generated in an off-core syscall to the originating core where (say) threads would go on with the rest of their operations first without their data leaving the core at all.
I'm rather skeptical of the necessity and utility of io_uring in real-world use cases. So far it looks like something that got its foot in the door and is now expanding in scope to become something like what XCB was for Xorg.
[0] bulk meaning many small calls, few large ones, or a happy medium of some kind.
Posted Aug 4, 2021 6:47 UTC (Wed)
by optimistyzy (guest, #152790)
[Link]
Without any syscalls or copying buffers to user space, all I/O-related tasks can be achieved.
ioctls are far from a simple kernel interface. Mach is actually a good example that shows it can work.
Any realistic implementation would copy the structure buffer into kernel-controlled memory immediately. So yes, structures would solve the TOCTTOU issues once and for all.
This is kind of nonsense, though. The kernel ends up chasing pointers into userspace and having to deal with TOCTOU races all the time. With a buffer, you can copy it once as a solid block of memory, and then the kernel code can be sure nothing changes underneath it. In most cases we would probably just be able to map the buffer onto a C struct and use it without any fear.
a) if you submit many as a batch
b) if you submit them one by one.
And is there any example code in user space to use this nice feature?