ioctl() for io_uring

By Jonathan Corbet
February 4, 2021
Of all the system calls in the Unix tradition, few are as maligned as ioctl(). But ioctl() exists for a reason — for many reasons, in truth — and cannot be expected to go away anytime soon. It is thus unsurprising that there is interest in providing ioctl()-like functionality in the io_uring subsystem. A recent RFC patch set from Jens Axboe shows the form that this feature might take in the io_uring context.

The ioctl() name comes from "I/O control"; this system call was added as a way of performing operations on peripheral devices that went beyond reading and writing data. It could be used to rewind a tape drive, set the baud rate of a serial port, or eject a removable disk, for example. Over the years, uses of ioctl() have grown far beyond such simple applications, with some APIs (media, for example) providing hundreds of operations.
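
As a concrete example of the classic interface, the following snippet asks the terminal driver for the current window size with the venerable TIOCGWINSZ command; everything here is longstanding Linux API:

    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(void)
    {
	struct winsize ws;

	/* Ask the tty driver how big the window is */
	if (ioctl(STDOUT_FILENO, TIOCGWINSZ, &ws) < 0) {
	    perror("ioctl");
	    return 1;
	}
	printf("%d rows, %d columns\n", ws.ws_row, ws.ws_col);
	return 0;
    }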

The criticism of ioctl() comes from its multiplexed and device-dependent nature; almost anything that can be represented by a file descriptor supports ioctl(), but the actual operations supported vary from one to the next. While system calls are (in theory, at least) closely scrutinized before being added to the kernel, ioctl() commands often receive close to no review at all. So nobody really knows everything that can be done with ioctl(). For added fun, there is some overlap in the command space, meaning that an ioctl() call made to the wrong file descriptor could have unexpected and highly unpleasant results. Attempts have been made to avoid this problem, but they have not been completely successful.
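
The main collision-avoidance attempt is the kernel's _IO() macro family, which encodes a per-driver "magic" byte, a command number, the transfer direction, and the argument size into each command code. As a sketch (the driver, its structure, and its commands are invented for illustration), a driver might define its commands this way:

    #include <linux/ioctl.h>

    struct foo_status {
	int ready;
	int error_count;
    };

    #define FOO_IOC_MAGIC	'f'
    /* Read a struct foo_status back from the driver */
    #define FOO_GET_STATUS	_IOR(FOO_IOC_MAGIC, 1, struct foo_status)
    /* Pass an integer mode value to the driver */
    #define FOO_SET_MODE	_IOW(FOO_IOC_MAGIC, 2, int)

Nothing enforces that two drivers pick different magic numbers, though, which is part of why the scheme has not been completely successful.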

After dealing with these problems for years, some developers would like to see ioctl() disappear completely, but nobody has ever come up with a replacement that looks materially better. Adding a new system call for every function that might be implemented with ioctl() is a non-starter; having device drivers interpret command streams sent with write() is even worse. There probably is no better way to, for example, tell a camera sensor which color space to use.

It is natural to want to support ioctl() in io_uring; it is not uncommon to mix ioctl() calls with regular I/O, and it would be useful to be able to do everything asynchronously. But every ioctl() call is different, and none of them were designed for asynchronous execution, so an ioctl() implementation within io_uring would have no choice but to execute every call in a separate thread. That might be better than nothing, but it is not anywhere near as efficient as it could be, especially for calls that can be executed right away. Doing ioctl() right for io_uring essentially calls for reinventing the ioctl() interface.

Operations in io_uring are communicated from user space to the kernel via a ring buffer; each is represented as an instance of the somewhat complex io_uring_sqe structure. The new command mechanism is invoked by setting opcode in that structure to IORING_OP_URING_CMD; the fd field must, as usual, contain the file descriptor to operate on. The rest of the structure (starting with the off field), though, is overlaid with something completely different:

    struct io_uring_pdu {
	__u64 data[4];	/* available for free use */
	__u64 reserved;	/* can't be used by application! */
	__u64 data2;	/* available for free use */
    };

The reserved field overlays user_data in the original structure, which is needed for other purposes; thus, no data relevant to the command can be stored there. Applications are unlikely to see this structure, though; it will be overlaid yet again with a structure specific to the command to be executed. For block-subsystem commands, for example, this structure becomes:

    struct block_uring_cmd {
	__u16 	op;
	__u16	pad;
	union {
	    __u32	size;
	    __u32	ioctl_cmd;
	};
	__u64	addr;
	__u64	unused[2];
	__u64	reserved;	/* can never be used */
	__u64	unused2;
    };

Deep down within this structure is ioctl_cmd, which the application should set to the ioctl() command code of interest; the op field should be BLOCK_URING_OP_IOCTL (for now; in the future there could be operations that are not tied to an ioctl() call). In the patch set, the only supported command is BLKBSZGET, which returns the block size of the underlying block device — something that can clearly be done without performing actual I/O or sleeping. The patch set also implements a couple of networking commands using a different structure.
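
To make the proposed interface concrete, here is a rough sketch of what a user-space submission of BLKBSZGET might look like, using liburing for the setup. The IORING_OP_URING_CMD and BLOCK_URING_OP_IOCTL constants and the placement of the structure within the submission queue entry all come from the RFC patch set and could well change before (and if) any of this is merged:

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>
    #include <linux/fs.h>	/* BLKBSZGET */
    #include <liburing.h>

    /* Sketch only; error handling omitted. */
    int block_size_async(const char *dev)
    {
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	struct block_uring_cmd *bcmd;
	int fd = open(dev, O_RDONLY), ret;

	io_uring_queue_init(8, &ring, 0);
	sqe = io_uring_get_sqe(&ring);
	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = IORING_OP_URING_CMD;	/* proposed opcode */
	sqe->fd = fd;
	sqe->user_data = 42;	/* this is why pdu.reserved is off-limits */

	/* The command-specific structure overlays the SQE from off onward */
	bcmd = (struct block_uring_cmd *)&sqe->off;
	bcmd->op = BLOCK_URING_OP_IOCTL;
	bcmd->ioctl_cmd = BLKBSZGET;

	io_uring_submit(&ring);
	io_uring_wait_cqe(&ring, &cqe);
	ret = cqe->res;		/* the block size, or a negative error */
	io_uring_cqe_seen(&ring, cqe);
	io_uring_queue_exit(&ring);
	close(fd);
	return ret;
    }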

Within the kernel, any subsystem that wants to support io_uring commands must add yet another field to the forever-growing file_operations structure:

    int (*uring_cmd)(struct io_uring_cmd *, enum io_uring_cmd_flags);

The io_uring_cmd structure passed to this handler looks like:

    struct io_uring_cmd {
	struct file *file;
	struct io_uring_pdu pdu;
	void (*done)(struct io_uring_cmd *, ssize_t);
    };

Needless to say, any handlers for io_uring IORING_OP_URING_CMD operations should not block. Instead, they can complete the operation immediately, return an error indicating that the operation would block, or run the operation asynchronously and signal completion by calling the given done() function.
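
As a hypothetical illustration of that contract (the handler name and the way the device is located are invented here; only the structures and constants come from the RFC), a block-layer handler that completes BLKBSZGET inline might look something like:

    /* Sketch of a uring_cmd() handler; BLKBSZGET needs no I/O or
     * sleeping, so it can be completed immediately. */
    static int blkdev_uring_cmd(struct io_uring_cmd *cmd,
				enum io_uring_cmd_flags flags)
    {
	struct block_uring_cmd *bcmd = (struct block_uring_cmd *)&cmd->pdu;
	struct block_device *bdev = I_BDEV(cmd->file->f_mapping->host);

	switch (bcmd->op) {
	case BLOCK_URING_OP_IOCTL:
		if (bcmd->ioctl_cmd == BLKBSZGET)
			return block_size(bdev);	/* immediate completion */
		return -ENOTTY;		/* unknown ioctl command */
	default:
		return -EINVAL;
	}
    }

A command that required actual I/O would instead either return an error indicating that it would block or kick off the operation and call cmd->done() on completion.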

This is an initial posting of a change that could have long-term implications, so it would not be surprising to see significant changes before it makes it into the mainline. Indeed, in response to a comment from Darrick Wong, Axboe tweaked the interface to provide eight more bytes of space in struct io_uring_pdu — space that Wong said would be highly useful for submitting the "millions upon millions of ioctl calls" created by the xfs_scrub utility.

Whether the addition of an ioctl()-like interface to io_uring — which is rapidly evolving into a sort of shadow, asynchronous system-call interface for Linux — will generate controversy remains to be seen; there has been none in response to the initial posting. Axboe expressed hope that the new commands will be "a lot more sane and useful" than the existing ioctl() commands, but there doesn't seem to be any way to enforce that. As with ioctl(), the addition of new io_uring commands will happen entirely within other subsystems, and the level of scrutiny those additions receive will vary. But io_uring needs this sort of "miscellaneous command" capability in the same way that the system as a whole needs ioctl(), so it would be surprising if this feature were not eventually merged in some form.

Index entries for this article
Kernel: ioctl()
Kernel: io_uring



ioctl() for io_uring

Posted Feb 4, 2021 18:11 UTC (Thu) by johill (subscriber, #25196) [Link] (6 responses)

So when will io_uring grow support for running BPF programs in the middle? :-)

I'm not even completely joking - think about doing a sort of 'grep-like' thing completely in kernel and only pushing out stuff you cared about.

ioctl() for io_uring

Posted Feb 5, 2021 0:35 UTC (Fri) by jpsamaroo (guest, #129727) [Link] (1 responses)

Honestly, I thought the exact same thing just before seeing your comment. I think it's probably only a short time before we see someone posting such a patch to the ML. Of course, that could become an easy way to DDoS one's kernel, unless the kernel limits the number of io_uring-submitted BPF programs that can be running at once.

ioctl() for io_uring

Posted Feb 5, 2021 8:32 UTC (Fri) by johill (subscriber, #25196) [Link]

Well, there are sleepable BPF programs (now, see commit 1e6c62a88215) so with some further work in that area it could simply be mandated that such programs be used in that kind of context.

The bigger issue I expect is the root vs. non-root discussion, though in some use cases I suppose it might be sufficient to allow root to use BPF programs in io_uring.

And, tbh, the fact that it's not clear what a BPF program could even _do_. I mean, io_uring would've copied out data to userspace, and then BPF would have to copy it back in? That doesn't make much sense. So you'd have to have kernel buffers here? Unless it gets restricted to *managing* the operations.

ioctl() for io_uring

Posted Feb 5, 2021 14:12 UTC (Fri) by rhdxmr (guest, #44404) [Link] (3 responses)

That would be awesome: io_uring supplies data to a BPF program, and all of the processing is done inside the kernel using BPF. All I/O-related tasks could be achieved without any syscalls or copying buffers to userspace.

ioctl() for io_uring

Posted Feb 6, 2021 20:14 UTC (Sat) by matthias (subscriber, #94967) [Link] (2 responses)

That is still not thought through to the end. Clearly, BPF needs to be able to issue new io_uring calls. After all, the calls to be made might depend on the processing of the data. We should definitely get rid of all those unnecessary context switches and abandon userspace altogether ;)

ioctl() for io_uring

Posted Feb 8, 2021 8:05 UTC (Mon) by johill (subscriber, #25196) [Link] (1 responses)

Isn't Linux a kind of BPF-(micro-?)kernel with legacy userspace support? ;-)

ioctl() for io_uring

Posted Feb 8, 2021 9:31 UTC (Mon) by Sesse (subscriber, #53779) [Link]

People keep bringing up this meme, but I honestly don't understand what all the fuss is about. I keep looking for uses for eBPF, but I really can't find them—there are some admin tools to show me slow I/Os that I've used maybe once, and that's it. Is this really such a revolution?

io_uring, on the other hand, I can perfectly well see lots of uses for (and I have used it myself).

ioctl() for io_uring

Posted Feb 4, 2021 23:57 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (8 responses)

They're missing a chance to make ioctl() calls completely self-contained (i.e. no pointers to other structures).

ioctl() for io_uring

Posted Feb 5, 2021 0:50 UTC (Fri) by wahern (subscriber, #37304) [Link] (7 responses)

That would be ideal, but I suspect that for complex structures, and especially ones with dynamically sized components, there'll be a parade of buffer exploits for the inevitable ad hoc serializing and deserializing hacks people will write. (How many years did it take for the pace of /proc exploits to slow down?) The whole point of ioctl is to avoid the need for [de]marshaling in the kernel, and while it obviously has its own pitfalls and history of exploits, there's decades of pruning and security fixes there. If avoiding the churn and inevitable introduction of exploits means continuing to do some (though hopefully *less*) pointer chasing, so be it. There are worse possibilities than keeping around some copy_from_user calls, especially with page table isolation-like mitigations in place.

ioctl() for io_uring

Posted Feb 5, 2021 1:37 UTC (Fri) by roc (subscriber, #30627) [Link] (2 responses)

So every implementer of ioctls in the kernel continues to be vulnerable to TOCTOU vulnerabilities? awesome.

ioctl() for io_uring

Posted Feb 6, 2021 1:21 UTC (Sat) by wahern (subscriber, #37304) [Link] (1 responses)

My point was that *some* uses of pointers couldn't be easily replaced with a simple, flat structure. (I was responding to a comment about making *all* ioctls "completely self-contained".) In such cases ad hoc solutions could end up being much more complex, and bug-prone, than pointer chasing. You can see how hairy things can get with sysctl structures on macOS, where someone tried too hard to pack things into a flat buffer and ended up introducing an exploit. And there's a checkered history of people trying to solve the problem "once and for all" by making use of XDR and other interchange formats for simple kernel interfaces.

But of course if there's a TOCTTOU exploit then I would expect the code (including the *existing* kernel code) to copy locally first. It's not like there can't be TOCTTOU races with a flat structure. TOCTTOU races are why you can't use BPF to filter raw path names before permitting an open--because the process could overwrite the pathname after the BPF filter but before it's passed to the internal open routines. To do it correctly the pathname would need to be copied to a local, in-kernel buffer, but for various reasons that's not a solution most people like.

ioctl() for io_uring

Posted Feb 6, 2021 2:13 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link]

> And there's a checkered history of people trying to solve the problem "once and for all" by making use of XDR and other interchange formats for simple kernel interfaces.
ioctls are far from a simple kernel interface. Mach is actually a good example that shows it can work.

> It's not like there can't be TOCTTOU races with a flat structure.
Any realistic implementation would copy the structure buffer into kernel-controlled memory immediately. So yep, structures would solve the TOCTTOU issues once and for all.

ioctl() for io_uring

Posted Feb 5, 2021 4:08 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]

> The whole point of ioctl is to avoid the need for [de]marshaling in the kernel
This is kind of nonsense, though. The kernel ends up chasing pointers into userspace and having to deal with TOCTOU races all the time. With a buffer you can copy it once as a solid block of memory, and then the kernel code can be sure nothing changes underneath it. In most cases we would probably just be able to map the buffer onto a C struct and use it without any fear.

Moreover, this will make it much easier for BPF hooks to inspect the content or for the audit subsystem to log it. A lot of stuff will become possible.

The Windows NT kernel has just such a design and it really makes some stuff easier.

ioctl() for io_uring

Posted Feb 6, 2021 10:20 UTC (Sat) by ibukanov (guest, #3942) [Link] (2 responses)

Part of the problem is that in C there is no good way to work with packed structs or arrays accessed not through a pointer but through an offset, with verification of that offset. If one designs the format carefully, with everything properly aligned, one can cast offsets into pointers, but then the verification part is still manual and error-prone.

In languages allowing more abstraction and the overloading of short names, that problem does not exist, and one can have a rather convenient way to access the data even when it is not aligned.

ioctl() for io_uring

Posted Feb 6, 2021 13:28 UTC (Sat) by sbaugh (guest, #103291) [Link] (1 responses)

It's interesting to observe that programming with pointers which are actually offsets inside regions was widespread for a while, but then abandoned: segmented memory and near/far pointers. It would be neat if the legacy segmentation features of x86 could be reused to make deserialization safe...

ioctl() for io_uring

Posted Feb 6, 2021 15:43 UTC (Sat) by ibukanov (guest, #3942) [Link]

It was not abandoned entirely. Google used the segment registers to implement NaCl on x86.

But again, programming with far/near pointers in Borland C++ in 1992 was too painful. The company where I got my first programming job abandoned that and used 32-bit pointers for everything, even if it harmed performance and led to fatter binaries. But then again, C++ was a new thing then, and the company was totally sold on the idea of objects leading to tight coupling of data and code, whereas segments require things like data-oriented programming.

ioctl() for io_uring

Posted Feb 5, 2021 1:37 UTC (Fri) by roc (subscriber, #30627) [Link] (6 responses)

> having device drivers interpret command streams sent with write() is even worse.

Why?

ioctl() for io_uring

Posted Feb 5, 2021 9:50 UTC (Fri) by sur5r (subscriber, #61490) [Link] (5 responses)

Because in-band signalling is inherently error-prone as SS5 and 2600 have taught us.

ioctl() for io_uring

Posted Feb 5, 2021 10:12 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

It doesn't have to be in-band.

And moreover, quite often devices already require some kind of structured data for their IO streams.

ioctl() for io_uring

Posted Feb 8, 2021 11:16 UTC (Mon) by metan (subscriber, #74107) [Link] (1 responses)

Btw we already have out-of-band signalling on file descriptors; for instance, sockets have MSG_ERRQUEUE. Too bad the read() and write() syscalls have no flags we can use to pass something like this in order to read/write serialized configuration instead of data. Maybe it's time to add read2() and write2() with an additional flag argument? :-)

ioctl() for io_uring

Posted Feb 8, 2021 15:19 UTC (Mon) by abatters (✭ supporter ✭, #6932) [Link]

> Maybe it's time add read2() and write2() with additional flag argument?

They already exist; see preadv2() and pwritev2().

ioctl() for io_uring

Posted Feb 5, 2021 10:59 UTC (Fri) by roc (subscriber, #30627) [Link]

I don't think that analogy works at all. It is completely routine for kernel APIs to accept structured data as input, and putting payload data inline with the control data does not make serialization any easier or harder. I say that write()ing formatted packets containing both the control and payload data is significantly *less* error-prone than juggling multiple buffers. If there's an "obvious" argument against having drivers interpret command streams sent via write(), I don't think this is it.

ioctl() for io_uring

Posted Feb 11, 2021 3:16 UTC (Thu) by cozzyd (guest, #110972) [Link]

You could have devices expose multiple files so a separate fd could be used for out of band control. But I'm not sure it solves more problems than it creates.

ioctl() for io_uring

Posted Feb 5, 2021 12:26 UTC (Fri) by itsmycpu (guest, #139639) [Link] (2 responses)

Does anyone know roughly what the overhead is, in nanoseconds, for making an asynchronous call with io_uring:
a) if you submit many as a batch
b) if you submit them one by one.

Any info based on actual measurements appreciated!

ioctl() for io_uring

Posted Feb 5, 2021 22:47 UTC (Fri) by dougg (guest, #1894) [Link]

Make some nullblk devices and use fio with the io_uring engine where the target device(s) is one or more nullblk devices. Then you can do the measurements yourself. You will be measuring the fio, io_uring, block layer and null_blk overhead with the latter being pretty small unless a lot of data is being moved.

I do similar measurements with scsi_debug devices in place of nullblk devices.

ioctl() for io_uring

Posted Feb 18, 2021 7:46 UTC (Thu) by ksandstr (guest, #60862) [Link]

Since io_uring amounts to an early-Direct3D-style "command buffer" interface, its time overhead is spent in de-/serialization, dispatch, reassembly, and cross-core cache latency as inputs and outputs move back and forth, with some I$ effect on top. Dispatch may be particularly expensive when kernel structures must be allocated and threads spawned to complete syscalls off-core.

It would seem that for ordinary "control" syscalls io_uring gains mostly from reducing the number of trips through the syscall interface when asynchronous jobs are submitted and their results retrieved in bulk. The downside is that CPUs are rather good at syscalls, and even better at dealing with data already in the cache. So the bar to beating a good old syscall is rather high even post-Meltdown.

Without even microbenchmarks it's hard to say for sure, but it seems that io_uring would be strictly worse for either throughput or latency when not applied to a sufficiently concurrent bulk use case[0]. Furthermore, example cases where io_uring use would be a net gain aren't readily apparent, in particular because it necessarily returns data generated by an off-core syscall to the originating core, whereas (say) threads making plain syscalls would go on with the rest of their operations without their data ever leaving the core at all.

I'm rather skeptical of the necessity and utility of io_uring in real-world use cases. So far it looks like something that got its foot in the door and is now expanding in scope to become something like what XCB was for Xorg.

[0] bulk meaning many small calls, few large ones, or a happy medium of some kind.

ioctl() for io_uring

Posted Aug 4, 2021 6:47 UTC (Wed) by optimistyzy (guest, #152790) [Link]

Which Linux kernel version will this feature be supported in?
And is there any example code in user space that uses this nice feature?


Copyright © 2021, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds