Filesystem-oriented flags: sad, messy and not going away
One set of filesystem-oriented flags will be familiar to almost anybody who has worked with the Unix system-call API: the O_ flags supported by calls like open(). These flags affect how the call operates in a number of ways; O_CREAT will cause the named file to be created if it does not already exist, O_NOFOLLOW causes the open to fail if the final component in the name is a symbolic link, O_NONBLOCK requests non-blocking operation, and so on. Some of those flags affect the lookup process (O_NOFOLLOW, for example) while others, like O_NONBLOCK, affect how the file descriptor created by the call will behave. All are part of one flag namespace that is recognized by all of the open() family of system calls.
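A minimal sketch of that distinction, assuming Linux and glibc: O_CREAT changes what open() does, O_NONBLOCK is recorded on the resulting descriptor (visible through fcntl(F_GETFL)), and O_NOFOLLOW changes the lookup itself. The helper names are hypothetical, chosen for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: create `path` if it is missing (O_CREAT affects what
 * open() does) and make the new descriptor non-blocking (O_NONBLOCK affects
 * how the descriptor behaves afterward). Returns fd or -errno. */
static int create_nonblocking(const char *path)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_NONBLOCK, 0644);
    return fd >= 0 ? fd : -errno;
}

/* O_NONBLOCK is stored on the open file description, so F_GETFL reports it. */
static int fd_is_nonblocking(int fd)
{
    int fl = fcntl(fd, F_GETFL);
    return fl >= 0 && (fl & O_NONBLOCK) != 0;
}

/* Hypothetical helper: O_NOFOLLOW affects the lookup. If the final path
 * component is a symbolic link, Linux fails the open with ELOOP.
 * Returns fd or -errno. */
static int open_nofollow(const char *path)
{
    int fd = open(path, O_RDONLY | O_NOFOLLOW);
    return fd >= 0 ? fd : -errno;
}
```

Calling open_nofollow() on a symbolic link returns -ELOOP, while the same call on the regular file the link points to succeeds.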
open() is one way to create a new entry in a directory; link() is another. When the time came to add flags to link(), the linkat() system call was born; this system call also follows the other relatively new pattern of accepting a file descriptor for the directory in which the operation is to be performed. linkat() has a separate flag namespace (the "AT_ flags") with flags like AT_SYMLINK_FOLLOW, which is the opposite of O_NOFOLLOW. There is also an AT_SYMLINK_NOFOLLOW that is not recognized by linkat(), but which is understood by calls like fchmodat() and execveat(). There are more AT_ flags, such as AT_NO_AUTOMOUNT, supported by the relatively new statx() system call.
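The sketch below, assuming Linux and glibc, contrasts the two AT_ flags described above. linkat() accepts AT_SYMLINK_FOLLOW (by default, Linux links a symbolic link itself rather than its target), while fstatat() stands in here as one of the calls that understand AT_SYMLINK_NOFOLLOW. The helper names are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>      /* AT_FDCWD and the AT_ flags */
#include <sys/stat.h>   /* fstatat(), S_ISLNK() */
#include <unistd.h>     /* linkat() */

/* Hypothetical helper: hard-link `target` as `newname`, both relative to
 * the directory file descriptor `dirfd` (AT_FDCWD means the current
 * directory). AT_SYMLINK_FOLLOW makes linkat() dereference `target` when
 * it is a symbolic link; without it, Linux links the symlink itself.
 * Returns 0 or -errno. */
static int link_following(int dirfd, const char *target, const char *newname)
{
    if (linkat(dirfd, target, dirfd, newname, AT_SYMLINK_FOLLOW) != 0)
        return -errno;
    return 0;
}

/* Hypothetical helper: report whether `name` is itself a symbolic link.
 * fstatat() understands AT_SYMLINK_NOFOLLOW, an AT_ flag that linkat()
 * rejects. Returns 1, 0, or -errno. */
static int is_symlink_at(int dirfd, const char *name)
{
    struct stat st;
    if (fstatat(dirfd, name, &st, AT_SYMLINK_NOFOLLOW) != 0)
        return -errno;
    return S_ISLNK(st.st_mode) ? 1 : 0;
}
```

Passing AT_SYMLINK_NOFOLLOW to linkat() instead fails with EINVAL, since that flag is outside the set linkat() recognizes.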
Then there is openat2(), which is coming with the 5.6 kernel. Rather than having a separate argument for flags, this system call requires a pointer to an open_how structure:
    struct open_how {
        __u64 flags;
        __u64 mode;
        __u64 resolve;
    };
Here, flags contains the O_ flags common to the open() family, while resolve contains yet another set of flags (the "RESOLVE_ flags"). These include RESOLVE_BENEATH to limit the lookup to files below the provided directory and RESOLVE_NO_SYMLINKS, which is kind of like O_NOFOLLOW or AT_SYMLINK_NOFOLLOW but different: it blocks symbolic-link traversal in every component of the pathname, rather than just the final one.
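Since many glibc versions provide no openat2() wrapper, a minimal sketch has to invoke the system call directly. It assumes Linux 5.6 or later, copies the three-field layout of struct open_how under the hypothetical name open_how_sketch, and falls back to 437 as the syscall number and 0x04 as the value of RESOLVE_NO_SYMLINKS if the system headers do not define them. It contrasts O_NOFOLLOW, which only checks the final component, with RESOLVE_NO_SYMLINKS, which rejects a symbolic link anywhere in the path.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Same layout as the kernel's struct open_how (three 64-bit fields); the
 * name open_how_sketch is hypothetical, to avoid clashing with any header
 * that already defines struct open_how. */
struct open_how_sketch {
    uint64_t flags;    /* O_ flags, as for open() */
    uint64_t mode;     /* permission bits; must be 0 without O_CREAT/O_TMPFILE */
    uint64_t resolve;  /* RESOLVE_ flags controlling path lookup */
};

/* Assumed fallbacks for older headers: 437 is the generic syscall number
 * (x86-64, arm64 and others); 0x04 is the uapi value of RESOLVE_NO_SYMLINKS. */
#ifndef SYS_openat2
#define SYS_openat2 437
#endif
#ifndef RESOLVE_NO_SYMLINKS
#define RESOLVE_NO_SYMLINKS 0x04
#endif

/* Calls openat2() directly via syscall(2). Returns fd or -errno. */
static int openat2_sketch(int dirfd, const char *path,
                          uint64_t flags, uint64_t resolve)
{
    struct open_how_sketch how;
    memset(&how, 0, sizeof how);  /* unused fields must be zero */
    how.flags = flags;
    how.resolve = resolve;
    long fd = syscall(SYS_openat2, dirfd, path, &how, sizeof how);
    return fd >= 0 ? (int)fd : -errno;
}

/* For comparison: O_NOFOLLOW checks only the final path component. */
static int open_nofollow_at(int dirfd, const char *path)
{
    int fd = openat(dirfd, path, O_RDONLY | O_NOFOLLOW);
    return fd >= 0 ? fd : -errno;
}
```

With a symbolic link to a directory in the middle of the path, open_nofollow_at() succeeds, while openat2_sketch() with RESOLVE_NO_SYMLINKS fails with ELOOP. On older kernels, or under seccomp filters that do not know about openat2(), the call may report ENOSYS or EPERM instead.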
LWN has occasionally covered the ongoing story of the proposed fsinfo() system call, which provides information about mounted filesystems. This new API also includes a structure pointer as one of its parameters:
    struct fsinfo_params {
        __u32 at_flags;
        __u32 flags;
        __u32 request;
        __u32 Nth;
        __u32 Mth;
        __u64 __reserved[3];
    };
Here, at_flags is, as one would expect, a set of AT_ flags, while flags is yet another set of flags specific to this system call. Recently, though, fsinfo() author David Howells noted that he had been told that RESOLVE_ flags should be used in preference to AT_ flags in all new system calls, and asked whether the AT_ flags should be considered deprecated. He followed up with a patch marking the AT_ flags as being deprecated and adding new RESOLVE_ flags to cover behaviors that can currently only be requested by AT_ flags. So, for example, he added RESOLVE_NO_TERMINAL_SYMLINKS (later renamed RESOLVE_NO_TRAILING_SYMLINKS) to request the same semantics as AT_SYMLINK_NOFOLLOW.
Christian Brauner argued in favor of moving to RESOLVE_ flags, noting that some of the semantics that are only available via those flags may be of use in settings beyond openat2(). He did allow, though, that "we might end up causing more confusion for userspace due to yet another set of flags", though others might argue that it's a bit late to worry about that at this point.
Linus Torvalds, though, is not a fan of the plan to deprecate the AT_ flags; he noted that software will continue to use flags like O_NOFOLLOW or AT_SYMLINK_NOFOLLOW, so they can't go away. He added:
Adding multiple flags that do the same thing leads to complexity and confusion, he said; one might thus conclude that any such patch is unlikely to make it into the mainline. He later said that, if fsinfo() needs features controlled by both AT_ and RESOLVE_ flags, it should accept both; that, along with the flags specific to that system call, adds up to three different sets of flags for one call. One could reasonably conclude that if, for example, openat2() were to implement a feature controlled by an AT_ flag, it would have to accept a third set of flags as well.
So the situation may indeed be "sad and messy", but it doesn't appear that it will be getting any less messy anytime soon. Perhaps one of the messiest aspects of this API is that there is no type checking for any of these flags fields. Nothing but due care prevents a developer from setting a flag in the wrong field. That one may be hard to correct in a backward-compatible way, even if somebody were to be motivated to do it. It is not the biggest mess to be found in our APIs; we'll continue to muddle on with things as they are.
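To make the type-checking problem concrete, here is a minimal sketch assuming Linux on x86-64 or arm64, where AT_SYMLINK_NOFOLLOW (0x100) has the same value as O_NOCTTY: passing the AT_ flag to open() compiles without complaint and quietly follows the link. The wrapper types o_flags_t and at_flags_t are hypothetical, showing one way distinct types could turn such a mix-up into a compile-time error; they are not part of any real API.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Deliberate bug: an AT_ flag passed where open() expects O_ flags. Both
 * are plain ints, so the compiler cannot object. On x86-64 and arm64,
 * AT_SYMLINK_NOFOLLOW (0x100) equals O_NOCTTY, so open() treats it as that
 * flag and follows the symbolic link anyway. Returns fd or -errno. */
static int open_wrong_namespace(const char *path)
{
    int fd = open(path, O_RDONLY | AT_SYMLINK_NOFOLLOW);
    return fd >= 0 ? fd : -errno;
}

/* Hypothetical mitigation: give each namespace its own type, so mixing
 * them is a compile-time error rather than a silent reinterpretation. */
typedef struct { int bits; } o_flags_t;
typedef struct { int bits; } at_flags_t;

static int open_typed(const char *path, o_flags_t f)
{
    int fd = open(path, f.bits);
    return fd >= 0 ? fd : -errno;
}

/* open_typed(path, (at_flags_t){ AT_SYMLINK_NOFOLLOW }) does not compile:
 * at_flags_t is incompatible with the o_flags_t parameter. */
```

Retrofitting something like this onto the existing system calls would change their C prototypes, which is part of why the problem is hard to fix compatibly.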
Index entries for this article
Kernel: System calls
Posted Mar 17, 2020 1:15 UTC (Tue)
by areilly (subscriber, #87829)
Notionally also why the BSD executable versioning mechanism was introduced, although I don't know that that has ever been used as a mechanism to "tidy up" or unify older syscalls, rather than just providing OS emulation capabilities. The possibility exists though.
I like the way that the Racket (Scheme) system handles language features and backwards compatibility: language version is an explicit part of the code preamble, which allows new code to link against old libraries without requiring the old libraries to be modified to match the current-version syntax or semantics, something that neither C++ nor Python has managed during its history. All such a system requires is that whatever the new interfaces are, they must be capable of providing the original semantics somehow, so that the interface shim layer can re-implement the old API in terms of the new. Clearly you want a modicum of stability too, to minimize the number of old shims that you have to maintain.
Posted Mar 17, 2020 1:56 UTC (Tue)
by wahern (subscriber, #37304)
Posted Mar 17, 2020 2:19 UTC (Tue)
by tux3 (subscriber, #101245)
Afaik, Debian ships rust programs, and if I know anything about Debian packaging, static linking is not even close to being an option =]
Posted Mar 17, 2020 2:34 UTC (Tue)
by Conan_Kudo (subscriber, #103240)
Unfortunately, you'd be wrong. Debian ships packages with piles of source code, just as Fedora does. Applications statically link everything, because otherwise every rebuild of the compiler would necessitate rebuilding everything. It's just not practical. Maybe one day, the Rust community will care about us and work toward defining a native stable ABI. But I won't hold my breath. The Rust community thinks it's okay to have to constantly build everything for every change, despite the huge downsides.
Posted Mar 17, 2020 10:25 UTC (Tue)
by roc (subscriber, #30627)
In theory we could escape the dilemma by creating a stable ABI for shared libraries that you opt into at build time that would mostly be only for Linux distros. But even that would constrain language and library evolution as well as being a ton of work that no-one is really motivated to do.
Posted Mar 17, 2020 11:53 UTC (Tue)
by Wol (subscriber, #4433)
That way, you can extend the function to fill the state table as and when, without having to redesign the interface.
Cheers,
Posted Mar 17, 2020 21:21 UTC (Tue)
by quotemstr (subscriber, #45331)
COM solved that problem decades ago. We should seriously consider adopting something a lot like it. A stable object ABI that allows for both efficient intraprocess calling and extensible interprocess remoting is extremely powerful.
Posted Mar 18, 2020 8:53 UTC (Wed)
by roc (subscriber, #30627)
Posted Mar 18, 2020 11:07 UTC (Wed)
by k3ninho (subscriber, #50375)
*: common object request broker isn't a model, it's an architecture ;-)
K3n.
Posted Mar 17, 2020 21:33 UTC (Tue)
by rvolgers (guest, #63218)
Rust does link dynamically to libc, and many Rust programs link to e.g. OpenSSL because Rust has good support for using dynamically linked C libraries. In fact, there are dynamic libraries with a C ABI that are implemented in Rust (librsvg comes to mind).
Rust has really good support for dynamic linking! It just doesn't have good support for dynamic linking using its *native ABI*. You could look at this as discouraging dynamic linking, but you can also look at it as encouraging dynamic linking that integrates well with the rest of the open source ecosystem by using the C ABI as a universal interface.
Also, a ton of Rust code is just not desirable to dynamically link, ever. We could do a cute experiment and compile some popular Rust programs while absolutely forbidding the compiler to inline functions between different crates (i.e. "libraries"). Pretty sure that will cause a code size explosion and speed reduction that will make people scream a lot louder than using a couple more kb of disk space.
Posted Mar 17, 2020 21:44 UTC (Tue)
by rvolgers (guest, #63218)
Consider for example the Iterator trait in Rust. People expect code written using iterators to compile down to something that you would find hard to distinguish from a C for loop in disassembly, which requires the compiler to inline a whole bunch of calls to tiny functions and remove some intermediate values. And not all those tiny functions have to come from the same library, they can come from many different ones, and many will have generic arguments or callbacks with generic arguments from still other libraries.
And it's not just Iterator, the same goes for asynchronous I/O using Futures, and probably more absolutely core functionality that I'm forgetting about right now. As soon as parts of that become dynamically linked, you start having to make some really tough calls about what the compiler can statically assume and optimize out.
Posted Mar 17, 2020 21:47 UTC (Tue)
by areilly (subscriber, #87829)
Posted Mar 20, 2020 10:15 UTC (Fri)
by jezuch (subscriber, #52988)
(Disclaimer: I speak from theory, not practice, so I may be more than a little wrong :) )
Posted Mar 17, 2020 21:45 UTC (Tue)
by areilly (subscriber, #87829)
Posted Mar 17, 2020 5:53 UTC (Tue)
by Cyberax (✭ supporter ✭, #52523)
Posted Mar 17, 2020 6:00 UTC (Tue)
by areilly (subscriber, #87829)
Posted Mar 17, 2020 6:18 UTC (Tue)
by Cyberax (✭ supporter ✭, #52523)
You can create a set of new syscalls with sane flags: openat_rational, link_rational, open_I_really_mean_it's_not_broken_this_time and so on. You then expose these new syscalls with their wonderful flags through libc, libc can also provide their emulation for the older kernels that lack the new syscalls.
Then after 20 years or so you can remove the old flags from libc, so that new code will be able to use the new flags. Then after another 10 years or so, the old syscalls can be removed from the kernel.
Posted Mar 17, 2020 6:29 UTC (Tue)
by areilly (subscriber, #87829)
Posted Mar 17, 2020 6:33 UTC (Tue)
by Cyberax (✭ supporter ✭, #52523)
Posted Mar 17, 2020 6:37 UTC (Tue)
by josh (subscriber, #17465)
Posted Mar 17, 2020 9:51 UTC (Tue)
by smurf (subscriber, #17840)
A corollary is that statically-linked programs may or may not continue to work when you update your kernel, a notion which Linus emphatically rejects.
Posted Mar 17, 2020 10:25 UTC (Tue)
by josh (subscriber, #17465)
That said, it'd be interesting if we had a slightly more extensible syscall layer that could tell when an argument was passed or not passed, which would allow extending existing syscalls without having to create new ones.
It's looking increasingly like io_uring might be that extensible syscall layer.
Posted Mar 19, 2020 20:55 UTC (Thu)
by BenHutchings (subscriber, #37955)
Posted Mar 23, 2020 15:27 UTC (Mon)
by gray_-_wolf (subscriber, #131074)
sometimes... would be nice if it never did but that is sadly not the case :/
Posted Mar 17, 2020 6:47 UTC (Tue)
by areilly (subscriber, #87829)
The BSD versioned syscalls are in the kernel (so you can still have static executables), but they can be supplied by loadable kernel modules (as the linux and SCO syscalls are/were), which can eventually be deprecated or not loaded as suits the use-case, without getting (too much) in the way of the "fresh" syscall API.
Posted Mar 17, 2020 6:53 UTC (Tue)
by josh (subscriber, #17465)
How does Solaris provide that to userspace? Similar to the VDSO, or via a library provided on the filesystem that calls an unstable kernel interface?
Posted Mar 17, 2020 7:41 UTC (Tue)
by areilly (subscriber, #87829)
Posted Mar 18, 2020 21:37 UTC (Wed)
by justincormack (subscriber, #70439)
OpenBSD has been taking this model to a more modern design, where libc is blessed, and only it can make syscalls, by having a special attribute set. This is designed as a security measure, to stop arbitrary code using syscalls.
Posted Mar 17, 2020 7:02 UTC (Tue)
by areilly (subscriber, #87829)
I know that rust is versioning its releases too, but I don't know whether that actually allows for the linking and use of code written against different language versions, the way racket does. Racket code can import and use r5rs or r6rs or experimental-dialect code, which is cool. I suppose that C++ can do similarly for separately compiled object files, but it can't include old headers into new code, and python3 can't import python2 modules, which IMO is a terrible shame.
Posted Mar 17, 2020 15:15 UTC (Tue)
by farnz (subscriber, #17727)
The equivalent in Rust is Editions; you can freely link code between different editions (currently only 2015 and 2018), but the compiler will translate each translation unit (crate in Rust) according to the edition you have specified for that crate.
Nothing, however, stops a Rust 2015 crate using a Rust 2018 crate as a dependency, or vice-versa, and you can freely share data types between the two editions. The only problem is that you might have to use r#identifier syntax if one crate uses a reserved word as an identifier.
Posted Mar 17, 2020 10:42 UTC (Tue)
by roc (subscriber, #30627)
If stability guarantees applied at a shared library boundary like Solaris and Windows, then rr would have to choose between manipulating the unstable syscall interface or manipulating the shared library interface. The former would increase the rr maintenance burden considerably since we'd have to support every version of the syscall interface. The latter is difficult to do in a watertight way. Similar considerations would apply to strace etc.
I think it's also a great feature to have your stable ABI boundary enforced by hardware. On Windows people reverse engineer the syscalls and sometimes call them directly, bypassing the "stable ABI"; it's great that on Linux you simply *can't* bypass it.
Hardware being aware of the ABI boundary has other more esoteric benefits. For example rr needs to count performance events happening outside "the kernel"; on Linux the CPU supports us doing that, but on Windows/Solaris it doesn't if you consider that shared library to be "the kernel".
Posted Mar 17, 2020 21:21 UTC (Tue)
by sorokin (guest, #88478)
Wouldn't presence of vdso pose the same problem as the stability guarantee applied at shared library boundary?
Posted Mar 18, 2020 8:54 UTC (Wed)
by roc (subscriber, #30627)
Posted Mar 17, 2020 18:42 UTC (Tue)
by ale2018 (guest, #128727)
I would have said that the situation would look less messy if similar flags were set at the same bit. Having 64-bit flags seems to allow it. Hmm...
Posted Mar 20, 2020 12:43 UTC (Fri)
by draco (subscriber, #1792)
This is notably more unstable than the C++ ABI, which (Visual Studio excepted) only breaks for major events like the C++11 release.
Wol
"Do you speak the ABI of versions in this range?"
"Not all of them, I can fall back to v.A.B.C as most recent. Is that OK?"
"Confirmed OK."
I view this trend as a good thing, btw. The modern languages have a lot going for them, and shared libraries really don't.
How? There will still be software that uses old flags, for the foreseeable future. You'll have to provide their emulation _somewhere_.
/usr/include/linux/fcntl.h:#define AT_SYMLINK_NOFOLLOW 0x100 /* Do not follow symbolic links. */
/usr/include/asm-generic/fcntl.h:#define O_NOFOLLOW 00400000 /* don't follow links */
/usr/include/x86_64-linux-gnu/bits/fcntl-linux.h:# define AT_SYMLINK_NOFOLLOW 0x100 /* Do not follow symbolic links. */
/usr/include/x86_64-linux-gnu/sys/mount.h: UMOUNT_NOFOLLOW = 8 /* Don't follow symlink on umount. */