Filesystem-oriented flags: sad, messy and not going away

By Jonathan Corbet
March 16, 2020
Over the last decade, the addition of a "flags" argument to all new system calls, even if no flags are actually needed at the outset, has been widely adopted as a best practice. The result has certainly been greater API extensibility, but we have also seen a proliferation of various types of flags for related system calls. For calls related to files and filesystems, in particular, the available flags have reached a point where some calls will need as many as three arguments for them rather than just one.

One set of filesystem-oriented flags will be familiar to almost anybody who has worked with the Unix system-call API: the O_ flags supported by calls like open(). These flags affect how the call operates in a number of ways; O_CREAT will cause the named file to be created if it does not already exist, O_NOFOLLOW causes the open to fail if the final component in the name is a symbolic link, O_NONBLOCK requests non-blocking operation, and so on. Some of those flags affect the lookup process (O_NOFOLLOW, for example) while others, like O_NONBLOCK, affect how the file descriptor created by the call will behave. All are part of one flag namespace that is recognized by all of the open() family of system calls.
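As a rough illustration of how lookup-time flags and descriptor-behavior flags share one argument, consider the sketch below. The helper names create_nonblocking() and fd_is_nonblocking() are made up for this example; only open() and fcntl() are real interfaces.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: one flags word mixes three kinds of request.
 *   O_CREAT    - create the file if it does not exist (creation)
 *   O_NOFOLLOW - fail if the final component is a symlink (lookup)
 *   O_NONBLOCK - non-blocking operation (descriptor behavior)
 */
int create_nonblocking(const char *path)
{
    assert(path != NULL);
    return open(path, O_RDWR | O_CREAT | O_NOFOLLOW | O_NONBLOCK, 0600);
}

/* Hypothetical helper: returns 1 if O_NONBLOCK is set on fd, 0 if it is
 * not, and -1 on error. The flag stays with the open file and can be
 * read back with F_GETFL. */
int fd_is_nonblocking(int fd)
{
    int fl = fcntl(fd, F_GETFL);

    if (fl == -1)
        return -1;
    return (fl & O_NONBLOCK) ? 1 : 0;
}
```

A caller would pass the returned descriptor to fd_is_nonblocking() to confirm that the behavioral flag outlived the open() call itself.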

open() is one way to create a new entry in a directory; link() is another. When the time came to add flags to link(), the linkat() system call was born; this system call also follows the other relatively new pattern of accepting a file descriptor for the directory in which the operation is to be performed. linkat() has a separate flag namespace (the "AT_ flags") with flags like AT_SYMLINK_FOLLOW, which is the opposite of O_NOFOLLOW. There is also an AT_SYMLINK_NOFOLLOW that is not recognized by linkat(), but which is understood by calls like fchmodat() and execveat(). There are more AT_ flags, such as AT_NO_AUTOMOUNT, supported by the relatively new statx() system call.
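The two opposite-sounding AT_ flags exist because the calls involved have opposite defaults: linkat() does not dereference a symbolic link unless asked to, while fstatat() follows links unless told otherwise. A minimal sketch, with the helper names link_through_symlink() and is_symlink_at() invented for illustration:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>      /* AT_SYMLINK_FOLLOW, AT_SYMLINK_NOFOLLOW */
#include <sys/stat.h>   /* fstatat(), S_ISLNK() */
#include <unistd.h>     /* linkat() */

/* Hypothetical helper: create newname as a hard link to whatever
 * oldname points at, both names being relative to dirfd.
 * AT_SYMLINK_FOLLOW asks linkat() to dereference oldname first. */
int link_through_symlink(int dirfd, const char *oldname, const char *newname)
{
    assert(oldname != NULL && newname != NULL);
    return linkat(dirfd, oldname, dirfd, newname, AT_SYMLINK_FOLLOW);
}

/* Hypothetical helper: returns 1 if name (relative to dirfd) is itself a
 * symbolic link, 0 if it is not, -1 on error. AT_SYMLINK_NOFOLLOW keeps
 * fstatat() from following the link and reporting on its target. */
int is_symlink_at(int dirfd, const char *name)
{
    struct stat st;

    assert(name != NULL);
    if (fstatat(dirfd, name, &st, AT_SYMLINK_NOFOLLOW) == -1)
        return -1;
    return S_ISLNK(st.st_mode) ? 1 : 0;
}
```

Note that both helpers also take the directory file descriptor that the *at() family introduced; AT_FDCWD can be passed to mean the current working directory.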

Then there is openat2(), which is coming with the 5.6 kernel. Rather than having a separate argument for flags, this system call requires a pointer to an open_how structure:

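    struct open_how {
        __u64 flags;    /* O_ flags, as for open() */
        __u64 mode;     /* file mode, used when a file is created */
        __u64 resolve;  /* RESOLVE_ flags controlling path resolution */
    };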

Here, flags contains the O_ flags common to the open() family, while resolve contains yet another set of flags (the "RESOLVE_ flags"). These include RESOLVE_BENEATH to limit the lookup to files below the provided directory and RESOLVE_NO_SYMLINKS, which is kind of like O_NOFOLLOW or AT_SYMLINK_NOFOLLOW but different: it blocks symbolic-link traversal at all stages of pathname traversal, rather than just for the final component.
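A call using both fields might look something like the following sketch. It makes several assumptions: that the C library offers no openat2() wrapper (so syscall() is used directly), that the linux/openat2.h header may be missing on systems with pre-5.6 kernel headers, and that open_beneath() is simply a name invented here.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* struct open_how and the RESOLVE_ flags come from a kernel header that
 * only exists with 5.6 or later headers, so probe for it. */
#if defined(__has_include)
#  if __has_include(<linux/openat2.h>)
#    include <linux/openat2.h>
#    define HAVE_OPENAT2_H 1
#  endif
#endif

/* Hypothetical helper: open path read-only relative to dirfd, refusing
 * to leave the directory tree below dirfd or to cross any symlink.
 * Returns a file descriptor, or -1 with errno set (ENOSYS if openat2()
 * is not available at build time). */
int open_beneath(int dirfd, const char *path)
{
    assert(path != NULL);
#if defined(HAVE_OPENAT2_H) && defined(SYS_openat2)
    struct open_how how;

    memset(&how, 0, sizeof(how));       /* unused fields must be zero */
    how.flags = O_RDONLY | O_CLOEXEC;   /* the familiar O_ namespace */
    how.resolve = RESOLVE_BENEATH | RESOLVE_NO_SYMLINKS;

    /* The size argument tells the kernel which version of the
     * structure the caller was built against. */
    return (int)syscall(SYS_openat2, dirfd, path, &how, sizeof(how));
#else
    (void)dirfd;
    errno = ENOSYS;
    return -1;
#endif
}
```

With a helper like this, a path such as ".." should be rejected rather than resolved, since it would escape the starting directory; on a kernel without openat2() the call fails with ENOSYS instead.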

LWN has occasionally covered the ongoing story of the proposed fsinfo() system call, which provides information about mounted filesystems. This new API also includes a structure pointer as one of its parameters:

    struct fsinfo_params {
        __u32 at_flags;
        __u32 flags;
        __u32 request;
        __u32 Nth;
        __u32 Mth;
        __u64 __reserved[3];
    };

Here, at_flags is, as one would expect, a set of AT_ flags, while flags is yet another set of flags specific to this system call. Recently, though, fsinfo() author David Howells noted that he had been told that RESOLVE_ flags should be used in preference to AT_ flags in all new system calls, and asked whether the AT_ flags should be considered deprecated. He followed up with a patch marking the AT_ flags as being deprecated and adding new RESOLVE_ flags to cover behaviors that can currently only be requested by AT_ flags. So, for example, he added RESOLVE_NO_TERMINAL_SYMLINKS (later renamed RESOLVE_NO_TRAILING_SYMLINKS) to request the same semantics as AT_SYMLINK_NOFOLLOW.

Christian Brauner argued in favor of moving to RESOLVE_ flags, noting that some of the semantics that are only available via those flags may be of use in settings beyond openat2(). He did allow, though, that "we might end up causing more confusion for userspace due to yet another set of flags"; others might argue that it's a bit late to worry about that at this point.

Linus Torvalds, though, is not a fan of the plan to deprecate the AT_ flags; he noted that software will continue to use flags like O_NOFOLLOW or AT_SYMLINK_NOFOLLOW, so they can't go away. He added:

And yes, the fact that we then have three different user-visible namespaces (O_xyz flags for open(), AT_xyz flags for linkat(), and now RESOLVE_xyz flags for openat2()) is sad and messy. But it's an inherent messiness from just how the world works. We can't get rid of it.

Adding multiple flags that do the same thing leads to complexity and confusion, he said; one might thus conclude that any such patch is unlikely to make it into the mainline. He later said that, if fsinfo() needs features controlled by both AT_ and RESOLVE_ flags, it should accept both; that, along with the flags specific to that system call, adds up to three different sets of flags for one call. One could reasonably conclude that if, for example, openat2() were to implement a feature controlled by an AT_ flag, it would have to accept a third set of flags as well.

So the situation may indeed be "sad and messy", but it doesn't appear that it will be getting any less messy anytime soon. Perhaps one of the messiest aspects of this API is that there is no type checking for any of these flags fields. Nothing but due care prevents a developer from setting a flag in the wrong field. That one may be hard to correct in a backward-compatible way, even if somebody were to be motivated to do it. It is not the biggest mess to be found in our APIs; we'll continue to muddle on with things as they are.
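To make the hazard concrete, here is a small sketch of a flag from one namespace landing in another. The function names are invented for the example. The behavior described assumes x86-64 or another architecture using the generic flag values, where, as far as one can tell from the headers, the bit used by AT_SYMLINK_NOFOLLOW (0x100) is the same bit as O_NOCTTY.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Deliberately wrong: AT_SYMLINK_NOFOLLOW is an AT_ flag, not an O_
 * flag. Both are plain integers, so the compiler accepts this without
 * complaint. On the architectures assumed above, open() sees the bit as
 * O_NOCTTY and follows a trailing symlink anyway. */
int open_wrong_namespace(const char *path)
{
    assert(path != NULL);
    return open(path, O_RDONLY | AT_SYMLINK_NOFOLLOW);
}

/* What the developer meant: O_NOFOLLOW makes open() fail with ELOOP if
 * the final component of path is a symbolic link. */
int open_right_namespace(const char *path)
{
    assert(path != NULL);
    return open(path, O_RDONLY | O_NOFOLLOW);
}
```

Given a path naming a symbolic link to a regular file, the second function fails with ELOOP while the first quietly succeeds, which is exactly the kind of mistake that nothing in the API catches.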

Filesystem-oriented flags: sad, messy and not going away

Posted Mar 17, 2020 1:15 UTC (Tue) by areilly (subscriber, #87829) [Link] (32 responses)

No doubt this sort of conflicting accretion of symbols and functionality is why Solaris eventually isolated its syscall interface with a shared library: possibility exists to support old code with compatibility versions while keeping kernel APIs flexible and cruft free. Not that I've ever looked at the Solaris kernel APIs to see if cruft was actually reduced.

Notionally also why the BSD executable versioning mechanism was introduced, although I don't know that that has ever been used as a mechanism to "tidy up" or unify older syscalls, rather than just providing OS emulation capabilities. The possibility exists though.

I like the way that the Rust (scheme) system handles language features and backwards compatibility: language version is an explicit part of the code preamble, which allows new code to link against old libraries without requiring the old libraries to be modified to match the current-version syntax or semantics. Something that neither C++ nor Python have managed during their history. All such a system requires is that whatever the new interfaces are, they must be capable of providing the original semantics somehow, so that the interface shim layer can re-implement the old API in terms of the new. Clearly you want a modicum of stability too, to minimize the number of old shims that you have to maintain.

Filesystem-oriented flags: sad, messy and not going away

Posted Mar 17, 2020 1:56 UTC (Tue) by wahern (subscriber, #37304) [Link] (12 responses)

Rust also only supports static linking (simple C ABI interfaces notwithstanding), so as compared to C++ it's an easier problem space.

Filesystem-oriented flags: sad, messy and not going away

Posted Mar 17, 2020 2:19 UTC (Tue) by tux3 (subscriber, #101245) [Link] (11 responses)

Rust does dynamic linking (and not just C), the caveat I think is that the ABI is not stable between compiler releases.
This is notably more unstable than the C++ ABI, which (visual studio excepted) only breaks for major events like the C++11 release.

Afaik, Debian ships rust programs, and if I know anything about Debian packaging, static linking is not even close to being an option =]

Filesystem-oriented flags: sad, messy and not going away

Posted Mar 17, 2020 2:34 UTC (Tue) by Conan_Kudo (subscriber, #103240) [Link] (10 responses)

> Afaik, Debian ships rust programs, and if I know anything about Debian packaging, static linking is not even close to being an option =]

Unfortunately, you'd be wrong. Debian ships packages with piles of source code, just as Fedora does. Applications statically link everything, because otherwise every rebuild of the compiler would necessitate rebuilding everything. It's just not practical. Maybe one day, the Rust community will care about us and work toward defining a native stable ABI. But I won't hold my breath. The Rust community thinks it's okay to have to constantly build everything for every change, despite the huge downsides.

Filesystem-oriented flags: sad, messy and not going away

Posted Mar 17, 2020 10:25 UTC (Tue) by roc (subscriber, #30627) [Link] (4 responses)

The stability of the C++ ABI also has massive downsides: https://cor3ntin.github.io/posts/abi/

In theory we could escape the dilemma by creating a stable ABI for shared libraries that you opt into at build time that would mostly be only for Linux distros. But even that would constrain language and library evolution as well as being a ton of work that no-one is really motivated to do.

Filesystem-oriented flags: sad, messy and not going away

Posted Mar 17, 2020 11:53 UTC (Tue) by Wol (subscriber, #4433) [Link]

The problem is too many developers only develop to solve the immediate problem. Spend a bit of time to define the *general* problem, define a state table and design the API to solve said state table, and then by all means just solve your bit of it.

That way, you can extend the function to fill the state table as and when, without having to redesign the interface.

Cheers,
Wol

Filesystem-oriented flags: sad, messy and not going away

Posted Mar 17, 2020 21:21 UTC (Tue) by quotemstr (subscriber, #45331) [Link] (2 responses)

> In theory we could escape the dilemma by creating a stable ABI for shared libraries

COM solved that problem decades ago. We should seriously consider adopting something a lot like it. A stable object ABI that allows for both efficient intraprocess calling and extensible interprocess remoting is extremely powerful.

Filesystem-oriented flags: sad, messy and not going away

Posted Mar 18, 2020 8:53 UTC (Wed) by roc (subscriber, #30627) [Link] (1 responses)

Lowest-common-denominator ABIs like COM are awful to work with.

Filesystem-oriented flags: sad, messy and not going away

Posted Mar 18, 2020 11:07 UTC (Wed) by k3ninho (subscriber, #50375) [Link]

We could manage versioning with a request broker*.
"Do you speak the ABI of versions in this range?"
"Not all of them, I can fall back to v.A.B.C as most recent. Is that OK?"
"Confirmed OK."

*: common object request broker isn't a model, it's an architecture ;-)

K3n.

Filesystem-oriented flags: sad, messy and not going away

Posted Mar 17, 2020 21:33 UTC (Tue) by rvolgers (guest, #63218) [Link] (4 responses)

That "huge downsides" article is not entirely fair perhaps.

Rust does link dynamically to libc, and many Rust programs link to e.g. OpenSSL because Rust has good support for using dynamically linked C libraries. In fact, there are dynamic libraries with a C ABI that are implemented in Rust (librsvg comes to mind).

Rust has really good support for dynamic linking! It just doesn't have good support for dynamic linking using its *native ABI*. You could look at this as discouraging dynamic linking, but you can also look at it as encouraging dynamic linking that integrates well with the rest of the open source ecosystem by using the C ABI as a universal interface.

Also, a ton of Rust code is just not desirable to dynamically link, ever. We could do a cute experiment and compile some popular Rust programs while absolutely forbidding the compiler to inline functions between different crates (i.e. "libraries"). Pretty sure that will cause a code size explosion and speed reduction that will make people scream a lot louder than using a couple more kb of disk space.

Filesystem-oriented flags: sad, messy and not going away

Posted Mar 17, 2020 21:44 UTC (Tue) by rvolgers (guest, #63218) [Link] (2 responses)

To clarify my point a bit more: the way native Rust APIs are written is just very different from how C APIs are written. Due to the rich type system you can have a lot more back-and-forth between a library and its consumers than you would have in a C API.

Consider for example the Iterator trait in Rust. People expect code written using iterators to compile down to something that you would find hard to distinguish from a C for loop in disassembly, which requires the compiler to inline a whole bunch of calls to tiny functions and remove some intermediate values. And not all those tiny functions have to come from the same library, they can come from many different ones, and many will have generic arguments or callbacks with generic arguments from still other libraries.

And it's not just Iterator, the same goes for asynchronous I/O using Futures, and probably more absolutely core functionality that I'm forgetting about right now. As soon as parts of that become dynamically linked, you start having to make some really tough calls about what the compiler can statically assume and optimize out.

Filesystem-oriented flags: sad, messy and not going away

Posted Mar 17, 2020 21:47 UTC (Tue) by areilly (subscriber, #87829) [Link]

Agree completely. You just beat me to it!

Filesystem-oriented flags: sad, messy and not going away

Posted Mar 20, 2020 10:15 UTC (Fri) by jezuch (subscriber, #52988) [Link]

The rich type system also means that there are a lot of generic functions taking traits as inputs, which are not "real" types. So either you have a lot of monomorphization ("instantiation" in terms of C++ templates) where you substitute real types for the parameters, or you have dynamic dispatch, which is, well, akin to unconditional surrender. Forcing dynamic linking also forces the latter, and this has a huge impact on performance, maybe as much as impeded inlining has.

(Disclaimer: I speak from theory, not practice, so I may be more than a little wrong :) )

Filesystem-oriented flags: sad, messy and not going away

Posted Mar 17, 2020 21:45 UTC (Tue) by areilly (subscriber, #87829) [Link]

In these days of Flatpak and image-based application distribution, where applications ship with private versions of all of the shared libraries that they use, it's easy to argue that the days of shared libraries being particularly useful, at least for supposed benefits of disk space or memory space savings, are long gone. From a language point of view, the model of separately compiled object files is too restrictive, and too much of a barrier to efficient abstraction. Most modern languages have a whole-program compilation model, and that includes C++, except for the cases where modules effectively isolate themselves behind a C API. (Go, rust, julia, haskell, all of the lisps...)
I view this trend as a good thing, btw. The modern languages have a lot going for them, and shared libraries really don't.

Filesystem-oriented flags: sad, messy and not going away

Posted Mar 17, 2020 5:53 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (13 responses)

The problem is that these particular flags mirror the userspace. You will inevitably have the same mess _somewhere_, be it libc or the kernel.

Filesystem-oriented flags: sad, messy and not going away

Posted Mar 17, 2020 6:00 UTC (Tue) by areilly (subscriber, #87829) [Link] (12 responses)

Yes, at the moment, but API versioning is the tool that you need in order to rationalize and tidy-up down the track, without breaking the code that will inevitably have been built against the current, messy set.

Filesystem-oriented flags: sad, messy and not going away

Posted Mar 17, 2020 6:18 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (11 responses)

This is doable in Linux.

You can create a set of new syscalls with sane flags: openat_rational, link_rational, open_I_really_mean_it's_not_broken_this_time and so on. You then expose these new syscalls with their wonderful flags through libc, libc can also provide their emulation for the older kernels that lack the new syscalls.

Then after 20 years or so you can remove the old flags from libc, so that new code will be able to use the new flags. Then after another 10 years or so, the old syscalls can be removed from the kernel.

Filesystem-oriented flags: sad, messy and not going away

Posted Mar 17, 2020 6:29 UTC (Tue) by areilly (subscriber, #87829) [Link] (10 responses)

Yes, but the Solaris and BSD versioned syscall approach allows you to remove the old syscalls and their messy flags immediately, reducing complexity, size and technical debt in the kernel. For the extra win, the piece of libc code that has to provide the old API in terms of the new becomes a nice modular piece of API history that documents the change.

Filesystem-oriented flags: sad, messy and not going away

Posted Mar 17, 2020 6:33 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (9 responses)

> Yes, but the Solaris and BSD versioned syscall approach allows you to remove the old syscalls and their messy flags immediately
How? There will still be software that uses old flags, for the foreseeable future. You'll have to provide their emulation _somewhere_.

Filesystem-oriented flags: sad, messy and not going away

Posted Mar 17, 2020 6:37 UTC (Tue) by josh (subscriber, #17465) [Link] (4 responses)

On BSD, the syscall interface is subject to change, while the libc interface is stable. So it'd be libc's job to implement the old interface in terms of the new one.

Filesystem-oriented flags: sad, messy and not going away

Posted Mar 17, 2020 9:51 UTC (Tue) by smurf (subscriber, #17840) [Link] (3 responses)

This means that BSD's libc is tied to the kernel. Linux' libc is not.

A corollary is that statically-linked programs may or may not continue to work when you update your kernel, a notion which Linus emphatically rejects.

Filesystem-oriented flags: sad, messy and not going away

Posted Mar 17, 2020 10:25 UTC (Tue) by josh (subscriber, #17465) [Link]

I agree, and I prefer the Linux approach.

That said, it'd be interesting if we had a slightly more extensible syscall layer that could tell when an argument was passed or not passed, which would allow extending existing syscalls without having to create new ones.

It's looking increasingly like io_uring might be that extensible syscall layer.

Filesystem-oriented flags: sad, messy and not going away

Posted Mar 19, 2020 20:55 UTC (Thu) by BenHutchings (subscriber, #37955) [Link] (1 responses)

Indeed, there is no such thing as "Linux's libc". There's glibc, bionic, uclibc, musl, klibc, and at least one language run-time (Go) that doesn't depend on a C library.

Filesystem-oriented flags: sad, messy and not going away

Posted Mar 23, 2020 15:27 UTC (Mon) by gray_-_wolf (subscriber, #131074) [Link]

> least one language run-time (Go) that doesn't depend on a C library.

sometimes... would be nice if it never did but that is sadly not the case :/

Filesystem-oriented flags: sad, messy and not going away

Posted Mar 17, 2020 6:47 UTC (Tue) by areilly (subscriber, #87829) [Link] (3 responses)

The solaris model is that the syscall API is a shared object (along with the rest of libc). User-space code doesn't get to make syscalls at all. The libc shared library can be versioned and indeed multiple, so you can (theoretically) keep older cruftier ones only as long as you have any executables that need them, on an install-by-install basis. I believe that OpenBSD is considering a similar scheme in order to have some sort of protection about where syscalls can come from, to prevent trampoline and gadget-style malware, perhaps.

The BSD versioned syscalls are in the kernel (so you can still have static executables), but they can be supplied by loadable kernel modules (as the linux and SCO syscalls are/were), which can eventually be deprecated or not loaded as suits the use-case, without getting (too much) in the way of the "fresh" syscall API.

Filesystem-oriented flags: sad, messy and not going away

Posted Mar 17, 2020 6:53 UTC (Tue) by josh (subscriber, #17465) [Link] (2 responses)

> The solaris model is that the syscall API is a shared object

How does Solaris provide that to userspace? Similar to the VDSO, or via a library provided on the filesystem that calls an unstable kernel interface?

Filesystem-oriented flags: sad, messy and not going away

Posted Mar 17, 2020 7:41 UTC (Tue) by areilly (subscriber, #87829) [Link] (1 responses)

I'm afraid that I don't know. I had always assumed that it was a specially-blessed user-space library provided by the filesystem. I'm sure there are readers who know more about Solaris than I do (it wouldn't be hard).

Filesystem-oriented flags: sad, messy and not going away

Posted Mar 18, 2020 21:37 UTC (Wed) by justincormack (subscriber, #70439) [Link]

Yes it was just a normal library - you could make syscalls elsewhere but they were neither documented nor stable.

OpenBSD has been taking this model to a more modern design, where libc is blessed, and only it can make syscalls, by having a special attribute set. This is designed as a security measure, to stop arbitrary code using syscalls.

Filesystem-oriented flags: sad, messy and not going away

Posted Mar 17, 2020 7:02 UTC (Tue) by areilly (subscriber, #87829) [Link] (1 responses)

I was wondering why commenters were talking about rust but then I noticed that I mis-typed. Rust isn't a (scheme)! I meant racket of course. Sorry for the confusion!

I know that rust is versioning its releases too, but I don't know whether that actually allows for the linking and use of code written against different language versions, the way racket does. Racket code can import and use r5rs or r6rs or experimental-dialect code, which is cool. I suppose that C++ can do similarly for separately compiled object files, but it can't include old headers into new code, and python3 can't import python2 modules, which IMO is a terrible shame.

Filesystem-oriented flags: sad, messy and not going away

Posted Mar 17, 2020 15:15 UTC (Tue) by farnz (subscriber, #17727) [Link]

The equivalent in Rust is Editions; you can freely link code between different editions (currently only 2015 and 2018), but the compiler will translate each translation unit (crate in Rust) according to the edition you have specified for that crate.

Nothing, however, stops a Rust 2015 crate using a Rust 2018 crate as a dependency, or vice-versa, and you can freely share data types between the two editions. The only problem is that you might have to use r#identifier syntax if one crate uses a reserved word as an identifier.

Filesystem-oriented flags: sad, messy and not going away

Posted Mar 17, 2020 10:42 UTC (Tue) by roc (subscriber, #30627) [Link] (2 responses)

FWIW rr benefits enormously from the "stable kernel ABI" being the syscall interface. The same is true for strace and other tools that monitor and manipulate the user/kernel interface.

If stability guarantees applied at a shared library boundary like Solaris and Windows, then rr would have to choose between manipulating the unstable syscall interface or manipulating the shared library interface. The former would increase the rr maintenance burden considerably since we'd have to support every version of the syscall interface. The latter is difficult to do in a watertight way. Similar considerations would apply to strace etc.

I think it's also a great feature to have your stable ABI boundary enforced by hardware. On Windows people reverse engineer the syscalls and sometimes call them directly, bypassing the "stable ABI"; it's great that on Linux you simply *can't* bypass it.

Hardware being aware of the ABI boundary has other more esoteric benefits. For example rr needs to count performance events happening outside "the kernel"; on Linux the CPU supports us doing that, but on Windows/Solaris it doesn't if you consider that shared library to be "the kernel".

Filesystem-oriented flags: sad, messy and not going away

Posted Mar 17, 2020 21:21 UTC (Tue) by sorokin (guest, #88478) [Link] (1 responses)

I heard kernel can inject some code into userspace. I don't know the details, but the keyword is vdso -- I think kernel hackers here know much more about it than I do.

Wouldn't presence of vdso pose the same problem as the stability guarantee applied at shared library boundary?

Filesystem-oriented flags: sad, messy and not going away

Posted Mar 18, 2020 8:54 UTC (Wed) by roc (subscriber, #30627) [Link]

rr patches the VDSO to just do direct syscalls. It's as if the VDSO wasn't there.

Filesystem-oriented flags: sad, messy and not going away

Posted Mar 17, 2020 18:42 UTC (Tue) by ale2018 (guest, #128727) [Link] (1 responses)

I would have said that the situation would look less messy if similar flags were set at the same bit. Having 64-bit flags seems to allow it. Hmm...

/usr/include/linux/fcntl.h:#define AT_SYMLINK_NOFOLLOW	0x100   /* Do not follow symbolic links.  */
/usr/include/asm-generic/fcntl.h:#define O_NOFOLLOW	00400000	/* don't follow links */
/usr/include/x86_64-linux-gnu/bits/fcntl-linux.h:# define AT_SYMLINK_NOFOLLOW	0x100	/* Do not follow symbolic links.  */
/usr/include/x86_64-linux-gnu/sys/mount.h:  UMOUNT_NOFOLLOW = 8		/* Don't follow symlink on umount.  */

Filesystem-oriented flags: sad, messy and not going away

Posted Mar 20, 2020 12:43 UTC (Fri) by draco (subscriber, #1792) [Link]

If they weren't done that way from the start, then that's an ABI change, which isn't allowed.


Copyright © 2020, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds