Defending mounted filesystems from the root user

By Jonathan Corbet
August 21, 2023

Making a filesystem implementation robust in the face of maliciously created filesystem images is a challenging task even when the implementation is actively maintained, which many in the kernel are not. There is a way to make that task even harder, though: modify that filesystem image behind the implementation's back while it is mounted. A recent discussion on the linux-fsdevel list reveals an ongoing disagreement over whether (and how) this threat should be addressed.

Gabriel Krisman Bertazi recently posted a patch series adding support for negative dentries on case-insensitive ext4 and F2FS filesystems. Negative dentries cache the results of lookups on files that do not exist, accelerating subsequent lookups. Since this kind of operation happens frequently (consider, for example, iterating through a PATH environment variable to find an executable), this is an important optimization. Currently, though, negative dentries do not work with case-insensitive filesystems; this patch series rectifies that problem.

In the review discussion for this series, Eric Biggers asked about a specific check for the case where an inode shows up with the case-insensitive flag set, even though the filesystem has not been mounted for case-insensitive operation. This check was added by Ted Ts'o in 2019 to fix a crash experienced while fuzzing the filesystem. Biggers wondered why the test was placed at the inode's point of use rather than when that inode is first read from the disk.

Ts'o answered that the inode can change after it has been read into memory, in certain conditions:

It's not enough to check it in ext4_iget, since the casefold flag can get set *after* the inode has been fetched, but before you try to use it. This can happen because syzbot has opened the block device for writing, and edits the superblock while it is mounted.

One might think that the case of writing to a mounted filesystem behind the implementation's back would be one of those "don't do that" situations. It is not an action that is going to lead to a satisfying conclusion. There is, however, disagreement over what should be done about this case; Ts'o continued:

One could say that this is an insane threat model, but the syzbot team thinks that this can be used to break out of a kernel lockdown after a UEFI secure boot. Which is fine, except I don't think I've been able to get any company (including Google) to pay for headcount to fix problems like this, and the unremitting stream of these sorts of syzbot reports have already caused one major file system developer to burn out and step down.

Biggers replied that fixing problems caused this way is the wrong approach:

Well, one thing that the kernel community can do to make things better is identify when a large number of bug reports are caused by a single issue ("userspace can write to mounted block devices"), and do something about that underlying issue instead of trying to "fix" large numbers of individual "bugs".

He pointed out that Jan Kara has posted a patch set that addresses that issue by adding a configuration option to prohibit writing to block devices that are currently mounted. Applying this series — and configuring the kernel appropriately — would simply close off that entire avenue of attack and, Biggers said, make a large number of syzbot-reported bugs go away.

There is a minor problem or two with this approach, though. One is that, as Kara describes in the cover letter, enabling this option breaks a number of things in both kernel and user space, including Btrfs mounting, loopback mounts, and filesystem resizing. Fixing these problems is seemingly not overly difficult, but one cannot just enable this option in the kernel until they have been fixed, and those fixes have found their way onto deployed systems. That is a process that will take years.

Even then, this series will prevent writing to a mounted partition, but not to the device as a whole. If /dev/sda1 is mounted it cannot be written to, but /dev/sda (which covers the whole device, including the sda1 partition) is still fair game. And even if that were fixed, as Ts'o pointed out, there are other possible attacks, such as opening the SCSI-generic device and sending commands directly to the storage device. There is, it seems, always another way for a sufficiently privileged account to create mayhem.

Yet another problem is that, according to Ts'o, the syzbot developers are unwilling to turn on this configuration option unless disabling it would be hidden behind a new CONFIG_INSECURE option (to indicate that doing so would make the system insecure). Ts'o objected to that positioning "because that's presuming a threat model that we have not all agreed is valid".

So, even if Kara's series is applied to the kernel, it is a partial (albeit worthwhile) fix that cannot be enabled in deployed systems for years, and which will not be enabled by the people running the fuzzers. Filesystem developers will be limited to occasionally fixing symptoms of the problem as they appear while dealing with floods of fuzzing reports and questioning the basis on which these reports are made. It seems fair to say that this is not a great situation for anybody involved.

The real problem, arguably, is that there is no consensus within the community regarding the threats that the kernel should try to address. A threat model that includes defending the system against its own root user will require a huge hardening effort that many developers feel is both impossible and pointless and which, in any case, does not have the funding it would need to have a chance at succeeding. The subset of the community that is pushing for this threat model thus finds itself in conflict with the rest. Resolving that disagreement may turn out to be the hardest problem of all.

Index entries for this article
Kernel	Filesystems/Security
Kernel	Security

Defending mounted filesystems from the root user

Posted Aug 21, 2023 17:51 UTC (Mon) by dullfire (guest, #111432) [Link] (9 responses)

I don't think there's any way to mount a defense (as implied here) against the root user. For example CONFIG_BLK_DEV_NBD, and probably CONFIG_ATA_OVER_ETH, CONFIG_VIRTIO_BLK, CONFIG_BLK_DEV_UBLK would all allow root to modify a "disk" fs, while it is mounted (even with the proposed work arounds).

In fact CONFIG_USB_CONFIGFS_F_FS could also be use as such (make "worse" by the fact that most distro will turn around and auto-mount your trojan "USB drive", without root even having to take that step).

In my humble opinion, this attempt is never going to work out, and there will always be glaring holes in attempts to "secure" a system that way.

It would be better(in that there's a possibility it might be achievable) to have a mode (sysctl/sysfs twiddle maybe) that prevent new processes with uid 0 (possibly enforced at exec time). Of course that would be hard. And probably abusive to the system owners well.

Defending mounted filesystems from the root user

Posted Aug 21, 2023 17:55 UTC (Mon) by dullfire (guest, #111432) [Link]

Small correction: CONFIG_USB_CONFIGFS_F_FS would also require CONFIG_USB_DUMMY_HCD to practically attack the same systems.

Defending mounted filesystems from the root user

Posted Aug 21, 2023 18:29 UTC (Mon) by intelfx (subscriber, #130118) [Link]

> And probably abusive to the system owners well.

I'd imagine that "being abusive to the system owners" is rather the point of all this commotion.

Defending mounted filesystems from the root user

Posted Aug 22, 2023 10:39 UTC (Tue) by epa (subscriber, #39769) [Link] (6 responses)

It may be naive, but I think the right approach is to break out the things that can be done by uid 0 into capabilities (yes, adding more capabilities -- they really should not be a scarce resource) and then introduce a slightly lower-privileged "admin" user, uid 1, which can do most of the things you'd do as root, but not the most dangerous low-level stuff. And that might allow you to guarantee that "admin" cannot break out of a Secure Boot kernel, which sounds like a more reasonable threat model than trying to retrofit security restrictions on what was traditionally meant to be an unlimited-power God-mode user account.

Defending mounted filesystems from the root user

Posted Aug 23, 2023 2:57 UTC (Wed) by NYKevin (subscriber, #129325) [Link] (5 responses)

The traditional design of Unix is that root (or whatever uid=0 is called) can do a relatively small set of things:

* Open any file regardless of permissions.
* Impersonate any user with setuid(2) (or some equivalent).
* Send any signal to any process, and make other adjustments to the process's state (such as renicing it).
* Mount and unmount filesystems.
* And probably a few other highly standardized actions (i.e. *not* Linux-specific things) I've forgotten about.

Problem is, traditionally, you couldn't actually design an OS where root could only do things like the above. You also needed an interface for doing more complicated stuff, and especially for doing things in kernelspace (loading modules, debugging, enabling realtime scheduling, etc.). There are a few ways around this, at least that I can think of:

* We could try to partition off the kernelspace-modifying actions into a separate user, as you suggest, or at least into a separate set of capabilities(7) or the like. The difficulty is that you'd probably have to break up CAP_SYS_ADMIN for this to work, so it would be a lot of code churn. Ultimately, I think the existing capabilities would have to be fundamentally redesigned for this to make sense. It is not enough to split off a permission here and a permission there - we have to think logically about the transitive closure of everything that a process with capability X can ever do, directly or indirectly, and the current design does not even attempt to do that. And then we have to think about all possible combinations of capabilities, or at least all combinations that can plausibly interact with each other to escalate privileges.
* We could say "if you want to modify your kernel, either don't enable secureboot, or reimage your kernel with the appropriate changes pre-configured." The effect would be to disable the kernelspace-modifying actions altogether, and maybe even patch out their codepaths entirely so that they can't be used as ROP gadgets, but only in secureboot-enabled kernels (so that people who "just want a normal kernel" and don't want to put up with this sort of thing can ignore it). The main difficulty here is that, to my understanding, much of the existing "pre-configure your system" tooling currently lives in userspace (e.g. systemd). You'd probably need to provide a rich set of kernelspace configuration options that can be set before the system is first booted, and I'm not sure how feasible that is.
* We could partition off all of the "dangerous" permissions into a series of daemons like systemd and polkit, and administer the system by asking those daemons nicely to do it for us. That would extend secureboot trust to a much wider array of system services, which is probably undesirable (now your systemd has to be secureboot-signed?). OTOH, it's not like Microsoft maintains a strong segregation between the Windows NT kernel and the modern Windows userspace. From their perspective, I imagine the trusted component is a pretty large subset of the operating system, and I doubt they draw the line exactly at ring 0.

Defending mounted filesystems from the root user

Posted Aug 23, 2023 3:47 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]

> From their perspective, I imagine the trusted component is a pretty large subset of the operating system, and I doubt they draw the line exactly at ring 0.

Microsoft has a notion of "protected processes" that block every access to themselves, even from the Administrator user. Linux doesn't really have a similar thing. The root user can trivially ptrace any process.

Defending mounted filesystems from the root user

Posted Aug 23, 2023 6:36 UTC (Wed) by epa (subscriber, #39769) [Link]

The difficulty is that you'd probably have to break up CAP_SYS_ADMIN for this to work

Exactly right. CAP_SYS_ADMIN is the "big kernel lock" of permissions. Or it's fcntl(). Or any other design that started out as a reasonable idea but became more and more overloaded and treated as a receptacle for anything and everything.

Defending mounted filesystems from the root user

Posted Aug 23, 2023 9:20 UTC (Wed) by Wol (subscriber, #4433) [Link]

> Problem is, traditionally, you couldn't actually design an OS where root could only do things like the above. You also needed an interface for doing more complicated stuff, and especially for doing things in kernelspace (loading modules, debugging, enabling realtime scheduling, etc.). There are a few ways around this, at least that I can think of:

Going back to Pr1mos, the ONLY thing that was hard-coded into the OS (and even that could be patched out) was that user "system" could edit the root of the permissions tree. And not really even that - it simply set over-ride permissions, which I would often use when testing stuff ...

SPAC <system> wol:none
SPAC <data> wol:none

then I would run loads of stuff in testing that could cause carnage if I'd made a mistake, secure in the knowledge that the live system was not even visible to my program.

Cheers,
Wol

Defending mounted filesystems from the root user

Posted Aug 27, 2023 14:04 UTC (Sun) by Baughn (subscriber, #124425) [Link]

> * We could say "if you want to modify your kernel, either don't enable secureboot, or reimage your kernel with the appropriate changes pre-configured."

I have a computer that doesn’t boot with Secureboot disabled. They seem to be getting more common.

At the moment, I’m still able to use it as a regular computer thanks to Linux not locking itself down hard enough to stop me modifying the kernel. If a rule like that was added, then i suppose it’s game over.

Defending mounted filesystems from the root user

Posted Aug 28, 2023 14:03 UTC (Mon) by jwarnica (subscriber, #27492) [Link]

We could do all of those things, or we could accept that some kernel being operated by some human running on real or virtual hardware requires a level of trust of that human.

It's a weird mental model that "root is special, protect it". See: https://xkcd.com/1200/ In a more enterprisy sense: consider that some app team has full permissions to /var/lib/pgsql, but the OS team has root, so the app team needs to open a ticket to restart the server. Yah! I guess the app team isn't able to put a NIC in promiscuous mode, but who isn't using switches?

Presume that which ever human runs the kernel has access to everything; that is either tolerable trust or a massive breach depending on the organizational requirement. And then protect the kernels running, from each other. Harden the VM layer, harden the network layer, harden the APIs.

Defending mounted filesystems from the root user

Posted Aug 21, 2023 19:56 UTC (Mon) by leromarinvit (subscriber, #56850) [Link] (7 responses)

I don't think this class of attack is only about root. Even if it were possible to perfectly prevent all local users from writing to the backing store of a mounted FS, there's no defense against malicious hardware. Emulate a USB storage device with a $2 microcontroller board, and you can change everything you like at any time you want - and no one can possibly stop you.

I'm not sure I can see any workable solution for this other than carefully auditing all involved kernel code with that attack vector in mind. Even disabling USB or forbidding mounting removable media is not a 100% remedy, it only moves the bar up a bit (never mind such a setup not being terribly likely to be welcome for many use cases). A sufficiently motivated attacker might as well present this malicious device via a SATA or NVMe interface - maybe a bit more complicated and less ready-made hardware and software stacks to choose from, but not impossible.

Defending mounted filesystems from the root user

Posted Aug 22, 2023 1:30 UTC (Tue) by NYKevin (subscriber, #129325) [Link] (2 responses)

I Googled [SATA to USB dongle] and got approximately 1 bajillion results. Can't you just emulate one of those devices, and lie about what's plugged into the (fictitious) "other end"? Or are those devices actually implementing a full SATA stack and presenting themselves as USB drives? I think the latter seems a bit excessive for something that can be had for under $10.

Defending mounted filesystems from the root user

Posted Aug 22, 2023 7:45 UTC (Tue) by geert (subscriber, #98403) [Link]

IIRC, the underlying transport for SATA and USB storage is very similar.

I can easily imagine a small and cheap device with a USB host and a USB device connector, which sits between the computer and a USB memory stick, introducing (not so) random corruptions to data read from the memory stick to attack the host.

Defending mounted filesystems from the root user

Posted Aug 22, 2023 8:53 UTC (Tue) by pbonzini (subscriber, #60935) [Link]

A SATA-to-USB adapter is basically a SoC that implements both physical interfaces, plus some software that does SCSI-to-ATA emulation (because USB storage is based on SCSI). Likewise for microSD readers, except it's SCSI-to-SD of course. In both cases the cost of the hardware wildly dominates since software only has to be written once.

Defending mounted filesystems from the root user

Posted Aug 22, 2023 16:03 UTC (Tue) by zeno_kdab (guest, #165579) [Link] (2 responses)

If the attacker has enough physical access to plug in a malicious SATA or NVMe device, isn't it rather too late to worry about security? I'd think at that point there are plenty hardware based attacks possible that no OS could defend against anyway.

For external devices maybe an idea would be to just use unprivileged FUSE to mount? It seems rather unlikely to have a use case where you need maximum FS performance but at the same time can't trust your hardware...

Defending mounted filesystems from the root user

Posted Aug 23, 2023 14:03 UTC (Wed) by draco (subscriber, #1792) [Link] (1 responses)

Not necessarily. As an analogy, let's say that the block device is cloud storage. The cloud storage has different threats than the rest of the computer.

It's fair to say that in a scenario where you're computing in malicious environments that you must be able to trust some of your hardware — if you can't trust the CPU itself, you're doomed, sure. But with a trusted computing core and IOMMU, you can (in principle) mitigate malicious I/O if you write the drivers defensively.

Defending mounted filesystems from the root user

Posted Aug 23, 2023 17:21 UTC (Wed) by zeno_kdab (guest, #165579) [Link]

I'll agree that it does seem theoretically possible to do so. Though I am doubtful that it is a good idea, besides the already mentioned concern of practical feasibility.

Imho either you trust your hardware, and don't want your FS drivers to be slowed down by being implemented super defensively, always rechecking everything etc. Or you don't trust, but then you should be fine taking the perf hit by using FUSE or a VM to isolate the hardware handling from your host kernel.

Having said that, I always dream about a new OS kernel that transcends the monolithic/micro-dichotomy by easily allowing to move all kinds of driver into userspace and back ;)

Defending mounted filesystems from the root user

Posted Aug 22, 2023 20:06 UTC (Tue) by zorg24 (subscriber, #138982) [Link]

The issue of automounting USB drives was actually discussed in a previous article https://lwn.net/Articles/939097/

Defending mounted filesystems from the root user

Posted Aug 21, 2023 20:47 UTC (Mon) by amarao (subscriber, #87073) [Link] (13 responses)

Why network doesn't have such problem? Arbitrary junk hit the fan on NIC, and everything breaking RFC is just dropped.

Can the same thing be done for fs? Storage layer, which guarantee correctness of data for the next layer.

May be filesystem is like HTTP working with Ethernet frames...

Defending mounted filesystems from the root user

Posted Aug 21, 2023 21:29 UTC (Mon) by leromarinvit (subscriber, #56850) [Link] (10 responses)

The layer that would need to drop invalid on-disk structures is the very FS code itself. Nothing else knows what is conforming and what isn't. To keep with the network analogy, the Ethernet layer can't defend a buggy HTTP parser against application-level attacks.

Network protocols are usually designed to be easy to validate (at least sane ones), and nothing terrible typically happens when you drop non-conforming packets sent from a buggy, non-malicious source. And yet, even in networking, the common approach used to be "conservative in what you send out, liberal in what you accept" until relatively recently.

File systems, OTOH, are usually designed with performance as the main goal, and malicious images aren't the top concern (and they certainly weren't when most of today's widely used file systems were designed). Also, resilience against corruption is a concern often diametrically opposed to strict validation - if corruption causes 10% of my files to contain garbage (or turns 10% of a single corrupted file into garbage), I'd much prefer my FS to give me the remaining 90% without fuss.

Defending mounted filesystems from the root user

Posted Aug 22, 2023 11:57 UTC (Tue) by pizza (subscriber, #46) [Link] (9 responses)

>Nothing else knows what is conforming and what isn't.

And it's often only possible to tell if a given on-disk metedata structure is "conformant" after loading *every other* bit of metadata into memory and effectively doing a full consistency/fsck pass. Of course you're still vulnerable to stuff being written to disk behind your back, so the only way to handle that is to always keep the full metadata in memory, and never re-read anything from disk.

Defending mounted filesystems from the root user

Posted Aug 22, 2023 13:39 UTC (Tue) by mathstuf (subscriber, #69389) [Link] (8 responses)

But then you have to write all of it back because it might have changed on disk behind your back. Anything less and you're just deferring discovering the bogus writes until the next mount time.

Defending mounted filesystems from the root user

Posted Aug 22, 2023 14:10 UTC (Tue) by leromarinvit (subscriber, #56850) [Link] (3 responses)

Trying to detect writes to a device (or the page cache) behind the file system's back is probably a futile endeavor. But making sure to never crash, no matter what any given read returns (and no matter if that's consistent with what was read elsewhere), seems like a goal that should be attainable in principle (even if, as many have said, it's a lot of work).

Defending mounted filesystems from the root user

Posted Aug 22, 2023 16:08 UTC (Tue) by DemiMarie (subscriber, #164188) [Link]

Exactly! And big companies (Google, Oracle, Red Hat, etc) need to hire more people to meet that goal.

Defending mounted filesystems from the root user

Posted Aug 22, 2023 17:19 UTC (Tue) by mathstuf (subscriber, #69389) [Link] (1 responses)

That sounds like a futile endeavor to me. Sure, make things *better*, but *crash-proof* when you're in the realm of forced TOCTOU races (that iSCSI situation given above), deliberately bad actors messing with inode pointers/refcounts that may cause page cache confusions, or whatever else could possibly go wrong when you're in an absolutely uncontrollable and hostile environment…

I don't know…the trust line has to go somewhere here. For example, Rust is not safe against `/proc/self/mem` editing. I'm not sure what one *could* do in the face of such power because the only thing you have is "my registers are not accessible" and "the program counter will keep moving".

Note that I am usually all about defensive programming and covering bases, but I also don't interface with hardware directly and have some baseline level of viable behavior. The tales I've heard here (and from linked blogs, etc.) make me happy about my course so far. I am extremely grateful for those that do that work, but I do not envy their jobs.

Defending mounted filesystems from the root user

Posted Aug 23, 2023 17:13 UTC (Wed) by leromarinvit (subscriber, #56850) [Link]

Sure, perfection is impossible for anything sufficiently complex. What I really meant was fixing the issues one knows about, and having the attack vector in mind when designing and writing new code. Not saying the current maintainers have to tackle all that in addition to everything they're already doing - it's clear that adding more work either leads to everything moving at a slower pace, or someone needs to step in and fund more developers. (Someone volunteering is of course also possible, but I have a hard time imagining that "I can use root privileges to make the kernel do funny things" is many people's most important itch to scratch.)

I also should probably have qualified the "never crash" with "in a way that potentially allows privilege escalation". If removable media were by default mounted using something like lklfuse, that would IMHO be a big step in the right direction. But I think this should be mainlined, or decoupled from the actual driver code so much that it can use arbitrary kernel images or modules. Using different versions of the same fs driver (with a different set of features and bugs), potentially interchangeably on the same device, sounds like a recipe for compatibility issues.

Defending mounted filesystems from the root user

Posted Aug 24, 2023 7:12 UTC (Thu) by Karellen (subscriber, #67644) [Link] (3 responses)

Anything less and you're just deferring discovering the bogus writes until the next mount time.

Why is that a problem?

Defending mounted filesystems from the root user

Posted Aug 24, 2023 12:33 UTC (Thu) by mathstuf (subscriber, #69389) [Link] (2 responses)

Let's say you bring in all of the metadata from the FS into memory and work from there. If you don't edit any of them, there's no need to write. However, in a situation where the backing store is editable by some other mechanism (network-mounted block device, direct writes, whatever, these can be written without noticing (say, swapping two inodes in a directory listing). Without writeback, you're just deferring "something edited my FS" problems from "direct memory access" to "when I load from disk next time".

Defending mounted filesystems from the root user

Posted Aug 24, 2023 14:16 UTC (Thu) by Karellen (subscriber, #67644) [Link] (1 responses)

you're just deferring "something edited my FS" problems from "direct memory access" to "when I load from disk next time".

I get that. I just don't see why it's a problem. Surely checking for consistency and deciding what to do if there's a problem is easier at mount time than it is while the filesystem is in use?

Defending mounted filesystems from the root user

Posted Aug 30, 2023 9:45 UTC (Wed) by taladar (subscriber, #68407) [Link]

At that point you are potentially working on a fictional version of your filesystem that doesn't exist on disk for months at a time. Considering persistent storage is the main purpose of a hard disk that doesn't seem like a good idea.

Defending mounted filesystems from the root user

Posted Aug 22, 2023 1:49 UTC (Tue) by geofft (subscriber, #59789) [Link] (1 responses)

Basically, filesystems are bigger than packets.

The reason the attacks here are about data in the superblock and not e.g. within an inode is because you can reasonably cache a little bit of data in from the block device and then validate it once you've read it. Maybe you load a page worth of data, and then you validate its layout, and then you can use the validated page. For instance, maybe there's a uint32 that specifies how long the filename is, which is restricted by spec to something more reasonable like 1024 bytes. If you've already copied the data into memory you trust, you can check it and then have other functions use it directly without worrying about them doing a kmalloc(4G).

For a network protocol parser (at any layer), that's all it does! It's received some bytes from the network into RAM, and then the authoritative copy of the data is in your own trusted RAM for you to handle as you like. You can parse it and interpret it and pass it on, or you can drop it. Then you get more bytes from the network. Even if you're receiving a large amount of data, you're handling one packet at a time, and each packet becomes fully yours when you receive it.

For filesystems, you have terabytes of data that you're repeatedly going back to. There's a lot of structure of superblock to directories to inodes to data. Not all of those blocks stay in memory. So maybe you've read a superblock once, determined that it's valid, and then it changes and for whatever reason the superblock is no longer in memory. Then the next function down the line might not get the same bytes that you validated. You can't copy the entire filesystem into RAM up front because half the point of a filesystem is to be bigger than what you can fit in RAM. You can't parse things as you receive them because you're doing random access.

You _can_ revalidate data each time you need it, but the argument being made is that writing code this way is a very unnatural and unpleasant experience.

Defending mounted filesystems from the root user

Posted Aug 22, 2023 12:12 UTC (Tue) by SLi (subscriber, #53131) [Link]

All this discussion makes me think about one of the things I'll eventually maybe get up to doing once I've done all the other important things in the world.

To me, filesystems are in many ways an exceptionally nicely contained thing. They largely follow a well defined, narrow API with well defined semantics. Exceptions to it are probably fairly easy to express. Regardless of the filesystem, you can say things like "if I write a file, then read it back without other writes to the same file, I should get the same data (or an error) back".

That is, they seem exceptionally amenable to formal specification and analysis, and from that perspective, how they are designed today seems quite ad hoc. It shouldn't be as hard as with many other systems to actually formally define the operations (up to what gets written to the disk where) and verify that the requires properties hold, as well as do a lot of analysis on performance etc., play with different design ideas without needing to convert and boot kernels, etc. You could treat tolerance to bogus data in the same way, allowing a conscious decision on exactly how you are allowed to fail in different situations.

Now I'm not saying that should necessarily be the same as the code that gets executed (or even generated from it), but parts of it could well be if desired. Verifying the design should give quite a bit of confidence, and effort could be directed at the performance critical parts.

Defending mounted filesystems from the root user

Posted Aug 21, 2023 21:07 UTC (Mon) by flussence (guest, #85566) [Link] (7 responses)

If it's doable _with a reasonable amount of effort_, it may as well be done. Every threat model like this sounds insane until someone invents a better Rowhammer and suddenly it's the easiest way to get root.

Defending mounted filesystems from the root user

Posted Aug 22, 2023 0:36 UTC (Tue) by kmeyer (subscriber, #50720) [Link] (6 responses)

You've touched upon the issue -- it is not tractable with a reasonable amount of effort. Writing a filesystem for a byzantine storage media is a huge effort, if it is even possible, and adapting historical filesystems not designed around byzantine media is Sisyphean.

Defending mounted filesystems from the root user

Posted Aug 22, 2023 1:28 UTC (Tue) by Paf (subscriber, #91811) [Link]

I think it’s obviously *possible*, but I agree we’re not well prepared for it (I’m a file system developer) and it may require a different threat model at time of design. Not fun.

Defending mounted filesystems from the root user

Posted Aug 22, 2023 2:04 UTC (Tue) by geofft (subscriber, #59789) [Link] (4 responses)

I think sufficiently-Byzantine filesystems just need to get deprivileged, honestly.

https://github.com/lkl/linux is a fork of the Linux kernel that turns all the interesting routines into a library, with a couple of neat tech demos of what you can do with it - including a FUSE wrapper for the filesystems in the kernel. So any filesystem that's already been implemented once, in the kernel, now has a userspace version.

There's also other ways to do it, such as UML or hardware-assisted virtualization.

Yes, you will lose some performance. I think the triangle of security, performance, and nicheness is a "pick two" situation - if you want both security and performance, you will need to attract enough interest and enthusiasm to pick up the work, possibly defining newer and easier-to-handle on-disk formats as the work happens. Otherwise, you can use an old implementation that made sense in the '90s at full performance with the security of the '90s, or you can use it at the performance of the '90s (which should be enough, honestly!).

(I'd also be very curious to see what the actual performance loss is even for day-to-day filesystems, and whether there are things that can be done to address performance like reviving the zero-copy FUSE patchset. I think I actually do very few things that are ridiculously sensitive to filesystem performance per se: most of the time I'm either working with large single files like giant CSVs or git pack files or game textures, for which the filesystem is essentially a constant factor and it's the raw I/O performance that matters, or reading and writing lots of small files like source code, which can mostly stay in the VFS cache, in theory. Applications that care very much about disk performance, like databases, tend to make a large contiguously-allocated single file anyway - and they subdivide it in userspace.)

Defending mounted filesystems from the root user

Posted Aug 22, 2023 9:48 UTC (Tue) by khim (subscriber, #9252) [Link]

In a sane world we would have both. FUSE-filesystem to deal with USB or other untrusted sources and in-kernel implementation for root fs.

Defending mounted filesystems from the root user

Posted Aug 24, 2023 10:29 UTC (Thu) by farnz (subscriber, #17727) [Link]

The only significant gotcha is that which implementation to use (userspace or kernelspace) is not about the filesystem in use, but rather about the degree to which the backing storage and the user are trustworthy.

In one system, I might want to use both the kernelspace implementation of xfs for my root FS, using something like fs-verity to protect against a malicious root user, and the userspace implementation for home directories. For added fun, I might want the userspace implementation to run multiple instances, so that an exploit is less likely to affect other instances (only affects other instances if it can be used to write to the backing store); this comes in handy with something along the lines of Android's user-per-application model, where I won't be able to mutate in-memory state that affects another application.

Defending mounted filesystems from the root user

Posted Aug 27, 2023 23:32 UTC (Sun) by kmeyer (subscriber, #50720) [Link] (1 responses)

I mean, we're talking about ext4, xfs, and btrfs. Punting them to FUSE + LKL isn't really an option, IMO.

Defending mounted filesystems from the root user

Posted Aug 28, 2023 0:31 UTC (Mon) by geofft (subscriber, #59789) [Link]

I was mostly replying to the comment talking about historical filesystems - if we need to write an ext5 that makes this sort of robustness easier, that's certainly an option. We are, in any case, mostly just talking about lockdown/secure boot configs - it seems pretty reasonable to say, the upstream kernel will only let you use a very small set of filesystems when lockdown is on, and if distros want to take the risk of filesystems that haven't been audited, they can patch this check out. A lot of people run without kernel lockdown enabled (perhaps enforcing secure boot in some other way, like verifying a signature on an entire read-only rootfs) and won't be impacted by this. The people who do run with kernel lockdown should be able to expect that the kernel isn't just writing off ring-0 escalation risks.

But also I don't think punting major filesystems to FUSE is really out of the question. It was the vision of the microkernels of the '90s, which failed not because there was anything fundamentally wrong with microkernels but because overhead was high. We've learned a lot about writing efficient software that spans multiple address spaces since then (it's in many senses similar to HPC work or GPU programming), and also the physical computers are way faster. As I mentioned, without an actual benchmark, I think saying that this just has to be done in kernelspace is premature optimization.

(We also know a lot more about software fault isolation now than we did in the '90s - we could use something like eBPF or wasm or Native Client to keep these filesystems in the kernel but limit the impact of bugs.)

We Linux folks rightly make fun of Windows for having done font rendering in the kernel for so long and having had a bunch of ring-0 privilege escalation bugs as a result. It made sense in the '90s when they cared a lot about font rendering performance and basically not at all about malicious fonts; it doesn't make sense today. I don't think filesystems are a fundamentally different story.

Defending mounted filesystems from the root user

Posted Aug 21, 2023 21:44 UTC (Mon) by jengelh (subscriber, #33263) [Link] (1 responses)

>syzbot has opened the block device for writing, and edits the superblock while it is mounted. [...] writing to a mounted filesystem behind the implementation's back would be one of those "don't do that" situations.

Well, syzbot just needs to find a code path which makes the filesystem implementation itself issue the destructive write. Then everyone will jump to fix it. :-)

Defending mounted filesystems from the root user

Posted Aug 25, 2023 21:24 UTC (Fri) by calumapplepie (guest, #143655) [Link]

Yes... they would fix the filesystem issuing a destructive write. That is a bug. That is bad.

A filesystem failing to handle concurrent modification is less of a bug.

Defending mounted filesystems from the root user

Posted Aug 22, 2023 4:41 UTC (Tue) by ebiggers (subscriber, #130760) [Link]

This article misses an important point, which is that the specific issue being discussed is writes to the block device's **page cache** while the filesystem is mounted. It's virtually impossible for filesystems to maintain memory safety in that case. Whereas it's possible (but difficult) for filesystems to maintain memory safety when their underlying storage changes.

It is helpful to not conflate these two cases. This makes it clear why it's useful to e.g. forbid writes to /dev/sda1 while still allowing /dev/sda. Even just forbidding buffered writes would solve this problem; O_DIRECT writes could still be allowed.

Defending mounted filesystems from the root user

Posted Aug 22, 2023 13:43 UTC (Tue) by magfr (subscriber, #16052) [Link] (1 responses)

For added fun you can put the file system on an iSCSI device, mount it from two computers concurrently, and then start writing from both.

I do not expect the kernel to handle that scenario.

Defending mounted filesystems from the root user

Posted Aug 22, 2023 16:13 UTC (Tue) by willy (subscriber, #9762) [Link]

... unless it's GFS2 or OCFS which are designed for exactly that use case ;-)

Defending mounted filesystems from the root user

Posted Aug 23, 2023 5:03 UTC (Wed) by mcassaniti (subscriber, #83878) [Link] (2 responses)

Disabling the ability to modify the whole block device means that new partitions cannot be created live (think expanding a VM disk), not can a partition be extended. While systemd isn't everyone's favourite, systemd-sysupdate can change the partition table and overwrite non-mounted partitions as part of an A/B update process. It's likely not the only tool to do so either.

Defending mounted filesystems from the root user

Posted Aug 25, 2023 2:19 UTC (Fri) by smammy (subscriber, #120874) [Link] (1 responses)

It's always struck me as a little odd that we interact with partition tables by raw block device access from userspace. Surely it would be reasonable for the kernel to have partition table drivers with an API for manipulating them. Presumably calls to modify or delete a mounted partition would fail, while calls that add new partitions or modify/delete unmounted partitions could succeed.

Defending mounted filesystems from the root user

Posted Aug 29, 2023 3:15 UTC (Tue) by matthias (subscriber, #94967) [Link]

> Surely it would be reasonable for the kernel to have partition table drivers with an API for manipulating them. Presumably calls to modify or delete a mounted partition would fail, while calls that add new partitions or modify/delete unmounted partitions could succeed.

In a way, this is the case. Think of the on disk partition table as a configuration "file" that tells the kernel how to configure its internal partition table. The API allows to re-read this configuration after userspace has changed it on disk. And it will fail, it the kernel thinks this is not safe to do. Back in the days, it was entirely impossible to re-read a partition table if any filesystem on the disk was mounted, always requiring a reboot if one modified the partition table on the primary disk. Nowadays it is a bit more permissive.

You just have to mentally differ between the kernel partition table (which is always in RAM) and the partition table on disk. Changing the latter one is no issue at all, as it will only be used by the kernel when explicitly told so or on the next boot. And this design makes a lot of sense. You can do modifications on disk that are only safe to apply after the next boot and then reboot. If the only ways of changing the on disk partition table where by means of an API that directly manipulates the internal partition table such changes would always require to boot another OS (rescue CD etc.).

Defending mounted filesystems from the root user

Posted Aug 25, 2023 22:49 UTC (Fri) by calumapplepie (guest, #143655) [Link]

> Yet another problem is that, according to Ts'o, the syzbot developers are unwilling to turn on this configuration option unless disabling it would be hidden behind a new CONFIG_INSECURE option (to indicate that doing so would make the system insecure). Ts'o objected to that positioning ""because that's presuming a threat model that we have not all agreed is valid"".

The threat model of "root is evil" is apparently a valid one supported by kernel_lockdown(7). However, it isn't valid unless the filesystem is booted up in lockdown mode: if it isn't, root can just use kmem and such. At a minimum, we could gate edits to devices containing mounted filesystems and on touching the SCISI_GENERIC device behind a requirement that the kernel isn't locked down. Kernel lockdown already breaks a number of things; what's a few more?

Alternatively, start with a CONFIG_LOCKDOWN_STRICT option, which when enabled tightens the restrictions of lockdown to prohibit such things as mounted block-device writes. For those users who require a root -> kernel barrier, they can enable that option, and with it some more restrictions that might break semi-niche application code. Yes, I'm considering online resizing to be 'semi-niche'. If you really require the security guarantees of a strict lockdown, you enable the config; otherwise, leave it disabled.

For those users who are just running a distro kernel, which enables CONFIG_LOCKDOWN_LSM but not STRICT because they want all the features available, this means that (for a period of time) they will be vulnerable to novel attacks using this threat model. However, the goal will be to move this patch into the basic CONFIG_LOCKDOWN eventually; thus fixing all such bugs. As we do so, we can add additional hardening behind CONFIG_LOCKDOWN_STRICT, for instance disabling a wider variety of sysfs files or locking down old drivers. You can also remove the ability to disable the lockdown LSM on the command line; a command line which can be edited for the next boot by root on most machines.

This two-phase mechanism ensures that those who want a strict lockdown will need to deal with the breakage that it causes in userspace. Those who don't need a strict lockdown, but enable lockdown anyways for hardening get to benefit over time from the work of those who need a stricter mode. It's similar to the realtime stuff; if you want a realtime kernel, you have to configure yourself a realtime kernel. If you want a kernel that actually blocks all ring0 compromise, then you have to build it yourself.

In other words: There are some folks who actually want this threat model secured, and many more who don't really care but appreciate the hardening it produces. Differentiate between the two with config options, document the difference in all the places that talk about lockdown, and let those who want it strictly secured deal with the breakage and performance regressions from it.

TLDR: Make the security model of the kernel a kconfig option, and limit features for those using the root-is-evil threat model until those features can be made secure.

Defending mounted filesystems from the root user

Posted Mar 11, 2024 2:26 UTC (Mon) by lathiat (subscriber, #18567) [Link] (1 responses)

Ignoring all of the TOCTOU issues, this same functionality seems super helpful for preventing users from shooting themselves in the foot when trying to write an image to (the wrong) disk. Though would probably need to also protect the full device not just the partitions.

This would be a nice default anyway, even if it has some kind of override method for the weird cases.

Defending mounted filesystems from the root user

Posted Mar 11, 2024 5:09 UTC (Mon) by adobriyan (subscriber, #30858) [Link]

> for preventing users from shooting themselves in the foot when trying to write an image to (the wrong) disk.

or developers from running fio job with wrong filename=
:-(