Read-only bind mounts

By Jonathan Corbet
May 6, 2008

Bind mounts can be thought of as a sort of symbolic link at the filesystem level. Using mount --bind, it is possible to create a second mount point for an existing filesystem, making that filesystem visible at a different spot in the namespace. Bind mounts are thus useful for creating specific views of the filesystem namespace; one can, for example, create a bind mount which makes a piece of a filesystem visible within an environment which is otherwise closed off with chroot().

There is one constraint to be found with bind mounts as implemented in kernels through 2.6.25, though: they have the same mount options as the primary mount. So a command like:

    mount --bind -o ro /vital_data /untrusted_container/vital_data

will fail to make /vital_data read-only under /untrusted_container if it was mounted writable initially. On your editor's 2.6.25 system, the failure is silent - the bind mount will be made writable despite the read-only request and no error message will be generated (the mount man page does document that options cannot be changed).

There is clear value in the ability to make bind mounts read-only, though. Containers are one example: an administrator may wish to create a container in which processes may be running as root. It may be useful for that container to have access to filesystems on the host, but the container should not necessarily have write access to those filesystems. As of 2.6.26, this sort of configuration will be possible, thanks to the merging of the read-only bind mounts patches by Dave Hansen.

As it happens, it's still not possible to create a read-only bind mount with the command shown above; the read-only attribute can only be added with a remount operation afterward. So the necessary sequence is something like:

    mount --bind /vital_data /untrusted_container/vital_data
    mount -o remount,ro /untrusted_container/vital_data
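
The same two-step sequence can be issued from a program through mount(2). What follows is a minimal sketch rather than a recommended tool: the function name bind_mount_readonly() is hypothetical, and the sketch assumes a 2.6.26 or later kernel and a caller with the privileges needed to mount. The flags themselves (MS_BIND, MS_REMOUNT, MS_RDONLY) are the standard ones from <sys/mount.h>. Including MS_BIND in the remount call is, as far as the mount(2) documentation describes it, what asks the kernel to change only this mount point and not the underlying filesystem.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mount.h>

/*
 * Hypothetical helper: bind-mount source at target, then make that
 * bind mount read-only. Returns 0 on success or a negative errno value.
 */
int bind_mount_readonly(const char *source, const char *target)
{
    assert(source != NULL && target != NULL);

    /* Step 1: a plain bind mount; a read-only request would be ignored here. */
    if (mount(source, target, NULL, MS_BIND, NULL) != 0)
        return -errno;

    /* Step 2: remount; MS_BIND restricts the change to this mount point. */
    if (mount(NULL, target, NULL, MS_REMOUNT | MS_BIND | MS_RDONLY, NULL) != 0) {
        int err = errno;

        /* Do not leave a writable bind mount behind on failure. */
        umount(target);
        return -err;
    }
    return 0;
}
```

A caller would check the return value and treat any negative result as a failed (and already cleaned-up) mount.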

This example raises an interesting question: what if some process opens a file for write access between the two mount operations? A system administrator has the right to expect that a read-only mount will, in fact, only be used for read operations. The 2.6.26 patch is designed to live up to that expectation, though the amount of work required turned out to be more than the developers might have expected.

Filesystems normally track which files are opened for write access, so an attempt to remount a filesystem read-only can be passed to the low-level filesystem code for approval. But the low-level filesystem knows nothing about bind mounts, which are implemented entirely within the virtual filesystem (VFS) layer. So making read-only access for bind mounts work requires that the VFS keep track of all files which have been opened for write access. Or, more precisely, the VFS really only needs to keep track of how many files are open for write access.

The technique chosen was to create something which looks like a write lock for filesystems. Whenever the VFS is about to do something which involves writing, it must first call:

    int mnt_want_write(struct vfsmount *mnt);

The return value is zero if write access is possible, or a negative error code otherwise. This call can be found in obvious places - such as in the implementation of open() - when write access is requested. But write access comes into play in many other situations as well; for example, renaming a file requires write access for the duration of the operation. So mnt_want_write() calls have been sprinkled throughout the VFS code.

When write access is no longer needed, the "write lock" should be released with a call to:

    void mnt_drop_write(struct vfsmount *mnt);
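
The pairing of these two calls can be modeled in user space. This is a toy sketch, not kernel code: struct toy_mount, its fields, and all of the toy_ functions are hypothetical stand-ins, and the choice of -EROFS and -EBUSY as error codes is an assumption made for the illustration. Only the calling convention (zero on success, a negative error code on failure, every successful "want" matched by a "drop") comes from the text.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for struct vfsmount; not the kernel's layout. */
struct toy_mount {
    int readonly;           /* nonzero once the mount is read-only */
    unsigned long writers;  /* outstanding write accesses */
};

/* Follows the mnt_want_write() contract: 0 on success, negative errno otherwise. */
int toy_want_write(struct toy_mount *mnt)
{
    if (mnt->readonly)
        return -EROFS;
    mnt->writers++;
    return 0;
}

/* Releases one write access previously granted by toy_want_write(). */
void toy_drop_write(struct toy_mount *mnt)
{
    assert(mnt->writers > 0);
    mnt->writers--;
}

/* A rename-like operation: write access is held for its duration. */
int toy_rename(struct toy_mount *mnt)
{
    int err = toy_want_write(mnt);

    if (err)
        return err;
    /* ... the actual modification would happen here ... */
    toy_drop_write(mnt);
    return 0;
}

/* A read-only remount is refused while any write access is outstanding. */
int toy_remount_ro(struct toy_mount *mnt)
{
    if (mnt->writers > 0)
        return -EBUSY;
    mnt->readonly = 1;
    return 0;
}
```

In this model, a remount attempted while a file is held open for writing fails, and once the remount has succeeded every later toy_want_write() call is refused.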

One of the discoveries which has been made is that write access is needed in rather more places than one might have thought. In particular, it turns out that there is need for mnt_want_write() calls within the low-level filesystems as well as in the VFS layer. So getting the read-only bind mounts patch into shape has been an ongoing process of finding the spots which have been missed and adding mnt_want_write() calls there. In an attempt to make this process a bit less error-prone, Miklos Szeredi has put together a set of VFS helper functions which encapsulate the situations where write access is needed. Those functions have not been merged for 2.6.26, however.

Superficially, mnt_want_write() is easy to understand - it simply increments a counter of outstanding write accesses. The problem with a simple implementation, though, is that a shared, per-filesystem counter would create scalability problems. On multiprocessor systems, the cache line containing the counter would bounce around the system, slowing things considerably.

A common response to this type of problem is to turn the counter into a per-CPU variable, allowing operations on the counter to remain local to each processor. When somebody needs to know the total value of the counters, it's a simple matter of adding each CPU's version; this operation is slow, but it is also rare. On big systems, though, the number of CPUs can be large - as can the number of filesystems, and bind mounts will only increase that number. The result is a multiplicative effect which, once again, is a scalability problem, only this time it manifests itself in the form of excessive memory use.
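
The per-CPU approach and its memory cost can be sketched in plain C. This models only the arithmetic, under stated assumptions: TOY_NR_CPUS, struct toy_percpu_counter, and the toy_ functions are hypothetical names, and a real per-CPU variable would also keep each processor's slot in its own cache line, which this array does not.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NR_CPUS 4   /* hypothetical CPU count for the illustration */

/* One slot per CPU; each CPU only ever touches its own slot. */
struct toy_percpu_counter {
    long count[TOY_NR_CPUS];
};

/* Fast path: adjust the local slot only. */
void toy_counter_add(struct toy_percpu_counter *c, size_t cpu, long delta)
{
    assert(cpu < TOY_NR_CPUS);
    c->count[cpu] += delta;
}

/* Slow, rare path: the true value is the sum over all CPUs. */
long toy_counter_sum(const struct toy_percpu_counter *c)
{
    long total = 0;

    for (size_t cpu = 0; cpu < TOY_NR_CPUS; cpu++)
        total += c->count[cpu];
    return total;
}

/* Memory used when every mount carries its own per-CPU counter. */
size_t toy_counter_memory(size_t nr_mounts)
{
    return nr_mounts * sizeof(struct toy_percpu_counter);
}
```

An individual slot may go negative when an increment and its matching decrement happen on different CPUs; only the sum is meaningful. The last function shows the multiplicative cost: memory grows with the number of mounts times the number of CPUs.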

The read-only bind mounts patch resolves this situation by, in effect, going back to global counters which are cached on specific processors. To that end, each CPU has one of these structures:

    struct mnt_writer {
        spinlock_t lock;
        unsigned long count;
        struct vfsmount *mnt;
    };

At any given time, this structure will hold a local count for one filesystem, represented by mnt. If the processor needs to adjust the write count for that filesystem, it's a simple matter of incrementing or decrementing count. When the processor's attention turns to a different filesystem, it must first adjust the global count for the old filesystem, then it can switch its local mnt_writer structure to the new one. The result is a compromise between purely local and purely global counters which yields "good enough" performance on benchmarks designed to stress the system.
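
The caching scheme can be illustrated with a single-threaded model. This is a sketch under several assumptions and not the merged code: struct toy_vfsmount, struct toy_writer, and the toy_ functions are hypothetical, the spinlock is omitted because nothing here runs concurrently, and the cached count is signed so that a drop on a different CPU than the matching want can be represented.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mount with a shared count that is expensive to touch. */
struct toy_vfsmount {
    long global_writers;
};

/* One per CPU; modeled on struct mnt_writer, minus the spinlock. */
struct toy_writer {
    long count;
    struct toy_vfsmount *mnt;
};

/* Fold the cached count into the old mount, then cache the new one. */
static void toy_writer_switch(struct toy_writer *w, struct toy_vfsmount *mnt)
{
    if (w->mnt == mnt)
        return;
    if (w->mnt != NULL)
        w->mnt->global_writers += w->count;
    w->count = 0;
    w->mnt = mnt;
}

/* Take write access on this CPU: purely local when mnt is already cached. */
void toy_writer_get(struct toy_writer *w, struct toy_vfsmount *mnt)
{
    assert(mnt != NULL);
    toy_writer_switch(w, mnt);
    w->count++;
}

/* Release write access on this CPU. */
void toy_writer_put(struct toy_writer *w, struct toy_vfsmount *mnt)
{
    assert(mnt != NULL);
    toy_writer_switch(w, mnt);
    w->count--;
}

/* Slow path: global count plus whatever each CPU still has cached. */
long toy_mount_writers(const struct toy_vfsmount *mnt,
                       const struct toy_writer *cpus, size_t nr_cpus)
{
    long total = mnt->global_writers;

    for (size_t i = 0; i < nr_cpus; i++)
        if (cpus[i].mnt == mnt)
            total += cpus[i].count;
    return total;
}
```

Memory use here is one small structure per CPU plus one counter per mount, rather than one counter per CPU per mount; the price is a write to the shared counter each time a CPU switches from one mount to another.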

Read-only bind mounts join with other features (such as shared subtrees) to create a flexible set of tools for the construction of the filesystem namespace. It is not clear how much of this functionality is being used at this time, but, as the implementation of containers in the mainline gets closer to completion, there is likely to be more interest in this capability. Linux systems in coming years may have much more complex filesystem layouts than have been seen in the past.

Read-only bind mounts

Posted May 15, 2008 13:56 UTC (Thu) by baryluk (guest, #52098)

It is also useful in a backup situation, when you want users to have access to backup data, but
read-only. Backing up data needs write access to the partition which holds the backup, and
you probably want to save space using hard links, so you must have both read-write access for
root and read-only access for users. The backed-up data keeps its original permissions, which can
state that a user has read-write access to it. This is bad because, if files are hard-linked between
snapshots, file modifications can destroy previous snapshots! And stupid users can also destroy
their own backups, which is also bad :D

Very useful feature.

Read-only bind mounts

Posted May 25, 2008 12:53 UTC (Sun) by muwlgr (guest, #35359)

Over the years, Linux is getting closer and closer to Plan 9 :>

Read-only bind mounts

Posted May 25, 2008 15:05 UTC (Sun) by nix (subscriber, #2304)

This *is* Al Viro's explicit goal, yes. I don't think it's a bad one.

Read-only bind mounts

Posted Jul 1, 2008 2:56 UTC (Tue) by uriel (guest, #20754)

I hope he gets around to implementing proper union mounts soon. At least there has been some progress on the user-space front, and we have a 9mount command that is more or less safe for non-root users to use.

Plan 9 from User Space has also made quite a bit of progress and is quite usable (one can even use venti to have automatic snapshot backups on almost any *nix system, which is quite a bit better than Apple's Time Machine).

And now there is 9vx, which makes running Plan 9 kernels on top of Linux a breeze...

Read-only bind mounts

Posted Aug 28, 2008 20:03 UTC (Thu) by john.at.satlantic (guest, #53644)

Trouble on ScientificLinux 5.1 (newest kernel as of this writing):

    mount -o remount,ro /home/new-mount-point/some-data

makes all of /home read-only.

Read-only bind mounts

Posted Aug 31, 2008 10:58 UTC (Sun) by MONK (guest, #53684)

I tried it on Ubuntu 8.10 and, although the mount output said ro, I was still able to change files under the bind mount point :(

Read-only bind mounts

Posted Nov 6, 2009 14:49 UTC (Fri) by terryburton (guest, #26261)

It seems unfortunate that the situation still isn't remedied after 18 months, so I've raised this with the maintainers of util-linux-ng...

"[security] mount: Read-only bind mount silent failure then misreporting options"

http://thread.gmane.org/gmane.linux.utilities.util-linux-...

Have fun,

Terry

http://www.terryburton.co.uk

Read-only bind mounts STILL BROKEN in mainline?

Posted Dec 7, 2011 21:03 UTC (Wed) by gvy (guest, #11981)

As of 2.6.39 and 3.0.8 at least:

    # mount -o bind,ro /home /mnt/cdrom
    mount: warning: /mnt/cdrom seems to be mounted read-write.
    # _

OpenVZ 2.6.32 works fine though -- seems the fine folks over there did fix it.

Read-only bind mounts

Posted Mar 22, 2015 15:05 UTC (Sun) by dmjacobsen (guest, #101610)

As other commenters posted, the given method for producing a read-only bind mount will actually remount all the underlying filesystems (e.g., bind mounting /home/test1/asdf to /mnt, then making /mnt read-only will remount /home to be read-only).

To do this and *only* remount the bind mount as read-only:

    mount -o bind /home/test/asdf /mnt
    mount -o bind,remount,ro /mnt