A brief history of union mounts
Several weeks ago, I mentioned on my blog that I planned to move out of programming in the near future. A few days later I received this email from a kernel hacker friend:
At first, I thought we were losing a great hacker... But then I read on your blog: "Don't worry, I'm going to get union mounts into mainline before I change careers," and I realized this means you'll be with us for a few years yet! :)
How long has union mounts existed without going into the mainline Linux kernel? Well, to put it in a human perspective, if you'd been born the same year as the first Linux implementation of union mounts, you'd be writing your college application essays right now. Werner Almsberger began work on the Inheriting File System, one of the early ancestors of Linux union mounts, in 1993 - 17 years ago!
Background
A union mount does the opposite of a normal mount: Instead of hiding the namespace of the file system covered by the new mount, it shows a combination of the namespaces of the unioned file systems. Some use cases include a writable live CD/DVD-based system (without a complicated mess of symbolic links, bind mounts, and writable directories), and a shared base file system used by multiple clients. For an extremely detailed review of unioning file systems in general, see the LWN series:
- Unioning file systems:
Architecture, features, and design
- Unioning file systems:
Union directories and mounts
- Unioning file systems: unionfs and aufs
This article will provide a high-level overview of various implementations of union mounts from the original 1993 Inheriting File System through the present day VFS-based union mount implementation and plans for near-term development. We deliberately leave aside unionfs, aufs, and other non-VFS implementations of unioning, in large part because the probability of merging a non-VFS unioning file system into mainline appears to be even lower than that of a VFS-based solution.
readdir() redux
Throughout this article, we will place special emphasis on the evolution of readdir(), since historically it has been the greatest stumbling block for any implementation of union mounts. A summary from the first article in the LWN unioning file systems series:One of the great tragedies of the UNIX file system interface is the enshrinement of readdir(), telldir(), seekdir(), etc. family in the POSIX standard. An application may begin reading directory entries and pause at any time, restarting later from the "same" place in the directory. The kernel must give out 32-bit magic values which allow it to restart the readdir() from the point where it last stopped. Originally, this was implemented the same way as positions in a file: the directory entries were stored sequentially in a file and the number returned was the offset of the next directory entry from the beginning of the directory. Newer file systems use more complex schemes and the value returned is no longer a simple offset. To support readdir(), a unioning file system must merge the entries from lower file systems, remove duplicates and whiteouts, and create some sort of stable mapping that allows it to resume readdir() correctly. Support from userspace libraries can make this easier by caching the results in user memory.
Union mounts development time line
As mentioned earlier, one of the first implementations of a unioning was the Inheriting File System. In a pattern to be repeated by many future developers, Werner quickly became disenchanted with the complexity of the implementation of IFS and stopped working on it, suggesting that future developers try a mixed user/kernel implementation instead:
Well, I completed it to the point where it was a nice proof of concept, but still with problems (leaks inodes, probably has a few races left, was also a bit too liberal with locking, etc.).
Then I looked back at what I did and was disgusted by its complexity. So I decided that, before I might even consider proposing inclusion into the mainstream kernel, I'd have to see how much poorer (performance-wise) a user-space implementation would be. I did some initial hacking on NFS until I convinced myself that userfs might be the better approach. Unfortunately, I never found the time to work on that.
Many other kernel developers agreed with Werner. One of Linus Torvalds' earliest recorded NAKs of a kernel-based union file system came in 1996:
In 1998, Werner updated his IFS page to suggest working on a unioning file system as a good academic research topic:
Sounds like a very nice master's thesis topic for some good Linux hacker ;-) [...] So far nobody has taken the challenge. So, if you're an aspiring kernel hacker, aren't afraid of complexity, have a lot of time, and are looking for an interesting but useful project, you may just have found it :-)
Around 2003 - 2004, Jan Blunck took up the gauntlet Werner threw down and began working on union mounts for his thesis. The union mount implementation Jan produced lay dormant until 2007, when discussion about merging unionfs into mainline triggered renewed interest in a VFS-based version of unioning. At that point, Bharata B. Rao took the lead and began working with Jan Blunck on a new version of union mounts. Bharata and Jan posted several versions in 2007.
The first version posted in April 2007 used Jan's original strategy of keeping two pointers in the dentry for each directory, one pointing to the directory below this dentry's in the union stack, and one to the dentry of the topmost directory. The drawback to this implementation is that each file system can only be in one union stack at a time, since dentries are shared between all mounts of the same underlying file system.
The second version posted in May 2007 implemented yet another minor variation on in-kernel readdir(), this time using per file pointer cookies. From the patch set's documentation:
When two processes issue readdir()/getdents() call on the same unioned directory, both of them would be referring to the same dentries via their file structures. So it becomes necessary to maintain rdstate separately for these two instances. This is achieved by using a cookie variable in the rdstate. Each of these rdstate instances would get a different cookie thereby differentiating them.
In June 2007, Bharata and Jan posted a third version with an important and novel change to the way union stacks are formed. They replaced the in-dentry links to the topmost and lower directories with an external structure of pointers to (vfsmount, dentry) pairs. For the first time, a file system could be part of more than one union mount. From the patch set's documentation:
In this new approach, the way union stack is built and traversed has been changed. Instead of dentry-to-dentry links forming the stack between different layers, we now have (vfsmount, dentry) pairs as the building blocks of the union stack. Since this (vfsmount, dentry) combination is unique across all namespaces, we should be able to maintain the union stack sanely even if the filesystem is union mounted privately in different namespaces or if it appears under different mounts due to various types of bind mounts.
In July 2007, Jan
posted
a fourth version with some relatively minor changes to the way
whiteouts were implemented, among a few other things. Jan says,
"I'm able to compile the kernel with this patches applied on a 3
layer union mount with the [separate] layers bind mounted to different
locations. I haven't done any performance tests since I think there is
a more important topic ahead: better readdir() support.
"
In December 2007, Bharata B. Rao posted a fifth version that implemented another in-kernel version of readdir():
In this approach, the cached dirents are given offsets in the form of linearly increasing indices/cookies (like 0, 1, 2,...). This helps us to uniformly define offsets across all the directories of the union irrespective of the type of filesystem involved. Also this is needed to define a seek behaviour on the union mounted directory. This cache is stored as part of the struct file of the topmost directory of the union and will remain as long as the directory is kept open.
However, this approach had multiple problems, including excessive use of kernel memory to cache directory entries and to keep the mapping of indices to dentries.
readdir() continued to be a stumbling block, and union mounts
development slowed down for most of 2008. In April 2008, Nagabhushan
BS implemented
and posted
a version of union mounts with most of the readdir() logic
moved to glibc. "I went through Bharata's RFC post on glibc
based Union Mount readdir solution
(http://lkml.org/lkml/2008/3/11/34)
and have come up with patches against glibc to implement the
same.
"
However, moving the complexity to user space wasn't the panacea everyone had hoped for. The glibc maintainers had many objections, the kernel interface was an obvious kludge (returning whiteouts for "." to signal a unioned directory), and no one could figure out how to handle NFS sanely.
In November 2008, Miklos Szeredi posted a simplified version of union mounts that implemented readdir() in the kernel.
The directory entries are read starting from the top layer and they are maintained in a cache. Subsequently when the entries from the bottom layers of the union stack are read they are checked for duplicates (in the cache) before being passed out to the user space. There can be multiple calls to readdir/getdents routines for reading the entries of a single directory. But union directory cache is not maintained across these calls. Instead for every call, the previously read entries are re-read into the cache and newly read entries are compared against these for duplicates before being they are returned to user space.
This implementation only worked for file systems that return a simple increasing offset in the d_off field for readdir(). So ext2 worked, but any file system with a modern directory hashing scheme did not.
In early 2009, I started to get interested in union mounts. I talked
to several groups inside Red Hat and asked them what they needed most
from file systems. I heard the same story over and over: "We
really really need a unioning file system, but for some reason no one
at Red Hat will support unionfs...
" I did some research on the
available implementations and decided to go to work on Jan Blunck's
union mount patch set.
In May 2009, Jan Blunck and I posted a version of union mounts that implemented in-kernel readdir() using a new concept: the fallthru directory entry. The basic idea is that the first time readdir() is called on a directory, the visible directory entries from all the underlying directories are copied up to the topmost directory as fallthru directory entries. This eliminated all the problems I knew of in previous readdir() implementations, but required the topmost file system to always be read-write. This implementation also was limited to only two layers: one read-only file system overlaid with one read-write file system because we were concerned with lock ordering problems.
In October 2009, I posted a version of union mounts that implemented some of the more difficult system calls, such as truncate(), pivot_root(), and rename(). However, implementing chmod() and other system calls that modified files without opening them turned out to be fairly difficult with the current code base. We thought the hard part was copying up file data in open(), rename, and link(), but it turned out they were somewhat easier to implement because they already looked up the parent directory of the file to be altered. For union mounts, we need the parent directory's dentry and vfsmount in order to create a new version of the file in the topmost level of the union file system if necessary. open(), rename, and link() also needed the parent directory in order to create new directory entries, so we just reused the parent in the union mount in-kernel copyup code. But system calls like chmod() that only alter existing files did not bother to lookup the parent directory's path, only the target. Regretfully, I decided to start on a major rewrite.
In March 2010, I posted a rewrite of the pathname lookup mechanism for union mounts, taking into account Al Viro's recent VFS cleanups and removing a great deal of unnecessary code duplication.
In May 2010, I posted the first version of union mounts that implemented nearly all file related system calls correctly. The four exceptions were fchmod(), fchown(), fsetxattr(), and futimensat(), which will fail on read-only file descriptors. (UNIX is full of surprises; none of the VFS experts I talked to knew that these system calls would succeed on a read-only fd.)
The central primitive in this version is a function called user_path_nd(). It is a combination of user_path(), which looks up a pathname and returns the corresponding dentry and vfsmount, and user_path_parent(), which looks up the parent directory of the file or directory given by the pathname and returns the struct nameidata for the parent. (struct nameidata is too complex to describe in full here, but suffice to say it is usually needed to create an entry in a directory.) user_path_nd() returns both the parent's nameidata and the child's path. Once we have both these pieces of information, we can do an in-kernel copyup of a file in chmod() or any other system call that modifies a file. Unfortunately, user_path_nd() is also the weakest point in this version of union mounts: it's racy, inefficient, and copies up files even if the system call fails.
The day after I posted that version, I flew to North Carolina for a long-anticipated in-person code review with Al Viro. We spent three days in his office painfully reviewing the entire union mount implementation. Al immediately figured out how to delete a third of the code I'd spent the last year carefully massaging, and then outlined how to rewrite the other two-thirds of the code more elegantly, including user_path_nd(). As a result of this code review marathon, Linux has a feature-complete implementation of union mounts that has undergone a full code review by the Linux VFS maintainer for the first time. Of course, the resulting todo list is long and complex, and some problems may turn out to be insoluble, but it's an important step forward.
The biggest design change Al suggested was to move the head of the union stack back into the dentry, while keeping the rest of the union stack in a singly linked list of struct union_dir's allocated external to the dentries for the read-only parts of the union stack. This combines the speed and elegance of Jan Blunck's original design using in-dentry pointers to the union stack, with the flexibility of Bharata B. Rao's (vfsmount, dentry) pairs, which allow file systems to be part of many read-only layers. This change removed the entire union stack hash table and the associated lookup logic and shrunk the union_dir struct from 7 members to 2. I posted this hybrid linked list version on June 15, 2010.
Most recently, on June 25th, 2010, I posted a version that implemented submounts in the read-only layers, as well as allowed more than two read-only layers again. Then I went on a two week vacation - the longest vacation I've had since I started working on union mounts - and tried to forget everything I knew about it.
Future Work
The next step is to implement the remainder of Al Viro's review comments. The last big-ticket item is rewriting user_path_nd() and the in-kernel file copyup boilerplate. After that, it's back for another round of code review from Al and the other VFS maintainers. The 2010 Linux Storage and File Systems workshop is in early August. With luck we can hash out any remaining architectural problems face-to-face at the workshop and possibly merge union mounts into mainline before it's old enough to vote. Or it might languish for another 17 years outside the kernel. Such are the vicissitudes of Linux kernel development.
Acknowledgments: I want to extend special thanks to the following people: Kevin Roderick, who provided moral support, Tim Bowen, who gave me a free day at the Spoke6 co-working space while I worked on this article, and, of course, Jake Edge, whose editorial suggestions were, as usual, right on.
Index entries for this article | |
---|---|
Kernel | Filesystems/Union |
Kernel | Union mounts |
GuestArticles | Aurora, Valerie |
Posted Jul 15, 2010 10:16 UTC (Thu)
by aakef (subscriber, #38030)
[Link] (4 responses)
Cheers,
Posted Jul 15, 2010 14:09 UTC (Thu)
by nix (subscriber, #2304)
[Link] (1 responses)
Posted Jul 15, 2010 19:26 UTC (Thu)
by aakef (subscriber, #38030)
[Link]
Posted Jul 15, 2010 17:39 UTC (Thu)
by BenHutchings (subscriber, #37955)
[Link] (1 responses)
Posted Jul 15, 2010 19:24 UTC (Thu)
by aakef (subscriber, #38030)
[Link]
Posted Jul 22, 2010 12:03 UTC (Thu)
by obi (guest, #5784)
[Link] (13 responses)
I keep hearing how POSIX is broken in myriad ways; see for instance the recent discussion about locking at Lennart Poettering's blog. Maybe we should start keeping a record of all these "Great Tragedies of POSIX". So maybe one day we can do something about it. Or are we really stuck with POSIX for all eternity?
Posted Jul 22, 2010 13:43 UTC (Thu)
by neilbrown (subscriber, #359)
[Link] (9 responses)
But telldir/seekdir isn't something that POSIX got wrong. Maybe the cookie size (32bit) is a bit small, but that is as easy to overcome as the Y2K038 bug.
Either you require readdir to return all the entries in a directory in one hit, or you need a stable pointer into the directory so you can ask for the 'next' chunk.
It isn't hard to design a directory layout which allows stable indexes - it just requires a bit of fore-thought.
It *is* hard to synthesis such pointers from a union of two directories as you cannot predict or control the pointers you get from each. However it is possible to create a stable solution.
Given that the current union-mount proposal requires "white-out" objects to be created in the on-top filessytem to make objects from the below filesystem disappear, it would not be unreasonable to instead require 'white-in' objects which make objects from the below filesystem appear.
This would require a 'copy-up' of the directory when it is read (though more typically, when the directory is changed) which is a bit more harsh than the 'copy-up' that is required of files on e.g. a chmod. But it would give reliable semantics and in many real cases would not be a real burden.
To be a little more explicit: The common case with union mount is (I expect) that you union-mount an empty filesystem on top of a read-only filesystem, and then make changes. Each time you make changes in a new directory you need would to copy-up that directory and all parents that have not yet been copied-up. The copy-up involves creating a white-in object in the top directory for each object in the bottom. (It is a little more complicated than white-out as you want to store the 'DT_*' type of the underlying object). Then further changes simply happen to the top level directory.
(unfortunately the margin is too small to contain my elegant implementation....)
Posted Jul 22, 2010 21:20 UTC (Thu)
by nix (subscriber, #2304)
[Link] (6 responses)
Posted Jul 23, 2010 2:51 UTC (Fri)
by neilbrown (subscriber, #359)
[Link] (5 responses)
But in general I agree - signals make it very hard to write correct programs.
Posted Jul 24, 2010 18:16 UTC (Sat)
by nix (subscriber, #2304)
[Link] (4 responses)
(Also, POSIX doesn't ban getting EINTR on reads from a regular file, so prudence dictates expecting it.)
Posted Jul 25, 2010 4:20 UTC (Sun)
by neilbrown (subscriber, #359)
[Link] (3 responses)
Posix has a concept of 'slow' and 'not slow' reads where 'slow' reads can result in a short read or EINTR, and disk IO is explcitly not a slow read. So if your file is on disk you cannot get EINTR.
Posted Jul 31, 2010 20:27 UTC (Sat)
by nix (subscriber, #2304)
[Link] (2 responses)
Now perhaps this is a de facto universal implementation detail, but as far as I can see it isn't in POSIX itself. (Maybe I just haven't looked in the right place?)
Posted Aug 1, 2010 10:01 UTC (Sun)
by neilbrown (subscriber, #359)
[Link] (1 responses)
http://www.opengroup.org/onlinepubs/9699919799/functions/...
appears to allow any read to be interrupted, and says in the "informative" section "The issue of which files or file types are interruptible is considered an implementation design issue. This is often affected primarily by hardware and reliability issues." which is singularly unhelpful.
I was basing my statements on "man 7 signal" which does talk about "slow" devices. Clearly this isn't normative....
As you say, POSIX by itself is enough to make one wince...
Posted Aug 4, 2010 22:45 UTC (Wed)
by nix (subscriber, #2304)
[Link]
Even 'man 7 signal' says clearly that 'The details vary across Unix systems; below, the details for Linux', and that's not terribly useful really for the vast majority of software. (I suppose you can rely on it in mdadm ;} )
Posted Jul 23, 2010 20:01 UTC (Fri)
by vaurora (guest, #38407)
[Link] (1 responses)
The implementation of fallthrus is pretty small, around a hundred lines in main VFS and then you reuse the whiteout infrastructure in the client file systems.
Posted Jul 24, 2010 2:10 UTC (Sat)
by neilbrown (subscriber, #359)
[Link]
Yes.... after writing that I went back through the original article, noticed 'fallthru' this time, and felt a bit sheepish.
I don't quite see either how you would implement fallthru using whiteout though, or why you would still want whiteout if you were using copy-up + fallthru..
It is a pity that a block-based COW solution is so inefficent - it is such a simple solution that would address many of the use-cases (not the NFS-as-underlying-filesystem case of course).
Posted Jul 25, 2010 4:18 UTC (Sun)
by quotemstr (subscriber, #45331)
[Link] (2 responses)
In the free world, there's no arbitrary 64-handle wait limit. You can rename and delete filenames while the files they point to are still open. Shared libraries are versioned. You have a variety of fork and spawn options instead of the horror that is CreateProcess. You have a real pty interface --- it's impossible to emulate a win32 console, or even make isatty do the right thing.
You don't have eldrich horrors like WSAAcceptEx. The kernel doesn't steal your threads to do its own book-keeping. For the love of god, you don't have CreateRemoteThread. IO redirection actually works --- programs seldom open /dev/tty directly, unlike the Windows world, where the canonical way to get colored output is to write it directly to the console buffer.
In POSIXland, you don't have drive letters. You don't need to be root to create a symlink. And you don't use backslashes for directory separators. (At least most Windows software accepts forward slashes too.)
Best of all, in POSIXland, programs accept a true vector of command-line arguments. In Windows, each program receives one string that's actually the whole command line and uses its own quoting and wildcard rules to interpret it. Naturally, this approach yields wildly unpredictable results.
So, of course POSIX can be improved. But it's hardly "broken in myriad ways" ; you can only say that from a position of merciful privilege.
Posted Jul 25, 2010 7:11 UTC (Sun)
by bronson (subscriber, #4806)
[Link]
Posted Jul 31, 2010 20:41 UTC (Sat)
by nix (subscriber, #2304)
[Link]
MS really has strangled itself in the name of backward compatibility with a broken original system. Sure, we try to be compatible with older Unixes, but at least Unix was a sane base to build on.
A brief history of union mounts
In February 2007 I spent about 2 weeks on unionfs-fuse to add copy-on-write and it was then suitable for my needs.
I still spend some of my spare free time on it to improve it, but this seems to be nothing compared to to the kernel union mount efforts. Pity that we cannot combine our work. Using a kernel solution somehow seems to be more an ideological decision than rational. Yes, I agree, unionfs-fuse is slow for some operations, but now lets assume we would spend some time on it to improve that...
Bernd
A brief history of union mounts
A brief history of union mounts
A brief history of union mounts
A brief history of union mounts
But then also on the other hand, I use unionfs-fuse my own self-made ubuntu live usb sticks and I don't see any difference in boot time and application load time between usb-stick vs. usb-stick + unionfs-fuse. I have not tried it on high speed SSDs yet (I simply don't have any).
A brief history of union mounts
One of the great tragedies of the UNIX file system interface is the enshrinement of readdir(), telldir(), seekdir(), etc. family in the POSIX standard...
A brief history of union mounts
The stable pointer doesn't *have* to be exposed to user-space, but if you are going to have any hope of supporting a network filesystem like NFS, then it has to be exposed to the network protocol, so it has to exist.
A readdir simply uses the top-level directory.
Any lookup which hits a white-in object (or name) continues the lookup in the underlying filesystem.
A brief history of union mounts
It could be a nice long series - with lots of guest editors probably.
Hell yes. And the suck extends to fairly simple areas. Just saying 'EINTR' and 'short reads' is enough to make anyone who's ever written even a trivial C program on a Unix platform wince. (What do you mean I need a horrible-looking for loop to read a file reliably?!)
A brief history of union mounts
A brief history of union mounts
A brief history of union mounts
I guess being on disk on another machine doesn't count. :-(
A brief history of union mounts
A brief history of union mounts
A brief history of union mounts
A brief history of union mounts
A brief history of union mounts
Sure, POSIX has flaws. But at least it's not Win32. When you're frustrated by POSIX, take a deep breath and say that three times.
A brief history of union mounts
/dev/null
exists only in /dev
; you can create a file called "null" anywhere else. Try creating a file called "prn" anywhere on a Windows system.
A brief history of union mounts
A brief history of union mounts