Linux Storage and Filesystem workshop, day 1
Things began with a quick recap of the action items from the previous year. Some of these had been fairly well resolved over that time; these include power management, support for object storage devices, fibre channel over Ethernet, barriers on by default in ext4, the fallocate() system call, and enabling relatime by default. The record for some other objectives is not quite so good; low-level error handling is still not what it could be, "too much work" has been done with I/O bandwidth controllers while nothing has made it upstream, the union filesystem problem has not been solved, etc. As a whole, a lot has been done, but a lot remains to do.
Device discovery
Joel Becker and Kay Sievers led a session on device discovery. On a contemporary system, device numbers are not stable across reboots, and neither are device names. So anything in the system which must work with block devices and filesystems must somehow find the relevant device first. Currently, that is being done by scanning through all of the devices on the system. That works reasonably well on a laptop, but it is a real problem on systems with huge numbers of block devices. There are stories of large systems taking hours to boot, with the bulk of that time being spent scanning (repeatedly - once for every mount request) through known devices.
What comes out of the discussion, of course, is that user space needs a better way to locate devices. A given program may be searching for a specific filesystem label, UUID, or something else; a good search API would support all of these modes and more. What would be best would be to build some sort of database where each new device is added at discovery time. As additional information becomes available (when a filesystem label is found, for example), it is added to the database. Then, when a specific search is done, the information has already been gathered and a scan of the system's devices is no longer necessary.
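As a rough illustration of the database idea, the sketch below keeps an in-memory index that is filled in as devices are discovered and as further attributes become known. The class and method names are invented for this example; they do not correspond to libblkid or to any existing kernel or udev interface.

```python
from collections import defaultdict


class DeviceDatabase:
    """In-memory index of block devices, filled in as devices are discovered."""

    def __init__(self):
        # device path -> {attribute name: value}
        self._attrs = {}
        # (attribute name, value) -> set of device paths
        self._index = defaultdict(set)

    def add_device(self, path):
        """Record a newly discovered device; attributes may arrive later."""
        self._attrs.setdefault(path, {})

    def set_attribute(self, path, name, value):
        """Attach information (label, UUID, ...) learned after discovery."""
        attrs = self._attrs.setdefault(path, {})
        old = attrs.get(name)
        if old is not None:
            self._index[(name, old)].discard(path)
        attrs[name] = value
        self._index[(name, value)].add(path)

    def remove_device(self, path):
        """Forget a device and every index entry that points at it."""
        for name, value in self._attrs.pop(path, {}).items():
            self._index[(name, value)].discard(path)

    def find(self, name, value):
        """Return matching device paths without scanning any device."""
        return sorted(self._index.get((name, value), ()))
```

A search by label or UUID then becomes a dictionary lookup rather than a pass over every device on the system.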
In the simplest form, this database can be the various directories full of symbolic links that udev creates now. These directories solve much of the problem, but they can never be a complete solution for one reason: some types of devices - iSCSI targets, for example - do not really exist for the system until user space has connected to them. Multipath devices also throw a spanner into the works. For this reason, Ted Ts'o asserted that some sort of programmatic API will always be needed.
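For devices that udev already knows about, a lookup through the symbolic-link directories might look like the following sketch. It assumes the usual /dev/disk layout (by-uuid, by-label and so on); the function name and its `base` parameter are hypothetical, the latter existing only so the code can be tried against a scratch directory.

```python
import os


def resolve_by_link(kind, value, base="/dev/disk"):
    """Resolve a device through udev's symlink directories.

    kind is a directory name such as "by-uuid" or "by-label".
    Returns the real device path, or None if no such link exists.
    """
    link = os.path.join(base, kind, value)
    if not os.path.islink(link):
        return None
    return os.path.realpath(link)
```

A device which has not yet been connected, such as an iSCSI target, has no link to find, which is the gap the paragraph above describes.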
Not a lot of progress was made toward specifying a solution; the main concern, seemingly, was coming to a common understanding of the problem. What's likely to happen is that the libblkid library will be extended to provide the needed functionality. Next year, we'll see if that has been done.
Asynchronous and direct I/O
Zach Brown's stated purpose in this session was to "just rant for 45 minutes" about the poor state of asynchronous I/O (AIO) support in Linux. After ten years, he says, we still have an inadequate system which has never been fixed. The problems with Linux AIO are well documented: only a few operations are truly asynchronous, the internal API is terrible, it does not properly support the POSIX AIO API, etc. There are, Zach says, people wanting to do a lot more with AIO than is currently supported by Linux.
That said, various alternatives have been proposed over time but nobody ever tests them.
The conversation then shifted for a bit; Jeff Moyer took a turn to complain about the related topic of direct I/O. It works poorly for applications, he says, its semantics are different for different filesystems, the internal I/O paths for direct I/O are completely different from those used for buffered I/O, and it is full of races and corner cases. Not a pretty picture.
One of the biggest complications with direct I/O is the need for the system to support simultaneous direct and buffered I/O on the same file. Prohibiting that combination would simplify the problem considerably, but that is a hard thing to do. In particular, it would tend to break backups, which often want to read (in buffered mode) a file which is open for direct I/O. There was some talk of adding a new O_REALLYDIRECT mode which would lock out buffered operations, but it's not clear that the advantages would make this change worthwhile.
Another thing that would help with direct I/O would be to remove the alignment restrictions on I/O buffers. That's a hard change to make, though; many disk controllers can only perform DMA to properly-aligned buffers. So allowing unaligned buffers would force the kernel to copy data internally, which rather defeats the purpose of direct I/O. There is one use case, though, where direct I/O might still make sense: some direct I/O users really only want to avoid filling the system page cache with their data. Using the fadvise() system call is arguably a better way of achieving that goal, but application developers are said to distrust it.
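For the "just keep my data out of the page cache" case, the fadvise() approach could look something like this sketch. It uses ordinary buffered reads and then posix_fadvise() with POSIX_FADV_DONTNEED; the function name is made up for the example, and the call is only advice which the kernel is free to ignore.

```python
import os


def read_without_caching(path, chunk_size=1 << 20):
    """Read a file with buffered I/O, then ask the kernel to drop its pages.

    Returns the number of bytes read.
    """
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            chunk = os.read(fd, chunk_size)
            if not chunk:
                break
            total += len(chunk)
        # Offset 0 with length 0 covers the whole file. This is a hint,
        # not a guarantee that the pages leave the cache.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
    return total
```

Unlike direct I/O, nothing here places alignment requirements on the buffer.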
All told, it seems from the discussion that there is not a whole lot to be done to improve direct I/O on Linux.
Returning to the AIO problem, the developers discussed Zach's proposed acall() API, which shifts blocking operations into special-purpose kernel threads. The use of threads in this manner promises a better AIO implementation than Linux has ever had in the past. But there is a cost: some core scheduler changes need to be made to support acall(). Among other things, there are some complexities related to transferring credentials between threads, propagating signals from AIO threads back to the original process, etc. The end result is that scheduler performance may well suffer slightly. The scheduler developers tend to be sensitive to even very small performance penalties, so there may well be pushback when acall() is proposed for mainline inclusion.
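The general shape of the idea, handing a blocking operation to a worker thread and getting a handle back immediately, can be shown with a user-space analogy. This is emphatically not the acall() interface, which is not described in detail here; the names below are invented, and a user-space thread pool avoids exactly the credential and signal issues that make the kernel version hard.

```python
from concurrent.futures import ThreadPoolExecutor

# Worker threads stand in for the special-purpose kernel threads.
_pool = ThreadPoolExecutor(max_workers=4)


def acall_like(func, *args):
    """Submit a blocking call to a worker; returns a Future immediately."""
    return _pool.submit(func, *args)


def blocking_read(path):
    """An ordinary synchronous operation that may block on the disk."""
    with open(path, "rb") as f:
        return f.read()
```

The caller can continue with other work and collect the result later with `future.result()`.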
The addition of acall() would also add a certain maintenance burden. Whenever a kernel developer makes a change to the task structure, that developer would have to think about whether the change is relevant to acall() and whether it would need to be transferred to or from worker threads.
The conclusion was that acall() looks promising, and that the developers in the room thought that it could work. They also agreed, though, that a number of the relevant people were not in the room, so the question of whether acall() is appropriate for the kernel as a whole could not be answered.
RAID unification
The kernel currently contains two software RAID implementations, found in the MD and device mapper (DM) subsystems. Additionally, the Btrfs filesystem is gaining RAID capabilities of its own, a process which is expected to continue in the future. It is generally agreed that having three (or more) versions of RAID in the kernel is not an optimal situation. What a proper solution will look like, though, is not all that clear.
The session on RAID unification started with this question: who thinks that block subsystem development should be happening in the device mapper layer? A single hand was raised. In general, it seems, the developers in the room had a relatively low opinion of the device mapper RAID code. It should be said, of course, that there were no DM developers present.
What it comes down to is that the next generation of filesystems wants to include multiple device support. Plans for Btrfs include eventual RAID 6 support, but Btrfs developer Chris Mason has no interest in writing that code. It would be much nicer to use a generic RAID layer provided by the kernel. There are challenges, though. For example, a RAID-aware filesystem really wants to use different stripe sizes for data and metadata. Standard RAID, which knows little about the filesystems built on it, does not provide any such feature.
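To see why the stripe size matters, consider the arithmetic of plain RAID 0 style striping, sketched below. This is textbook striping, not the Btrfs layout or any existing kernel interface, and the function name is an assumption of the example.

```python
def stripe_location(offset, stripe_size, n_devices):
    """Map a logical byte offset to (device index, byte offset on that device)."""
    stripe = offset // stripe_size        # which stripe unit the offset is in
    within = offset % stripe_size         # position inside that stripe unit
    device = stripe % n_devices           # stripe units rotate across devices
    device_offset = (stripe // n_devices) * stripe_size + within
    return device, device_offset
```

With a large stripe size a small metadata block tends to sit entirely on one device, while a small stripe size spreads a large data extent over all of them; a filesystem which knows the difference would like to pick the size per allocation.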
So what would a filesystem RAID API look like? Christoph Hellwig is working on this problem, but he's not ready to deal with the filesystem problem yet. Instead, he's going to start by figuring out how to unify the MD and DM RAID code. Some of this work may involve creating a set of tables in the block layer for mapping specific regions of a virtual device onto real regions in a lower-level device. The block layer already does that - it's how partitions work - but incorporating RAID would complicate things considerably. But, once that's done, we'll be a lot closer to having a general-purpose RAID layer which can be used by multiple callers.
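A mapping table of the kind described might be sketched as a sorted list of extents with a binary search. The structure and its field names are assumptions made for illustration; nothing here reflects the actual block-layer data structures.

```python
import bisect
from typing import NamedTuple


class Extent(NamedTuple):
    virt_start: int   # first sector of the region on the virtual device
    length: int       # number of sectors in the region
    device: str       # name of the lower-level device
    phys_start: int   # first sector of the region on that device


class MappingTable:
    def __init__(self, extents):
        self._extents = sorted(extents)
        self._starts = [e.virt_start for e in self._extents]

    def lookup(self, sector):
        """Translate a virtual sector to (device, physical sector)."""
        i = bisect.bisect_right(self._starts, sector) - 1
        if i >= 0:
            e = self._extents[i]
            if sector < e.virt_start + e.length:
                return e.device, e.phys_start + (sector - e.virt_start)
        raise ValueError(f"sector {sector} is not mapped")
```

A partition is the degenerate case of a single extent; RAID needs many extents, possibly on several devices, which is where the complication comes from.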
The talk wandered into the area of error handling for a while. In particular, the tools Linux provides to administrators to deal with bad blocks are still not what they could be. There was talk about providing a consistent interface for reporting bad blocks - including tools for mapping those blocks back to the files that contain them - as well as performing passive scanning for bad blocks.
The action items that came out of this discussion include the rework of in-kernel RAID by Christoph. After that, the process of trying to define filesystem-specific interfaces will begin.
Rename, fsync, and ponies
Prior to Ted Ts'o's session on fsync() and rename(), some joker filled the room with coloring-book pages depicting ponies. These pages reflected the sentiment that Ted has often expressed: application developers are asking too much of the filesystem, so they might as well request a pony while they're at it.
Ted apologized to the room for his part in the implementation of the data=ordered mode for ext3. This mode was added as a way to improve the security of the filesystem, but it had the side effect of flushing many changes to the filesystem within a five-second window. That allowed application developers to "get lazy" and stop worrying about whether their data had actually hit the disk at the right times. Now those developers are resisting the idea that they should begin to worry again.
This problem has a longer history than many people realize. The XFS filesystem first hit it back around 2001. But, Ted says, most application developers didn't understand why they were getting corrupted files after a crash. Rather than fix their applications, they just switched filesystems - to ext3. Things worked for some time until Ubuntu users started testing the alpha "Jaunty" release, which makes ext4 available as an installation option. At that point, they started finding zero-length files after crashes, and they blamed ext4.
But, Ted says, the real problem is the missing fsync() calls. There are a number of reasons why they are not there, including developer laziness, the problem that fsync() on ext3 has become very expensive, the difficulty involved in preserving access control lists and other extended attributes when creating new files, and concerns about the battery-life costs of forcing the disk to spin up. Ted had more sympathy for some of these reasons than others, but, he says, "the application developers outnumber us," so something will have to be done to meet their concerns.
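For reference, the pattern that applications are being asked to follow looks roughly like the sketch below: write a temporary file, fsync() it, rename it over the target, then flush the directory. The function name and the ".tmp" suffix are choices made for this example, and the sketch does nothing to preserve ACLs or other extended attributes, which is one of the difficulties mentioned above.

```python
import os


def replace_file_durably(path, data):
    """Replace path with data so that a crash leaves the old or the new contents."""
    tmp = path + ".tmp"
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            written = os.write(fd, view)   # os.write() may write only part
            view = view[written:]
        os.fsync(fd)                       # data reaches the disk before the rename
    finally:
        os.close(fd)
    os.rename(tmp, path)                   # atomically switch the name over
    # Flushing the containing directory makes the rename itself durable.
    dirfd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dirfd)
    finally:
        os.close(dirfd)
```

The fsync() on the file is the call that is usually missing, and also the one that is expensive on ext3.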
Valerie Aurora broke in to point out that application developers have been put into a position where they cannot do the right thing. A call to fsync() can stall the system for quite a while on ext3. Users don't like that either; witness the fuss caused by excessive use of fsync() by the Firefox browser. So it's not just that application developers are lazy; there are real disincentives to the use of fsync(). Ted agreed, but he also claimed that a lot of application developers are refusing to help fix the problem.
In the short term, the ext4 filesystem has gained a number of workarounds to help prevent the worst surprises. If a newly-written file is renamed on top of another, existing file, its data will be flushed to disk with the next commit. Similar things happen with files which have been truncated and rewritten. There is a performance cost to these changes, but they do make a significant part of the problem go away.
For the longer term, Ted asked: should the above-described fixes become a part of the filesystem policy for Linux? In other words, should application developers be assured that they'll be able to write a file, rename it on top of another file, omit fsync(), and not encounter zero-length files after a crash? The answer turns out to be "yes," but first Ted presented his other long-term ideas.
One of those is to improve the performance of the fsync() system call. The ext4 workarounds have also been added to ext3 when it runs in the data=writeback mode. Additionally, some block-layer fixes have been incorporated into 2.6.30. With those fixes in place, it is possible to run in data=writeback mode, avoid the zero-length file problem, and also avoid the fsync() performance problem. So, Ted asked, should data=writeback be made the default for ext3?
This idea was received with a fair amount of discomfort. The data=writeback mode brings back problems that were fixed by data=ordered; after a crash, a file which was being written could turn up with completely unrelated data in it. It could be somebody else's sensitive data. Even if it's boring data, the problem looks an awful lot like file corruption to many users. It seems like a step backward and a change which is hard to justify for a filesystem which is headed toward maintenance mode. So it would be surprising to see this change made.
[After writing the above, your editor noticed that Linus had just merged a change to make data=writeback the default for ext3 in 2.6.30. Your editor, it seems, is easily surprised.]
Finally, the idea of the fbarrier() system call was raised. Essentially, fbarrier() would ensure that any data written to a file prior to the call would be flushed to disk before any metadata changes made after the call. It could be implemented with fsync(); for ext3 data=ordered mode, it would do nothing at all. Ted did not try hard to sell this system call, saying that it was mainly there to address the laptop power consumption concern. Ric Wheeler claimed that it would be a waste of time; by the time people are actually using it, we'll all have solid-state drives in our laptops and the power concern will be gone. In general, enthusiasm for fbarrier() was low.
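Since fbarrier() was only a proposal, any code using it is hypothetical. The sketch below shows the conservative fallback mentioned above, implementing the barrier with fsync(), which is stronger than required because it also waits for the data to reach the disk. Both function names are assumptions of the example.

```python
import os


def fbarrier(fd):
    """Hypothetical stand-in for the proposed call.

    The proposal only asks for ordering: data written before the barrier
    must be flushed before metadata changes made after it. fsync() gives
    that ordering at the cost of forcing the write out immediately.
    """
    os.fsync(fd)


def write_then_rename(tmp_path, final_path, data):
    """Write data to tmp_path, then rename it over final_path."""
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()                # move Python's buffer into the kernel
        fbarrier(f.fileno())     # the data above is ordered before the rename
    os.rename(tmp_path, final_path)
```

A real fbarrier() would differ precisely in not forcing the disk to spin up at the time of the call.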
So the discussion turned back to the idea of generalizing and guaranteeing the ext4 workarounds. Chris Mason asked when there might be a time that somebody would not want to rename files safely; he did not get an answer. There was concern that these workarounds could not be allowed to hurt the performance of well-written applications. But the general sentiment was that these workarounds should become policy that all filesystems should implement.
pNFS
There was a session on supporting parallel NFS (pNFS). It was mostly a detailed, technical discussion on what sort of API is needed to allow clustered filesystems to tell pNFS about how files are distributed across servers. Your editor will confess that his eyes glazed over after a while, and his notes are relatively incoherent. Suffice to say that, eventually, OCFS2 and GFS will be able to communicate with pNFS servers and that all the people who really care about how that works will understand it.
Miscellaneous topics
The final session of the day related to "miscellaneous VFS topics"; the first had to do with eCryptfs. This filesystem provides encryption for individual files; it is currently implemented as a stacking filesystem using an ordinary filesystem to provide the real storage. The stacking nature of eCryptfs has long been a problem; now some Ubuntu developers are working to change it.
In particular, what they would like to do is to move the encryption handling directly into the VFS layer. Somehow users will supply a key to the kernel, which will then transparently handle the encryption and decryption of data. To that end, some sort of transformation layer will be provided to process the data between the page cache and the underlying block device.
One question that came up was: what happens when the user does not have a valid key? Should the VFS just provide encrypted data in that case? Al Viro raised the question of what happens when one process opens the file with a key while another one opens it without a key. At that point there will be a mixture of encrypted and clear-text pages in the cache, a situation which seems sure to lead to confusion. So it seems that the VFS will simply refuse to provide access to files if the necessary key is not provided.
There are various problems to be solved in the creation of the transformation layer - things like not letting processes modify a page while it is being encrypted or decrypted. Chris Mason noted that he faces a similar problem when generating checksums for pages in Btrfs. These are problems which can be addressed, though. But it was clear that this kind of transformation is likely to be built into the VFS in the future. Stacking filesystems just do not work well with the Linux VFS as it exists now.
Next up was David Brown, who works in the scientific high-performance computing field. David has an interesting problem. He runs massive systems with large storage arrays spread out across many systems. Whenever some process calls stat() on a file stored in that array, the entire cluster essentially has to come to a stop. Locks have to be acquired, cached pages have to be flushed out, etc., just to ensure that specific metadata (the file size in particular) is available and correct. So, if a scientist logs in and types "ls" in a large directory, the result can be 30 minutes in coming and little work gets done in the meantime. Not ideal.
What David would like is a "stat() light" call which wouldn't cause all of this trouble. It should return the metadata to the best of its knowledge, but it would not flush caches or take cluster-wide locks to obtain this information. If that means that the size is not entirely accurate, so be it. In the subsequent discussion, the idea was modified a little bit. "Slightly inaccurate" results would not be returned; instead, the size would simply be zeroed out. It was felt that returning no information at all was better than returning something which may have no real basis in reality.
Beyond that, there would likely be a mask associated with the system call. Initially it was suggested that the mask would be returned; it would have bits set to indicate which fields in the returned stat structure are valid. But it was also suggested that the mask should be an input parameter instead; the call would then do whatever was needed to provide the fields requested by the caller. Using the mask as an input parameter would avoid the need for duplicate calls in the case where the necessary information is not provided the first time around.
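The input-mask variant might be sketched as follows. Everything here is hypothetical: the bit names, the result structure, and the function itself are invented, and os.stat() merely stands in for the real work, so the sketch shows the shape of the interface rather than any saving in cost.

```python
import os
from dataclasses import dataclass

# Hypothetical request bits; these names are invented for this sketch.
WANT_MODE = 1 << 0
WANT_MTIME = 1 << 1
WANT_SIZE = 1 << 2


@dataclass
class LightStat:
    valid: int = 0      # bits for the fields that were actually filled in
    mode: int = 0
    mtime: float = 0.0
    size: int = 0       # stays zero unless WANT_SIZE was requested


def stat_light(path, mask):
    """Return only the fields named in mask; everything else stays zeroed."""
    # A cluster filesystem would take locks and flush caches only if an
    # expensive field, such as the size, was requested.
    st = os.stat(path)
    result = LightStat()
    if mask & WANT_MODE:
        result.mode = st.st_mode
        result.valid |= WANT_MODE
    if mask & WANT_MTIME:
        result.mtime = st.st_mtime
        result.valid |= WANT_MTIME
    if mask & WANT_SIZE:
        result.size = st.st_size
        result.valid |= WANT_SIZE
    return result
```

A program like "ls" without "-l" would pass a mask that leaves out the size and so never trigger the cluster-wide work.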
The actual form of the system call is likely to be determined when somebody follows Christoph Hellwig's advice to "send a bloody patch."
The final topic of the day was union mounts. Valerie Aurora, who led this session, recently wrote an article about union filesystems and the associated problems for LWN. The focus of this session was the readdir() system call in particular. POSIX requires that readdir() provide a position within a directory which can be used by the application at any future time to return to the same spot and resume reading directory entries. This requirement is hard for any contemporary filesystem to meet. It becomes almost impossible for union filesystems, which, by definition, are presenting a combination of at least two other filesystems.
The solution that Valerie was proposing was to simply recreate directories in the top (writable) layer of the union. The new directories would point to files in the appropriate places within the union and would have whiteouts applied. That would eliminate the need to mix together directory entries from multiple layers later on, and the readdir() problem would collapse back to the single-filesystem implementation. At least, that holds true for as long as none of the lower-level filesystems in the union change. Valerie proposes that these filesystems be forced to be read-only, with an unmount required before they could be changed.
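The merge that would be performed once, when the directory is recreated in the top layer, can be sketched like this. The function and its arguments are assumptions for illustration and say nothing about how the entries or whiteouts would be stored on disk.

```python
def union_listing(top_entries, top_whiteouts, lower_layers):
    """Build the merged directory listing for a union mount.

    top_entries:   names present in the top (writable) layer
    top_whiteouts: names deleted in the union, hiding lower-layer entries
    lower_layers:  list of name collections, highest-priority layer first
    Returns {name: layer index}, where 0 is the top layer.
    """
    merged = {name: 0 for name in top_entries}
    for depth, names in enumerate(lower_layers, start=1):
        for name in names:
            if name in top_whiteouts:
                continue                    # deleted in the union
            merged.setdefault(name, depth)  # a higher layer wins
    return merged
```

Once the result lives in a single directory on the top layer, readdir() positions are those of one ordinary filesystem, as long as the lower layers stay unchanged.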
The good news is that this is how BSD union mounts have worked for a long time.
The bad news is that there's one associated problem: inode number stability. NFS servers are expected to provide stable inode numbers to clients even across reboots. But copying a file entry up to the top level of a union will change its inode number, confusing NFS clients. One possible solution to this problem is to simply decree that union mounts cannot be exported via NFS. It's not clear that there is a plausible use case for this kind of export in any case. The other solution is to just let the inode number change. That could lead to different NFS clients having open file descriptors to different versions of the file, but so be it. The consensus seemed to lean toward the latter solution.
And that is where the workshop concluded. Your editor will be attending most of the second and final day (minus a brief absence for a cameo appearance at the Embedded Linux Conference); a report from that day will be posted shortly thereafter.
Index entries for this article | |
---|---|
Kernel | Block layer |
Kernel | Filesystems/Workshops |
Conference | Storage and Filesystem Workshop/2009 |
Posted Apr 7, 2009 17:37 UTC (Tue)
by thornhill (guest, #57198)
[Link] (2 responses)
Posted Apr 7, 2009 17:41 UTC (Tue)
by corbet (editor, #1)
[Link] (1 responses)
Posted Apr 9, 2009 8:56 UTC (Thu)
by nikanth (guest, #50093)
[Link]
Posted Apr 7, 2009 18:19 UTC (Tue)
by mezcalero (subscriber, #45103)
[Link] (3 responses)
Posted Apr 7, 2009 20:18 UTC (Tue)
by njs (guest, #40338)
[Link] (2 responses)
Posted Apr 8, 2009 1:33 UTC (Wed)
by mezcalero (subscriber, #45103)
[Link] (1 responses)
In addition, the POSIX aio doesn't allow such 'early' returns. (POSIX aio is an awful API anyway, with all those signals)
Posted Apr 8, 2009 6:38 UTC (Wed)
by xoddam (subscriber, #2322)
[Link]
Posted Apr 7, 2009 18:22 UTC (Tue)
by NAR (subscriber, #1313)
[Link] (30 responses)
Posted Apr 7, 2009 19:34 UTC (Tue)
by mjthayer (guest, #39183)
[Link] (3 responses)
Posted Apr 7, 2009 20:35 UTC (Tue)
by khim (subscriber, #9252)
[Link] (2 responses)
Think NGINX. If you have 50'000
clients connected at the same time and your box has 100 separate disks
with content (not an unrealistic example - NGINX does have support for such
extreme conditions) then AIO is suddenly much faster than threaded I/O and
MUCH less resource-hungry.
Posted Apr 8, 2009 1:16 UTC (Wed)
by xoddam (subscriber, #2322)
[Link] (1 responses)
If one thread per outstanding operation or per client is too many, there are good userspace thread pool implementations that dedicate a few threads to waiting for IO completions whilst others get on with whatever work can proceed immediately.
I'm not convinced that pushing the thread pool down into the kernel is a performance win.
The Linux thread implementation chose for very good reasons to stick to a 1:1 relationship between userspace and kernel threads: it's because the job of multiplexing application tasks to a smaller number of system threads is hard to do in a generic way. All the choices are best made by the application developer, therefore thread pool implementations belong in userspace.
I don't really see the point of supporting POSIX signal-driven AIO at the kernel level if the implementation uses threads and sits on top of the existing synchronous IO. A userspace library could do it just as reliably using select() and kill(), for those few applications that insist on the POSIX AIO interface for whatever reason.
That said, the kernel handles asynchronous events all the time. Why exactly is it so hard to let userspace handle them asynchronously too at a low level, without going through the synchronous layer?
Posted Apr 8, 2009 12:46 UTC (Wed)
by khim (subscriber, #9252)
[Link]
While it's very important to have "light AIO" for some things (like TCP
sockets) it's not so important to have it for other things (like
in-memory pipes). If you have everything in the kernel you can implement some
things with threads and others without threads and userspace does not care.
With a userspace library any change requires an ABI change - and that's
PAIN...
Posted Apr 7, 2009 20:15 UTC (Tue)
by lmb (subscriber, #39048)
[Link] (1 responses)
An event-driven FSA would benefit greatly from this; not everyone buys into multi-threaded paradigms. For some scenarios, this would make it possible to simplify the user-space implementation significantly.
Posted Apr 20, 2009 9:59 UTC (Mon)
by forthy (guest, #1525)
[Link]
I don't understand why there was so much objection against the syslets
- send the kernel a bunch of "IO instructions", and let it execute those
asynchronously. Passing active messages (that's what it is) is a good
idea, anyway; especially for networks like NFS4, where each "kernel call"
is quite heavy. Syslets would scale a lot better (lower load, less context
switches) than synchronous IO. Active message systems often had problems
with programmers who did not understand them (like Display Postscript), so
I guess this problem comes up again. It is not just a quality of
implementation issue, it is a fundamental quality of understanding
issue. This overall doesn't sound good. With Ted Ts'o, it's even worse: He
doesn't get it. It is not an option for a "safe" filesystem, which already
takes a performance penalty by maintaining a journal, to corrupt data. It
is an option to delay writing, and in effect, the 5 seconds update in ext3
is not what solved the problem, it is writing ordered. From an
application writer point of view, this is a quality of implementation
issue, but when I read the arguments, it's again an understanding problem.
I'm concerned; maybe it is that those hard-core Linux hackers have been
there for 20 years and are still sticking to 90s state-of-the-art?
Posted Apr 7, 2009 22:00 UTC (Tue)
by dankamongmen (subscriber, #35141)
[Link]
High-performance servers want to spawn a thread per allocated core, and have each thread fully exercising that CPU. That's why AIO can/must beat synchronous I/O (blocking or non; that's immaterial here) -- your thread can go on managing events (of course, if the CPU is necessary for the AIO to be performed, your thread won't run anyway, but the CPU can sometimes be avoided).
Posted Apr 8, 2009 0:29 UTC (Wed)
by bojan (subscriber, #14302)
[Link] (22 responses)
In other words, no new calls would need to be introduced into the kernel, the apps would be portable and safe, interactivity would be preserved (this is an async interface), one would not need to use extra threads (signal can be delivered instead) and disk would not need to spin up to commit the file right away.
At the same time, regular fsync() would still mean "really commit now", so databases and similar software could use it safely even in laptop mode.
Posted Apr 8, 2009 4:19 UTC (Wed)
by jamesh (guest, #1159)
[Link] (21 responses)
Your aio_fsync() suggestion would give a significantly different result to fbarrier(). Consider a function that wrote a file and renamed it using the hypothetical fbarrier(): When this function completes, any process reading the file will get the new contents. The changes to the underlying block device could be delayed, but if there is a crash "filename" should either give the old contents or the new. Using aio_fsync() as you suggest would keep the same crash resilience behaviour, but would provide entirely different runtime behaviour. As the signal would likely be delivered after the function returns, the rename won't have happened at that point. So a read of "filename" will return the old file contents for some unknown period of time after the function returns.
Posted Apr 8, 2009 4:51 UTC (Wed)
by butlerm (subscriber, #13312)
[Link] (1 responses)
Posted Apr 8, 2009 4:56 UTC (Wed)
by bojan (subscriber, #14302)
[Link]
Posted Apr 8, 2009 4:54 UTC (Wed)
by bojan (subscriber, #14302)
[Link]
That's why you'd have to hang on to your config in buffers until last unique temp file has overwritten the actual file, which is something that programs like gconfd can do easily. At that point, the buffers would get dumped and read of the real config file would be required next time.
Normally, we are talking about a few seconds to maybe half a minute of such behaviour here (i.e. either the amount of time it takes to finish immediate fsync or the next regular kernel commit). Programs that overwrite the same config file many, many times within half a minute period are really broken, so this should generally not be an issue.
PS. The point of this whole thing with aio_fsync() is to show that there can be many different approaches to address this issue. Sure, it would require more sophisticated code, but it can be done. If we had inotify with IN_SYNC event, we could use that too in userland to play with backup files and achieve the desired result (and in that case, read of the renamed file would always give the latest config - instead programs would have to rename foo~ into foo if they found one at startup, which would signal a crash).
PPS. As you probably noticed from my previous posts regarding ext4/POSIX, I'd be very interested to have fbarrier(), because I think we need to have a clean, new way of saying this through an API.
Posted Apr 8, 2009 4:55 UTC (Wed)
by butlerm (subscriber, #13312)
[Link]
Posted Apr 8, 2009 5:02 UTC (Wed)
by butlerm (subscriber, #13312)
[Link] (16 responses)
Posted Apr 8, 2009 6:02 UTC (Wed)
by bojan (subscriber, #14302)
[Link] (15 responses)
Absolutely. No question about it. It's just that Linus is not keen on having it, so that got me thinking as to how the same can be done without a new call. Of course, many of the thoughts are, shall we say, ill conceived... :-)
> it has completely different user visible semantics
One could also do this:
1. See if "foo~" exists.
In signal handler/thread created on sigevent do:
1. Unlink "foo~".
Then you get full rename semantics with always up to date file (i.e. what I was getting at with inotify example). However, your app then needs to check at startup if "foo~" exists (which means you crashed before the signal handler/thread unlinked your backup) and if it does, rename it to "foo". Then, continue.
Posted Apr 8, 2009 6:17 UTC (Wed)
by bojan (subscriber, #14302)
[Link]
That is when signal/thread counter reached zero.
Posted Apr 8, 2009 7:40 UTC (Wed)
by bojan (subscriber, #14302)
[Link] (3 responses)
Quick and dirty - probably has more bugs than lines, but you'll get the picture. Compile and link with: gcc -Wall -O2 -g -o a a.c -lrt
Posted Apr 8, 2009 8:03 UTC (Wed)
by bojan (subscriber, #14302)
[Link]
Posted Apr 9, 2009 4:17 UTC (Thu)
by pr1268 (subscriber, #24648)
[Link] (1 responses)
This is going way off-topic, but why all the runtime allocations in your sample program? Malloc(3), calloc(3), and free(3) are horribly expensive, relatively speaking. Automatic/static storage for those structs and that char buffer would be substantially faster.
Posted Apr 9, 2009 6:14 UTC (Thu)
by bojan (subscriber, #14302)
[Link]
Posted Apr 8, 2009 8:30 UTC (Wed)
by jamesh (guest, #1159)
[Link] (1 responses)
Also, readers would need to differentiate between the case of "foo~" existing because the system crashed and "foo~" existing because some other process is in the process of replacing "foo" and waiting on the fsync.
Posted Apr 8, 2009 11:02 UTC (Wed)
by bojan (subscriber, #14302)
[Link]
Posted Apr 10, 2009 4:53 UTC (Fri)
by butlerm (subscriber, #13312)
[Link] (7 responses)
That is what ext4 (and apparently XFS) do in data=writeback mode when
This solution, as it turns out, is very similar to the practice of keeping
Posted Apr 10, 2009 5:26 UTC (Fri)
by bojan (subscriber, #14302)
[Link] (6 responses)
Not entirely true, actually. Imagine two processes reading the same file "foo". After they read it, they make the changes in memory, write them out to "foo.new" and then rename into "foo". Which changes will persist? From the first or the second process?
You have to have some kind of synchronisation to do this (flock(), semaphore etc.). Which can also be applied to the example with "foo~" files to sync access. That's why Gnome has a daemon (i.e. single process) to manage all these changes.
PS. Of course, fbarrier() is still a much better solution, cleaner etc., but you cannot just say that multiple processes can do this as they please.
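The lost-update race described above can be avoided by serialising the whole read/modify/write/rename cycle. The following is only a minimal sketch, assuming an advisory flock() on a separate lock file and a file that holds a single integer; the function name `locked_add` and its file-name parameters are made up for illustration and do not come from this thread.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/file.h>
#include <unistd.h>

/* Read an integer from `path`, add `delta`, write the result to `tmp` and
 * rename it over `path`, all while holding an exclusive flock() on `lock`.
 * A missing or unparsable `path` counts as 0.
 * Stores the new value in *out; returns 0 on success, -1 on error. */
static int locked_add(const char *path, const char *tmp, const char *lock,
                      long delta, long *out)
{
    assert(path && tmp && lock && out);

    int lfd = open(lock, O_RDWR | O_CREAT, 0644);
    if (lfd == -1)
        return -1;
    if (flock(lfd, LOCK_EX) == -1) {   /* blocks until we own the lock */
        close(lfd);
        return -1;
    }

    /* Read the current contents while no other cooperating writer can run. */
    long value = 0;
    FILE *in = fopen(path, "r");
    if (in) {
        if (fscanf(in, "%ld", &value) != 1)
            value = 0;
        fclose(in);
    }
    value += delta;

    /* Write the new version, flush it to disk, then rename it into place. */
    int rc = -1;
    FILE *f = fopen(tmp, "w");
    if (f) {
        int ok = fprintf(f, "%ld\n", value) > 0;
        ok = (fflush(f) == 0) && ok;
        ok = (fsync(fileno(f)) == 0) && ok;
        ok = (fclose(f) == 0) && ok;
        if (ok && rename(tmp, path) == 0) {
            *out = value;
            rc = 0;
        }
    }

    flock(lfd, LOCK_UN);
    close(lfd);
    return rc;
}
```

The lock is advisory, so it only helps if every writer goes through the same function; a process that renames over "foo" without taking the lock can still stomp on the data.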
Posted Apr 10, 2009 5:49 UTC (Fri)
by butlerm (subscriber, #13312)
[Link] (5 responses)
Ext4 does a certain amount of comparable undo already - if the replacement
Posted Apr 10, 2009 7:35 UTC (Fri)
by bojan (subscriber, #14302)
[Link] (4 responses)
I know. What I'm talking about is synchronisation between processes in terms of contents of data (i.e. one process may write a change, which gets lost when another process does the same - your stock race). So, you cannot just open(), write(), close(), rename() with multiple processes. You have to lock, otherwise your processes will stomp all over each other's data. An example of doing the same with multiple processes when the kernel doesn't guarantee data before metadata on rename is below. Bugs included, of course ;-).
Posted Apr 10, 2009 15:34 UTC (Fri)
by butlerm (subscriber, #13312)
[Link]
Posted Apr 11, 2009 10:22 UTC (Sat)
by bojan (subscriber, #14302)
[Link]
Posted Apr 11, 2009 12:06 UTC (Sat)
by bojan (subscriber, #14302)
[Link] (1 responses)
Posted Apr 12, 2009 4:52 UTC (Sun)
by bojan (subscriber, #14302)
[Link]
A more robust version below:
Posted Apr 7, 2009 20:46 UTC (Tue)
by ncm (guest, #165)
[Link] (7 responses)
Posted Apr 7, 2009 21:57 UTC (Tue)
by lmb (subscriber, #39048)
[Link]
Posted Apr 7, 2009 22:10 UTC (Tue)
by Chousuke (subscriber, #54562)
[Link] (4 responses)
Posted Apr 8, 2009 1:49 UTC (Wed)
by mjg59 (subscriber, #23239)
[Link] (3 responses)
Posted Apr 8, 2009 13:33 UTC (Wed)
by vaurora (subscriber, #38407)
[Link] (2 responses)
Posted Apr 8, 2009 13:44 UTC (Wed)
by mjg59 (subscriber, #23239)
[Link] (1 responses)
Posted Apr 14, 2009 7:10 UTC (Tue)
by dlang (guest, #313)
[Link]
so sync was not enough; sync;sync would do the job, but sync;sync;sync is what people ended up using.
this could be yet another myth, but it seems to match the facts that I have run across on the topic
Posted Apr 7, 2009 23:34 UTC (Tue)
by bojan (subscriber, #14302)
[Link]
Posted Apr 7, 2009 21:40 UTC (Tue)
by zdzichu (subscriber, #17118)
[Link]
Posted Apr 7, 2009 22:05 UTC (Tue)
by dberkholz (guest, #23346)
[Link] (2 responses)
Posted Apr 8, 2009 1:11 UTC (Wed)
by ewan (subscriber, #5533)
[Link] (1 responses)
Posted Apr 8, 2009 4:50 UTC (Wed)
by jamesh (guest, #1159)
[Link]
A union performed on the server would only result in local IO when copying the file between layers.
I'm not sure how much of a difference this would make though, since it'd only really hit renames of files in the base layer or partial modification of files. Neither the "truncate and overwrite" nor the "write to a temporary file and rename over old" method of writing files would show much difference between server and client side unions.
Posted Apr 7, 2009 22:18 UTC (Tue)
by kjp (guest, #39639)
[Link]
Posted Apr 8, 2009 0:05 UTC (Wed)
by bojan (subscriber, #14302)
[Link]
We got a whole bunch of latency improvements on the back of this as well. fsync(), at least in writeback mode of ext3, should not kill the system for a number of seconds any more, even with an evil dd running in the background:
Posted Apr 8, 2009 1:22 UTC (Wed)
by pr1268 (subscriber, #24648)
[Link] (6 responses)
"[After writing the above, your editor noticed that Linus had just merged a change to make data=writeback the default for ext3 in 2.6.30. Your editor, it seems, is easily surprised.]"
I'm surprised, also. And a little disappointed. After all, data=ordered has worked fine for, what, 7 1/2 years now? If it ain't broke, don't fix it. I'll be modifying my fstab files accordingly. </grumble>
Posted Apr 8, 2009 1:38 UTC (Wed)
by bojan (subscriber, #14302)
[Link] (5 responses)
tune2fs -o journal_data_ordered
Posted Apr 8, 2009 3:44 UTC (Wed)
by pr1268 (subscriber, #24648)
[Link] (4 responses)
Thanks! But, I'm curious: If I specify journal_data_ordered in tune2fs(8), and I put data=writeback in /etc/fstab, which mode actually gets used?
Posted Apr 8, 2009 3:57 UTC (Wed)
by bojan (subscriber, #14302)
[Link]
-o [^]mount-option[,...]
I take this to mean that if you have "defaults" in /etc/fstab, then the FS option can specify what is the default for that particular file system. I would think the default would get overridden by an explicit mount option from /etc/fstab though.
But, that's easy to check. Just set the options on the FS level and/or fstab, mount and check dmesg to see what really happened.
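Besides dmesg, the options a filesystem was actually mounted with can be read back from the kernel's mount table. The sketch below assumes glibc's getmntent() interface and a table in fstab format such as /proc/mounts; the helper name `mount_options` is hypothetical, and on some kernels the data= mode may be left out of the listing when it matches the default.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <mntent.h>
#include <stdio.h>
#include <string.h>

/* Look up `mountpoint` in the mount table `table` (normally "/proc/mounts")
 * and copy its option string into `opts`.
 * Returns 0 if the mount point was found, -1 otherwise. */
static int mount_options(const char *table, const char *mountpoint,
                         char *opts, size_t size)
{
    assert(table && mountpoint && opts && size > 0);

    FILE *f = setmntent(table, "r");
    if (!f)
        return -1;

    int rc = -1;
    struct mntent *m;
    while ((m = getmntent(f)) != NULL) {
        if (strcmp(m->mnt_dir, mountpoint) == 0) {
            /* Keep scanning: a later entry shadows an earlier one. */
            snprintf(opts, size, "%s", m->mnt_opts);
            rc = 0;
        }
    }
    endmntent(f);
    return rc;
}
```

Calling `mount_options("/proc/mounts", "/", buf, sizeof buf)` after a remount would then show which of the tune2fs default and the fstab option won.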
Posted Apr 8, 2009 5:11 UTC (Wed)
by butlerm (subscriber, #13312)
[Link]
Posted Apr 8, 2009 5:18 UTC (Wed)
by butlerm (subscriber, #13312)
[Link] (1 responses)
"-o [mount_option,...]
Set or clear the indicated default mount options in the filesys-
There are really three "defaults" of course. The filesystem implementation
Posted Apr 8, 2009 7:13 UTC (Wed)
by bojan (subscriber, #14302)
[Link]
Posted Apr 10, 2009 5:24 UTC (Fri)
by vaurora (subscriber, #38407)
[Link]
Posted Apr 12, 2009 14:42 UTC (Sun)
by mdz@debian.org (guest, #14112)
[Link]
Ubuntu does not yet use ext4 by default in any version, including the "Jaunty" development series soon to become Ubuntu 9.04.
Jaunty does offer ext4 as a non-default option to users who partition their disks manually, as a convenient way for more adventurous users to start testing it.
Fedora 11, I believe, will install using ext4 by default.
Posted Apr 13, 2009 22:49 UTC (Mon)
by tytso (subscriber, #9993)
[Link] (1 responses)
Another file system developer who had worked on two major filesystems (ext4 and XFS) had a t-shirt on that had O_PONIES written on the front. And the joker who distributed the colouring book pages with pictures of ponies was another file system developer working yet another next generation file system.
Application programmers, while they were questioning my competence, judgement, and even my paternity, didn't quite believe me when I told them that I was the moderate on these issues, but it's safe to say that most of the file system developers in the room were utterly unsympathetic to the idea that it was a good idea to encourage application programmers to avoid the use of fsync(). About the only one who was also a moderate in the room was Val Aurora (formerly Henson). Both of us recognize that ext3's data=ordered mode was responsible for people deciding that fsync() was harmful, and I've said already that if we had known how badly it would encourage application writers to Do The Wrong Thing, I would have pushed hard not to make data=ordered the default. Unfortunately, memory wasn't as plentiful in those days, and so the associated page writeback latencies weren't nearly as bad ten years ago.
Posted Apr 14, 2009 0:06 UTC (Tue)
by pr1268 (subscriber, #24648)
[Link]
Perhaps I'm a little confused. "Both of us recognize that ext3's data=ordered mode was responsible for people deciding that fsync() was harmful." Is your use of the word "harmful" implying a performance hit only? I'm convinced that fsync() is (relatively) safe and reliable, all other discussions here, there, anywhere, and in POSIX, aside. The O_PONIES bit (and related T-shirt) was an interesting bit of humor - I like it!
Posted Apr 18, 2009 3:53 UTC (Sat)
by roelofs (guest, #2599)
[Link]
"But at least there is symmetry."*
*Ponies -> beast of burden -> ObZathrasQuote... Sorry, too obscure? :-)
Posted Apr 19, 2009 12:58 UTC (Sun)
by oak (guest, #2786)
[Link]
For example mainline already has file systems where some of the stat()
I think nowadays applications should consider stat() mostly as a "best
Linux Storage and Filesystem workshop, day 1
I wouldn't look for a lot of PDF files. It's all very discussion-oriented, so there's not much in the way of presentations.
Presentations
Linux Storage and Filesystem workshop, day 1
Using threads for AIO? Best do it in userspace.
Async I/O
They are
Why does AIO need to be supported at the kernel level?
I think the idea is to support AIO for all objects with a single API
Async I/O
fp = open("filename.tmp")
write(fp, "data", length)
fbarrier(fp)
close(fp)
rename("filename.tmp", "filename")
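Since fbarrier() is only a proposal, the closest thing that can be run today is the same sequence with fsync() in its place. This is a minimal sketch under that assumption: it gives durability as well as ordering, and pays for it by blocking, which is exactly the cost the barrier idea is meant to avoid. The function name `replace_file` is made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Replace `path` with `len` bytes from `data`, going through the temporary
 * file `tmp`. fsync() stands in for the proposed fbarrier().
 * Returns 0 on success, -1 on error. */
static int replace_file(const char *path, const char *tmp,
                        const void *data, size_t len)
{
    assert(path != NULL && tmp != NULL);

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1)
        return -1;

    /* write() may be partial, so loop until everything is out. */
    const char *p = data;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n == -1) {
            close(fd);
            unlink(tmp);
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }

    /* Force the new contents to stable storage before the rename. */
    if (fsync(fd) == -1) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    if (close(fd) == -1) {
        unlink(tmp);
        return -1;
    }

    /* Atomically put the new file in place. */
    if (rename(tmp, path) == -1) {
        unlink(tmp);
        return -1;
    }
    return 0;
}
```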
Async I/O
specification allows a null implementation of fsync. If one is not
concerned about a system crash or unclean shutdown, there is no need to
call fsync, aio_fsync, fbarrier or any other comparable function.
Async I/O
overhead mostly. Either way, the rename is atomic and immediately visible
to all user processes. What it isn't is necessarily durable, or safe.
Async I/O
problems of the solution suggested by the parent poster, namely delaying
the rename until after the aio_fsync has completed. Aside from the
complexity and overhead issues, it has completely different user visible
semantics. That is fine if no other process needs to read the new version
of the file in the meantime, otherwise it is problematic. fbarrier would
be a much cleaner solution.
Async I/O
2. If it doesn't, do link("foo","foo~") (i.e. create "backup").
3. Open "foo".
4. Read "foo".
5. Open/create/truncate "foo.new".
6. Write into "foo.new".
7. Call aio_fsync() on "foo.new". <-- doesn't block
8. Close "foo.new".
9. Rename "foo.new" into "foo".
Async I/O
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <time.h>
#include <signal.h>
#include <aio.h>
#define BUF_SIZE 50
static int count=0;
void whack(int signum,siginfo_t *info,void *context){
if(!--count)
unlink("foo~");
}
int main(int argc,char **argv){
int fd;
ssize_t rl;
char *buf=malloc(BUF_SIZE);
struct aiocb *cb=calloc(1,sizeof(*cb));
struct sigevent *se=calloc(1,sizeof(*se));
struct sigaction *act=calloc(1,sizeof(*act));
/* XXX this is just a demo, no error checking */
/* AIO control block defaults */
cb->aio_sigevent.sigev_notify=SIGEV_SIGNAL;
cb->aio_sigevent.sigev_signo=SIGRTMIN;
/* signal handler */
act->sa_flags=SA_SIGINFO;
act->sa_sigaction=whack;
sigaction(SIGRTMIN,act,NULL);
/* see if foo~ exists and restore */
if(!access("foo~",F_OK|R_OK|W_OK))
rename("foo~","foo");
/* back it up if required */
if(access("foo~",F_OK|R_OK|W_OK))
link("foo","foo~");
/* read existing file */
fd=open("foo",O_RDONLY);
rl=read(fd,buf,BUF_SIZE);
close(fd);
/* write to new file and initiate sync */
fd=open("foo.new",O_WRONLY|O_CREAT|O_TRUNC,S_IRUSR|S_IWUSR|S_IRGRP|S_IROTH);
write(fd,buf,rl);
cb->aio_fildes=fd;
count++;
aio_fsync(O_SYNC,cb);
close(fd);
/* rename new file into the existing one */
rename("foo.new","foo");
free(act);
free(se);
free(cb);
free(buf);
return 0;
}
Async I/O
Why all the runtime allocations? (off-topic)
Async I/O
be doing already. There is a relatively simple solution to this that I
have mentioned a few times that is applicable to virtually any journalled
filesystem that has none of the performance cost of falling back to
data=ordered mode every time someone wants to do a rename replacement.
rename safety is enabled - force all the data for the file to be renamed to
disk before the next metadata transaction can complete. That means that
*every* outstanding fsync operation is delayed while your multi-gigabyte
ISO file finishes being committed to disk.
tilde files. It is just that the filesystem does it automatically and
invisibly, restoring the old version on recovery whenever the new version
didn't finish getting committed to disk. No threads, signal handlers, etc.
required. No problems with multiple process access. No application level
code to figure out whether a version is corrupt. No browser freeze ups.
Rename undo is the way to avoid all that, with little or no performance
cost.
Async I/O
Kernel based rename undo
replacement an undo entry is placed in the journal and the old inode is
kept around until the new inode's data is committed to disk. Then in the
case of an unclean shutdown the filesystem recovery process rolls forward
using the journal and uses the undo entries in the journal to build a rename
undo candidate list. When the journal redo is complete, the filesystem
then uses the rename undo list to undo the rename replacements whenever the
replacement inode's data was not committed before the system crashed.
file was not committed to disk, the allocated blocks are freed and the
filesystem truncates the file. What I suggest is not much more complicated
than that.
Kernel based rename undo
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <signal.h>
#include <aio.h>
#define BUF_SIZE 50
static int *count=NULL;
/* XXX this is just a demo, no error checking */
static void whack(int signum,siginfo_t *info,void *context){
int sd=*(int*)info->si_value.sival_ptr;
/* critical section */
lockf(sd,F_LOCK,0);
if(!--(*count))
unlink("foo~");
/* end critical section */
lockf(sd,F_ULOCK,0);
}
int main(int argc,char **argv){
int sd,fd;
ssize_t len;
char buf[BUF_SIZE];
struct aiocb cb;
const struct aiocb *cbl[]={&cb};
struct sigaction act;
/* AIO control block setup */
memset(&cb,0,sizeof(cb));
cb.aio_sigevent.sigev_notify=SIGEV_SIGNAL;
cb.aio_sigevent.sigev_signo=SIGRTMIN;
cb.aio_sigevent.sigev_value.sival_ptr=&sd;
/* signal handler setup */
memset(&act,0,sizeof(act));
act.sa_flags=SA_SIGINFO;
act.sa_sigaction=whack;
sigaction(SIGRTMIN,&act,NULL);
/* setup shared counter, restore */
if((sd=shm_open("foo",O_RDWR|O_CREAT|O_EXCL,S_IRUSR|S_IWUSR))==-1){
int tries=20;
struct stat s;
/* not the first to arrive, open and wait for counter to be written */
sd=shm_open("foo",O_RDWR,S_IRUSR|S_IWUSR);
fstat(sd,&s);
while(tries-- && s.st_size<sizeof(*count)){
sleep(1);
fstat(sd,&s);
}
/* something's really screwed */
if(s.st_size<sizeof(*count)) /* counter never appeared; tries ends at -1, so !tries missed the timeout */
return 1;
} else{ /* first to arrive, restore */
int count=0; /* filler */
/* don't care if we fail */
if(!rename("foo~","foo"))
fprintf(stderr,"Restored.\n");
write(sd,&count,sizeof(count));
}
/* shared counter */
count=mmap(NULL,sizeof(int),PROT_READ|PROT_WRITE,MAP_SHARED,sd,0);
/* critical section */
lockf(sd,F_LOCK,0);
/* don't care if it fails - already there */
link("foo","foo~");
/* read existing file */
fd=open("foo",O_RDONLY);
len=read(fd,buf,BUF_SIZE);
close(fd);
/* write to new file and initiate sync */
fd=open("foo.new",O_WRONLY|O_CREAT|O_TRUNC,S_IRUSR|S_IWUSR|S_IRGRP|S_IROTH);
write(fd,buf,len);
cb.aio_fildes=fd;
(*count)++;
aio_fsync(O_SYNC,&cb);
close(fd);
/* put the new file in place */
rename("foo.new","foo");
/* end critical section */
lockf(sd,F_ULOCK,0);
/* do something really useful here */
/* wait for AIO completion */
aio_suspend(cbl,1,NULL);
/* clean up shared memory */
munmap(count,sizeof(int));
close(sd);
return 0;
}
Kernel based rename undo
to single writer / multiple readers, which is a far more common situation.
Kernel based rename undo
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <signal.h>
#include <aio.h>
#include <errno.h>
/* XXX this is just a demo, no error checking */
static int sd=-1;
static int count=0;
static char filler[2]={0,0};
/* locks */
static struct flock
fwl={.l_type=F_WRLCK,.l_whence=SEEK_SET,.l_start=0,.l_len=1},
ful={.l_type=F_UNLCK,.l_whence=SEEK_SET,.l_start=0,.l_len=1},
bwl={.l_type=F_WRLCK,.l_whence=SEEK_SET,.l_start=1,.l_len=1},
brl={.l_type=F_RDLCK,.l_whence=SEEK_SET,.l_start=1,.l_len=1},
bul={.l_type=F_UNLCK,.l_whence=SEEK_SET,.l_start=1,.l_len=1};
static void aiodone(int signum,siginfo_t *info,void *context){
/* signal counter down */
count--;
}
#define BUF_SIZE 50
static void config(struct aiocb *cb){
int fd;
ssize_t len;
char buf[BUF_SIZE];
/* critical section */
while(fcntl(sd,F_SETLKW,&fwl));
/* don't care if it fails, any version is OK */
link("foo","foo~");
/* read existing file */
fd=open("foo",O_RDONLY);
len=read(fd,buf,BUF_SIZE);
close(fd);
/* write to new file */
fd=open("foo.new",O_WRONLY|O_CREAT|O_TRUNC,S_IRUSR|S_IWUSR|S_IRGRP|S_IROTH);
write(fd,buf,len);
/* AIO control block setup */
memset(cb,0,sizeof(*cb));
cb->aio_sigevent.sigev_notify=SIGEV_SIGNAL;
cb->aio_sigevent.sigev_signo=SIGRTMIN;
cb->aio_fildes=fd;
/* signal counter up */
count++;
/* initiate sync and close */
aio_fsync(O_SYNC,cb);
close(fd);
/* put the new file in place */
rename("foo.new","foo");
/* end critical section */
while(fcntl(sd,F_SETLKW,&ful));
}
#define LOOPS 10
#define TRIES 20
int main(int argc,char **argv){
int i;
struct aiocb cb[LOOPS];
struct sigaction act;
/* setup shared file, restore */
if((sd=shm_open("foo",O_RDWR|O_CREAT|O_EXCL,S_IRUSR|S_IWUSR))==-1){
int tries=TRIES;
struct stat f;
/* not the first to arrive, open and wait for restore */
sd=shm_open("foo",O_RDWR,S_IRUSR|S_IWUSR);
fstat(sd,&f);
while(tries-- && f.st_size<sizeof(filler)){
sleep(1);
fstat(sd,&f);
}
/* something's really screwed */
if(f.st_size<sizeof(filler)) /* lock file never filled in; tries ends at -1, so !tries missed the timeout */
return 1;
} else{ /* first to arrive, restore */
/* don't care if we fail */
if(!rename("foo~","foo"))
fprintf(stderr,"Restored.\n");
/* setup lock file */
write(sd,&filler,sizeof(filler));
}
/* signal handler setup */
memset(&act,0,sizeof(act));
act.sa_flags=SA_SIGINFO;
act.sa_sigaction=aiodone;
sigaction(SIGRTMIN,&act,NULL);
/* we need the backup file to be there */
while(fcntl(sd,F_SETLKW,&brl));
/* program may run config many times */
for(i=0;i<LOOPS;i++){
config(&cb[i]);
/* do something really useful here */
}
/* wait for AIO completion */
while(count)
sleep(1);
/* unlock the backup file */
while(fcntl(sd,F_SETLKW,&bul));
/* try to remove backup file */
if(!fcntl(sd,F_SETLK,&fwl)){
if(!fcntl(sd,F_SETLK,&bwl)){
unlink("foo~");
while(fcntl(sd,F_SETLKW,&bul));
}
while(fcntl(sd,F_SETLKW,&ful));
}
/* clean up shared memory */
close(sd);
return 0;
}
fsync and scripts
sysvinit-2.86-148.1
fsync and scripts
Actually, sync() is POSIX, but you are correct that the definition does not specify that the data is on disk by the time it returns, just that it's been scheduled to be written out. However, if you restrict your scripts to Linux and other sane operating systems, you will get the expected behavior modulo disk caching. From the sync(2) man page:
fsync and scripts
According to the standard specification (e.g., POSIX.1-2001), sync() schedules the writes,
but may return before the actual writing is done. However, since version 1.3.20 Linux
does actually wait. (This still does not guarantee data integrity: modern disks have
large caches.)
UFS on Solaris is the only case I know of that actually takes advantage of this hole and returns before the data hits disk. The last I heard, ZFS did the sane thing and waited.
fsync and scripts
Generic RAID layer
Linux Storage and Filesystem workshop, day 1
One possible solution to this problem is to simply decree that union mounts cannot be exported via NFS. It's not clear that there is a plausible use case for this kind of export in any case.
Imagine an LTSP server or another cluster fileserver that wants to ship out a shared base filesystem with host- or class-specific "overlays" for different roles.
Linux Storage and Filesystem workshop, day 1
read-only base and the host specific overlay and unify them on the
client, not union mount the two on the server and export the result.
Linux Storage and Filesystem workshop, day 1
journal_data_ordered
Set or clear the indicated default mount options in the filesystem.
journal_data_ordered
in the absence of any user specified mount options. In other words, if you
specify data=writeback on any supporting filesystem, that is what you will
get.
journal_data_ordered
tem. Default mount options can be overridden by mount options
specified either in /etc/fstab(5) or on the command line argu-
ments to mount(8). Older kernels may not support this feature;"
level, the filesystem image level, and the /etc/fstab level, in that order.
journal_data_ordered
Linux Storage and Filesystem workshop, day 1
Things worked for some time until Ubuntu users started testing the alpha "Jaunty" release, which uses ext4 by default.
Correction (not sure if this is for Jon or Ted):
Linux Storage and Filesystem workshop, day 1
The kernel currently contains two software RAID implementations, found in the MD and device mapper (DM) subsystems.
Linux Storage and Filesystem workshop, day 1
completely correct at the moment when it's requested?
information is never correct:
* Number of used blocks is correct only on block based file systems (i.e.
not in JFFS2, UBIFS etc)
* File size doesn't correspond to how much space the file takes from the
file system if file system uses compression
* Time information depends on mount options
* Is st_dev correct for union mounts of union file systems?
guess"...
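The gap between file size and space used is easy to see from user space. Here is a small sketch, assuming Linux, where st_blocks is counted in 512-byte units; the helper name `file_sizes` is made up for illustration. On a compressing or non-block-based filesystem the second number is only what the filesystem chooses to report.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>

/* Report the apparent size (st_size) and the allocated size
 * (st_blocks * 512, the Linux convention) of `path`.
 * Returns 0 on success, -1 if stat() fails. */
static int file_sizes(const char *path, long long *apparent,
                      long long *allocated)
{
    assert(path && apparent && allocated);

    struct stat st;
    if (stat(path, &st) == -1)
        return -1;

    *apparent = (long long)st.st_size;
    *allocated = (long long)st.st_blocks * 512;
    return 0;
}
```

For a sparse or compressed file the two values can differ by orders of magnitude, which is why neither should be treated as more than a hint.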