ext4 and data loss
Your editor had not intended to write (yet) about this issue, but quite a few readers have suggested that we take a look at it. Since there is clearly interest, here is a quick look at what is going on.
Early Unix (and Linux) systems were known for losing data on a system crash. The buffering of filesystem writes within the kernel, while being very good for performance, causes the buffered data to be lost should the system go down unexpectedly. Users of Unix systems used to be quite aware of this possibility; they worried about it, but the performance loss associated with synchronous writes was generally not seen to be worth it. So application writers took great pains to ensure that any data which really needed to be on the physical media got there quickly.
More recent Linux users may be forgiven for thinking that this problem has been entirely solved; with the ext3 filesystem, system crashes are far less likely to result in lost data. This outcome is almost an accident resulting from some decisions made in the design of ext3. What's happening is this:
- By default, ext3 will commit changes to its journal every five
seconds. What that means is that any filesystem metadata
changes will be saved, and will persist even if the system
subsequently crashes.
- Ext3 does not (by default) save data written to files in the journal.
But, in the (default) data=ordered mode, any modified data
blocks are forced out to disk before the metadata changes are
committed to the journal. This forcing of data is done to ensure
that, should the system crash, a user will not be able to read the
previous contents of the affected blocks - it's a security feature.
- The end result is that data=ordered pretty much guarantees that data written to files will actually be on disk five seconds later. So, in general, only five seconds worth of writes might be lost as the result of a crash.
In other words, ext3 provides a relatively high level of crash resistance, even though the filesystem's authors never guaranteed that behavior, and POSIX certainly does not require it. As Ted put it in his excruciatingly clear and understandable explanation of the situation:
Accidental or not, the avoidance of data loss in a crash seems like a nice feature for a filesystem to have. So one might well wonder just what would have inspired the ext4 developers to take it away. The answer, of course, is performance - and delayed allocation in particular.
"Delayed allocation" means that the filesystem tries to delay the allocation of physical disk blocks for written data for as long as possible. This policy brings some important performance benefits. Many files are short-lived; delayed allocation can keep the system from writing fleeting temporary files to disk at all. And, for longer-lived files, delayed allocation allows the kernel to accumulate more data and to allocate the blocks for data contiguously, speeding up both the write and any subsequent reads of that data. It's an important optimization which is found in most contemporary filesystems.
But, if blocks have not been allocated for a file, there is no need to write them quickly as a security measure. Since the blocks do not yet exist, it is not possible to read somebody else's data from them. So ext4 will not (cannot) write out unallocated blocks as part of the next journal commit cycle. Those blocks will, instead, wait until the kernel decides to flush them out; at that point, physical blocks will be allocated on disk and the data will be made persistent. The kernel doesn't like to let file data sit unwritten for too long, but it can still take a minute or so (with the default settings) for that data to be flushed - far longer than the five seconds normally seen with ext3. And that is why a crash can cause the loss of quite a bit more data when ext4 is being used.
The real solution to this problem is to fix the applications which are expecting the filesystem to provide more guarantees than it really is. Applications which frequently rewrite numerous small files seem to be especially vulnerable to this kind of problem; they should use a smarter on-disk format. Applications which want to be sure that their files have been committed to the media can use the fsync() or fdatasync() system calls; indeed, that's exactly what those system calls are for. Bringing the applications back into line with what the system is really providing is a better solution than trying to fix things up at other levels.
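As a rough illustration of what "use fsync() or fdatasync()" means for an application, here is a minimal Python sketch. The helper name write_and_sync() and its data_only parameter are invented for this example; os.fsync() and os.fdatasync() are the standard-library wrappers around the system calls named above. It assumes a POSIX system and is meant as a sketch, not as the one right way to do it.

```python
import os


def write_and_sync(path, data, data_only=False):
    """Write bytes to path and wait until the kernel reports them flushed.

    write_and_sync and data_only are hypothetical names used only here.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write() may write fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        # fdatasync() may skip metadata that is not needed to read the
        # data back; fall back to fsync() where it is unavailable.
        if data_only and hasattr(os, "fdatasync"):
            os.fdatasync(fd)
        else:
            os.fsync(fd)
    finally:
        os.close(fd)
```

A caller would use it as write_and_sync("settings.conf", b"key=value\n"). Note that this truncates the file in place, so a crash between the truncation and the sync can still leave a short or empty file; it only shows where the sync call goes.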
That said, it would be nice to improve the robustness of the system while we're waiting for application developers to notice that they have some work to do. One possible solution is, of course, to just run ext3. Another is to shorten the system's writeback time, which is stored in a couple of sysctl variables:
/proc/sys/vm/dirty_expire_centisecs
/proc/sys/vm/dirty_writeback_centisecs
The first of these variables (dirty_expire_centisecs) controls how long written data can sit in the page cache before it's considered "expired" and queued to be written to disk; it defaults to 30 seconds. The value of dirty_writeback_centisecs (five seconds by default) controls how often the pdflush process wakes up to actually flush expired data to disk. Lowering these values will cause the system to flush data to disk more aggressively, at a cost in performance.
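For readers who want to see what their own system is set to, the following Python sketch reads the two files under /proc/sys/vm and converts centiseconds to seconds. The function names are invented for this example, and the "rough delay" figure is only the sum of the two knobs; it ignores everything else that can hold up writeback, so the real delay can be longer.

```python
import os

KNOBS = ("dirty_expire_centisecs", "dirty_writeback_centisecs")


def centisecs_to_seconds(raw):
    """Convert the text read from one of the sysctl files to seconds."""
    return int(raw.strip()) / 100.0


def read_writeback_settings(proc_dir="/proc/sys/vm"):
    """Return {knob name: seconds, or None if the file cannot be read}."""
    settings = {}
    for name in KNOBS:
        try:
            with open(os.path.join(proc_dir, name)) as f:
                settings[name] = centisecs_to_seconds(f.read())
        except OSError:
            # Not Linux, or /proc is not mounted here.
            settings[name] = None
    return settings


def rough_flush_delay(settings):
    """Back-of-the-envelope delay: expiry time plus one wakeup interval."""
    values = list(settings.values())
    if any(v is None for v in values):
        return None
    return sum(values)
```

With the defaults described above (30 seconds and five seconds), rough_flush_delay() comes out at 35 seconds.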
A third, partial solution exists in a set of patches queued for 2.6.30; they add a set of heuristics which attempt to protect users from being badly burned in certain situations. They are:
- A patch adding a new EXT4_IOC_ALLOC_DA_BLKS ioctl() command. When issued on a file, it will force ext4 to allocate any delayed-allocation blocks for that file. That will have the effect of getting the file's data to disk relatively quickly while avoiding the full cost of the (heavyweight) fsync() call.
- The second patch sets a special flag on any file which has been truncated; when that file is closed, any delayed allocations will be forced. That should help to prevent the "zero-length files" problem reported at the beginning.
- Finally, this patch forces block allocation when one file is renamed on top of another. This, too, is aimed at the problem of frequently-rewritten small files.
Together, these patches should mitigate the worst of the data loss problems
while preserving the performance benefits that come with delayed
allocation. They have not been proposed for merging at this late stage in
the 2.6.29 release cycle, though; they are big enough that they will have
to wait for 2.6.30. Distributors shipping earlier kernels can, of course,
backport the patches, and some may do so. But they should also note the
lesson from this whole episode: ext4, despite its apparent stability,
remains a very young filesystem. There may yet be a surprise or two
waiting to be discovered by its early users.
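The "rename one file on top of another" pattern that the third patch targets looks, from the application side, roughly like the sketch below. This version adds the explicit fsync() that the article recommends, so it does not depend on any filesystem heuristic. The function name replace_atomically() and the ".tmp-" prefix are invented for the example; it assumes a POSIX system where a directory can be opened and synced.

```python
import os
import tempfile


def replace_atomically(path, data):
    """Replace the contents of path with data via write, fsync, rename.

    replace_atomically is a hypothetical helper name for this sketch.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live on the same filesystem as the target,
    # so create it in the same directory. mkstemp() creates it with mode
    # 0600; a real application may want to copy the old permissions.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            # Force the new data out before the rename can be committed.
            os.fsync(tmp.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # Sync the directory so the rename itself is persistent as well.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Leaving out the two fsync() calls gives the cheaper variant which many applications actually use, and which is the case the rename patch is meant to protect.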
Index entries for this article | |
---|---|
Kernel | Filesystems/ext4 |
Posted Mar 12, 2009 1:24 UTC (Thu)
by JoeBuck (subscriber, #2330)
[Link] (18 responses)
Posted Mar 12, 2009 2:56 UTC (Thu)
by bojan (subscriber, #14302)
[Link] (17 responses)
> If you really care about making sure something is on disk, you have to use fsync or fdatasync. If you are concerned about the performance overhead of fsync(), fdatasync() is much less heavyweight, if you can arrange to make sure that the size of the file doesn't change often. You can do that via a binary database, that is grown in chunks, and rarely truncated.
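One way to read the "grown in chunks" advice is sketched below in Python: a file of fixed-size records that is extended a large chunk at a time, so that most updates overwrite existing space and only need fdatasync(). The record size, the chunk size, and all of the function names are made up for this illustration; it assumes a POSIX system, and whether fdatasync() is really cheaper depends on the filesystem.

```python
import os

RECORD_SIZE = 64        # hypothetical fixed record size, in bytes
CHUNK_RECORDS = 256     # hypothetical growth step, in records


def _grow(fd, old_size, new_size):
    """Extend the file, preallocating blocks where the platform allows."""
    try:
        os.posix_fallocate(fd, old_size, new_size - old_size)
    except (AttributeError, OSError):
        # No posix_fallocate() here: fall back to a plain size change.
        os.ftruncate(fd, new_size)


def store_record(fd, index, payload):
    """Write one record in place; grow the file only when it is too small."""
    if len(payload) > RECORD_SIZE:
        raise ValueError("payload larger than RECORD_SIZE")
    needed = (index + 1) * RECORD_SIZE
    size = os.fstat(fd).st_size
    if needed > size:
        chunk = RECORD_SIZE * CHUNK_RECORDS
        new_size = -(-needed // chunk) * chunk   # round up to a whole chunk
        _grow(fd, size, new_size)
        os.fsync(fd)    # the size changed, so pay for a full fsync() once
    os.pwrite(fd, payload.ljust(RECORD_SIZE, b"\0"), index * RECORD_SIZE)
    # Common case: the size is unchanged, so a data-only sync is enough.
    if hasattr(os, "fdatasync"):
        os.fdatasync(fd)
    else:
        os.fsync(fd)


def load_record(fd, index):
    """Read a record back; trailing zero padding is stripped."""
    return os.pread(fd, RECORD_SIZE, index * RECORD_SIZE).rstrip(b"\0")
```

The file descriptor would come from something like os.open(path, os.O_RDWR | os.O_CREAT, 0o644).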
Posted Mar 12, 2009 13:37 UTC (Thu)
by eru (subscriber, #2753)
[Link] (10 responses)
So does this mean the Linux desktops should now start using something like the Windows registry database?
Posted Mar 12, 2009 23:17 UTC (Thu)
by rahulsundaram (subscriber, #21946)
[Link] (9 responses)
Posted Mar 13, 2009 0:02 UTC (Fri)
by bojan (subscriber, #14302)
[Link] (7 responses)
Maybe the real solution is to not write them out unless absolutely necessary.
Posted Mar 13, 2009 2:00 UTC (Fri)
by rahulsundaram (subscriber, #21946)
[Link] (6 responses)
Posted Mar 13, 2009 2:46 UTC (Fri)
by quotemstr (subscriber, #45331)
[Link]
Posted Mar 13, 2009 3:37 UTC (Fri)
by bojan (subscriber, #14302)
[Link] (4 responses)
Say you want to fix a corrupt gconf XML file that is 20 lines long. The easy fix is to delete it and recreate the settings using preferences or gconf editor.
Say you want to fix a corrupt gconf XML file that is 200,000 lines long. Well, good luck not mucking it up in vi so it still parses.
> Even assuming it is, Firefox is using sqlite databases instead of inventing their own binary format.
Which, as we've seen, comes with its own set of problems on ext3. And, once again, if the DB file gets screwed, you are completely out of luck - _all_ your settings are gone. Eggs in one basket and all that.
> Binary is not necessarily evil as people seem to think.
Yeah, tell that to people with corrupt Windows registry.
> I don't see how your solution would work. When you don't write them out, you stand a higher chance of losing that data which is exactly the problem.
Nobody said anything about not writing them out. The problem is that it appears they are being written out even when _not_ required and in large numbers.
When users make changes to configuration, these are localised changes. Users don't normally change 200 settings at once. So, this will touch a very limited number of files that need to be persisted to disk using fsync. The problem is that currently hundreds of files are being persisted to disk often.
Posted Mar 13, 2009 5:32 UTC (Fri)
by eru (subscriber, #2753)
[Link] (3 responses)
> Yeah, tell that to people with corrupt Windows registry.
The binary/text distinction is rather illusory. Text is simply a binary
file that uses a subset of byte values to represent data, and
certain values as delimiters. What really matters is how a file format is
structured. A binary file can be organized so that recovering data from
it is possible (what does fsck(8) really do? Fix corruption in a complex
binary file, with the constraint that the operation must be done
in-place).
Posted Mar 13, 2009 5:56 UTC (Fri)
by bojan (subscriber, #14302)
[Link]
Yeah. I edit SQLite files in vi all the time ;-)
Posted Mar 13, 2009 13:14 UTC (Fri)
by man_ls (guest, #15091)
[Link]
The machine doesn't care, true, but to people there is a big difference between a sequence of random byte values and a sequence of written words. Just as, to me personally, there is a big difference between a text in Spanish and a set of Cyrillic Russian words.
Posted Mar 19, 2009 9:31 UTC (Thu)
by renox (guest, #23785)
[Link]
For computers yes, for humans this is very different, that's the point!
If you have a corrupted binary, it's very, very difficult for a human to fix it (unless there's a tool which fixes it "auto-magically"), whereas for a text there is still the possibility for the human to fix it.
A FS is a database, fsck is the tool to fix it (up to a point). If you add other databases in a FS, this adds the possibility of additional errors fixable only by the tools; with structured text files (JSON is nice: easy to read and to parse) you have the best of both worlds.
Posted Sep 9, 2009 22:02 UTC (Wed)
by BrucePerens (guest, #2510)
[Link]
Posted Mar 12, 2009 20:04 UTC (Thu)
by samroberts (subscriber, #46749)
[Link] (5 responses)
There is no class of applications that write data to a file and don't
For a long time fsync/O_SYNC were essentially no-ops on linux, the
That said, I sympathize with him about user's whining that data is lost
Posted Mar 12, 2009 22:46 UTC (Thu)
by man_ls (guest, #15091)
[Link] (1 responses)
Posted Mar 12, 2009 22:54 UTC (Thu)
by man_ls (guest, #15091)
[Link]
Posted Mar 13, 2009 14:45 UTC (Fri)
by jbailey (subscriber, #16890)
[Link]
My machine has certainly been writing things to disk all while I'm reading lwn here (logs, browser cache. If I were at home, it could be bittorrent, etc). My life wouldn't be any poorer if the system were to crash right now and none of that were recoverable.
Posted Mar 13, 2009 22:47 UTC (Fri)
by bojan (subscriber, #14302)
[Link]
Any application that uses temporary files is OK with data not hitting the disk.
Posted Mar 17, 2009 21:56 UTC (Tue)
by pphaneuf (guest, #23480)
[Link]
"No class of applications", you say?
/var/run being on a tmpfs makes sense (if we crash, then it's okay, they're not running anymore).
Another more practical one is my browser cache. If it got blown away on every reboot, I wouldn't really mind, and I would actually be pretty angry if my browser started doing fsync on every little thing (hmm, where have I heard this?).
Posted Mar 12, 2009 1:45 UTC (Thu)
by jimparis (guest, #38647)
[Link] (21 responses)
Posted Mar 12, 2009 6:55 UTC (Thu)
by jamesh (guest, #1159)
[Link] (19 responses)
Due to the behaviour of ext3, to write the metadata changes to disk (creation of "file.new" and rename of "file.new" to "file"), it was necessary for the file's blocks to be written out to disk so the previous contents won't be available. This is almost but not quite the same as journalling data too (it won't protect against partial writes if you cut power at the wrong time).
With ext4's delayed allocation, the metadata changes can be journalled without writing out the blocks. So in case of a crash, the metadata changes (that were journalled) get replayed, but the data changes don't.
If you journal data changes, presumably this won't happen on either ext3 or ext4. That is likely to give a performance hit though.
Posted Mar 12, 2009 8:49 UTC (Thu)
by job (guest, #670)
[Link]
When we see these patches instead of the behaviour we expect, we're confused. Is the behaviour hard to implement for some reason, or are we wrong in expecting it?
Delayed allocation is fine but I think most people expect metadata to be delayed accordingly.
Posted Mar 12, 2009 16:14 UTC (Thu)
by nye (guest, #51576)
[Link] (2 responses)
Are we really saying that ext4 commits metadata changes to disk (potentially a long time) before committing the corresponding data change?
That surely can't be right. Why on earth would you write metadata describing something which you know doesn't exist yet - and may never exist? Especially when the existing metadata describes something that does.
Perhaps what we're really saying is that ext4 does them in the correct order, but doesn't use barriers by default and hence they sometimes get written by the device in the wrong order? That would make more sense at least.
This is really confusing me.
Posted Mar 13, 2009 0:31 UTC (Fri)
by nix (subscriber, #2304)
[Link] (1 responses)
(I'd prefer it to delay the metadata operation as well, but apparently
Posted Mar 22, 2009 22:01 UTC (Sun)
by muwlgr (guest, #35359)
[Link]
Posted Mar 12, 2009 17:58 UTC (Thu)
by cpeterso (guest, #305)
[Link] (14 responses)
Posted Mar 13, 2009 0:24 UTC (Fri)
by giraffedata (guest, #1954)
[Link] (13 responses)
Because of the speedup. Since the beginning of Unix, people have sacrificed crash survivability for speed. An Ext2 filesystem after a crash can be in much worse state than this (because it doesn't journal even the metadata).
Even given user-level options to make the choice, the vast majority choose speed. So if delayed allocation makes access even faster, I can understand someone trading a higher probability of corrupting files.
As has been noted, applications that are affected are the ones that already accept a fair amount of corruption risk, so this is just a quantitative increase in risk, not qualitative.
The ext3 behavior that some people prefer is just an accident, by the way. The reason data=ordered is the default with ext3 is security, not crash resistance. The crash resistance is a by-product. Had ext3 originally done what ext4 does, people wouldn't have called it wrong.
Posted Mar 13, 2009 1:07 UTC (Fri)
by dododge (guest, #2870)
[Link] (12 responses)
For example if you shut down an XFS filesystem improperly, when it comes back up it may claim that recent files exist and even have the expected size -- but when you try to read them you might get zero blocks instead of real data. I believe JFS does the same thing.
Posted Mar 13, 2009 1:26 UTC (Fri)
by quotemstr (subscriber, #45331)
[Link] (11 responses)
Posted Mar 13, 2009 10:44 UTC (Fri)
by rahulsundaram (subscriber, #21946)
[Link] (10 responses)
Posted Mar 13, 2009 13:26 UTC (Fri)
by man_ls (guest, #15091)
[Link] (5 responses)
The real reason ext3 is popular is (or so I contend) that it is stable and crash-resistant by default. Crash resistance may have been a design accident in the beginning, but it is what got it to be the most popular filesystem for Linux. It would seem that people are not so willing to trade robustness for speed. After all the mission of a filesystem is to keep your data until you ask for it; is it any wonder that people like it when it does just that, no matter what?
Posted Mar 15, 2009 19:32 UTC (Sun)
by rahulsundaram (subscriber, #21946)
[Link]
Posted Mar 17, 2009 22:01 UTC (Tue)
by pphaneuf (guest, #23480)
[Link] (3 responses)
My favourite characteristic of the extX family of filesystems is the ability to fsck while it is mounted. Often overlooked, but wow, do you ever miss that when you have to work with another filesystem for a period of time...
Posted Mar 17, 2009 22:37 UTC (Tue)
by nix (subscriber, #2304)
[Link] (2 responses)
Posted Mar 17, 2009 22:59 UTC (Tue)
by quotemstr (subscriber, #45331)
[Link] (1 responses)
I agree, though, that even a read-only fsck of a filesystem mounted read-write doesn't seem that useful --- the on-disk state of a mounted filesystem is going to be slightly inconsistent anyway: it's likely that not everything has been flushed to disk yet.
Now a full (read and fix) fsck of a filesystem mounted read-only may be useful, and tolerably dangerous if followed immediately by a reboot.
Posted Mar 17, 2009 23:45 UTC (Tue)
by nix (subscriber, #2304)
[Link]
I still think it's a disgusting cheap hack sanctified only because that's
Posted Mar 14, 2009 15:13 UTC (Sat)
by jschrod (subscriber, #1646)
[Link] (3 responses)
Joachim
Posted Mar 19, 2009 1:26 UTC (Thu)
by xoddam (subscriber, #2322)
[Link] (2 responses)
Posted Mar 12, 2009 18:12 UTC (Thu)
by davecb (subscriber, #1574)
[Link]
On a system that predates POSIX and/or logging filesystems, you will get the behavior you expect: this is exactly the Unix V6 behavior. The data blocks will be written out, then the inode's length field will be updated, then the (atomic) rename will complete and the file will be replaced.
POSIX doesn't guarantee that: it allows people experimenting with delaying or reordering for performance reasons to weaken the guarantees.
Research filesystems tried both, and found that one could get considerable performance advantages by reordering the writes to be in elevator order, and delaying them until there was enough data to coalesce adjacent writes. Some of this is now broadly available as SCSI's "tag queueing".
Alas, if a write failed, the on-disk data was now inconsistent, and one could end up with a disk of garbage.
A former colleague, then at UofT, found he could reorder and coalesce with great benefit, so long as he inserted "barriers" into the sequence where there were correctness-critical orderings. Those had to remain, but most of the performance could be kept, with a write cache and a delay of a few seconds.
Now we're working with journaled filesystems, which reduce the cost of preserving order even more, but have separated metadata from data updates. This introduced a new opportunity to inadvertently order updates in a way that broke the older, but unpublished, correctness criteria.
Some journaled filesystems guarantee that the sequence you (and I) use is correctness-preserving. ZFS is one of these. Others, including ext3 and ext4, leave a window in which a crash will render the filesystem inconsistent. Ext3 has a small window and, for unknown reasons, ext4 has one as wide as the delay period.
I'm of the opinion both could have arbitrarily small risk periods, and with a persistent write cache or journal, both can avoid all risk.
However, changing the algorithm to one which is correctness-preserving would arguably be a better answer.
--dave
Posted Mar 12, 2009 1:51 UTC (Thu)
by aigarius (guest, #7329)
[Link] (33 responses)
Posted Mar 12, 2009 2:52 UTC (Thu)
by bojan (subscriber, #14302)
[Link] (32 responses)
Posted Mar 12, 2009 8:21 UTC (Thu)
by quotemstr (subscriber, #45331)
[Link] (31 responses)
POSIX is a set of bare minimum requirements, not a bible for a usable system. It's perfectly legitimate to give guarantees beyond the ones POSIX dictates. A working atomic rename -- file data and all --- is one such constraint that adds to the usefulness and reliability of the system as a whole.
Applications that rename() without fsync() are *not* broken. They're merely requesting transaction atomicity without transaction durability, which is a perfectly sane thing to do in many circumstances. Teaching application developers to just fsync() after every rename() is *harmful*, dammit, both to system performance and to their understanding of how the filesystem works.
Posted Mar 12, 2009 11:48 UTC (Thu)
by epa (subscriber, #39769)
[Link] (2 responses)
Posted Mar 12, 2009 14:43 UTC (Thu)
by quotemstr (subscriber, #45331)
[Link] (1 responses)
Posted Mar 12, 2009 15:01 UTC (Thu)
by epa (subscriber, #39769)
[Link]
Posted Mar 12, 2009 20:35 UTC (Thu)
by bojan (subscriber, #14302)
[Link] (27 responses)
The question still remains the same. If an application that worked on ext3 is placed into an environment that is not ext3, will it still work OK?
PS. Apps that rely on the ext3 behaviour can always demand they run only on ext3, of course ;-)
Posted Mar 12, 2009 20:40 UTC (Thu)
by quotemstr (subscriber, #45331)
[Link] (26 responses)
Posted Mar 13, 2009 0:06 UTC (Fri)
by bojan (subscriber, #14302)
[Link] (25 responses)
Posted Mar 13, 2009 0:16 UTC (Fri)
by quotemstr (subscriber, #45331)
[Link] (24 responses)
From the application's perspective, the entire sequence of "atomically replace the content of file A" failed -- file A was left in an indeterminate state. The application has no way of stating that it wants that replacement to occur in the future, but be atomic, except to use
What the application obviously meant to happen is for the filesystem to commit both the data blocks and the rename at some point in the future, but to always do it in that order. Atomic rename without that guarantee is far less useful, and explicit syncing all the time will kill performance.
These semantics are safe and useful! They don't impact performance much because the applications that need the fastest block allocated -- databases and such -- already turn off as much caching as possible and do that work internally.
Atomic-in-the-future commits may go beyond a narrow reading of POSIX, but that's not a bad thing. Are you saying that we cannot improve on POSIX?
Posted Mar 13, 2009 0:27 UTC (Fri)
by dlang (guest, #313)
[Link] (20 responses)
anything else is guesswork by the OS.
Posted Mar 13, 2009 0:46 UTC (Fri)
by nix (subscriber, #2304)
[Link] (2 responses)
Posted Mar 13, 2009 0:49 UTC (Fri)
by quotemstr (subscriber, #45331)
[Link] (1 responses)
Posted Mar 13, 2009 7:58 UTC (Fri)
by nix (subscriber, #2304)
[Link]
(Memories of the Algol 68 standard, I think it was, going to some lengths
Posted Mar 13, 2009 0:47 UTC (Fri)
by quotemstr (subscriber, #45331)
[Link] (16 responses)
Other filesystems work like ext4 does, yes. Consider XFS, which has a much smaller user base than it should, given its quality. Why is that the case? It has a reputation for data loss --- and for good reason. IMHO, it's ignoring an implied barrier created by atomic renames!
Forcing a commit of data before rename-onto-an-existing-file not only allows applications running today to work correctly, but creating an implied barrier on
Posted Mar 13, 2009 4:12 UTC (Fri)
by flewellyn (subscriber, #5047)
[Link] (15 responses)
Posted Mar 13, 2009 7:57 UTC (Fri)
by nix (subscriber, #2304)
[Link] (13 responses)
I've never seen anyone do it. Even coreutils 7.1 doesn't do it.
Posted Mar 13, 2009 8:29 UTC (Fri)
by flewellyn (subscriber, #5047)
[Link] (12 responses)
Posted Mar 13, 2009 14:50 UTC (Fri)
by foom (subscriber, #14868)
[Link] (11 responses)
Is it? If you rename from /A/file to /B/file (both on the same filesystem), what happens if the OS
While I admit not having looked, I'll bet three cookies that's perfectly allowed by POSIX.
Posted Mar 13, 2009 15:06 UTC (Fri)
by quotemstr (subscriber, #45331)
[Link] (9 responses)
Posted Mar 13, 2009 18:05 UTC (Fri)
by nix (subscriber, #2304)
[Link]
Perhaps we're in vociferous agreement, I don't know.
Posted Mar 13, 2009 22:54 UTC (Fri)
by bojan (subscriber, #14302)
[Link] (7 responses)
fsync is not gratuitous. It is the D in ACID. As you mentioned yourself, rename requires only A from ACID - and that is exactly what you get.
But, Ted being a pragmatic man, reverted this to the old behaviour, simply because he knows there is a lot of broken software out there.
The fact that good applications that never lose data are already using the correct behaviour is a case in point that this is how all applications should do it.
Performance implications of this approach are different than that of the old approach from ext3. In some cases ext4 will be faster. In others, it won't. But the main performance problem is bad applications that gratuitously write hundreds of small files to the file system. This is what is causing the real performance problem and should be fixed.
XFS received a lot of criticism, for what seem to be application problems. I wonder how many people lost files they were editing in emacs on that file system. I would venture a guess, not many.
Posted Mar 13, 2009 23:10 UTC (Fri)
by quotemstr (subscriber, #45331)
[Link] (6 responses)
Posted Mar 13, 2009 23:46 UTC (Fri)
by bojan (subscriber, #14302)
[Link] (5 responses)
Just because something worked one way in one mode of one file system, doesn't mean it is the only way it can work, nor that applications should rely on it. If you want atomicity without durability, you get it on ext4, even without Ted's most recent patches (i.e. you get the empty file). If you want durability as well, you call fsync.
> And why, pray tell, is writing files to a filesystem a bad thing?
Writing out files that have _not_ changed is a bad thing. Or are you telling me that KDE changes all of its configuration files every few minutes?
BTW, the only reason fsync is slow on ext3, is because it does sync of all files. That's something that must be fixed, because it's nonsense.
Posted Mar 14, 2009 1:58 UTC (Sat)
by quotemstr (subscriber, #45331)
[Link] (2 responses)
Posted Mar 15, 2009 6:01 UTC (Sun)
by bojan (subscriber, #14302)
[Link] (1 responses)
Except that rename(s), as specified, never actually guarantees that.
Posted Mar 15, 2009 6:04 UTC (Sun)
by bojan (subscriber, #14302)
[Link]
Posted Mar 14, 2009 12:53 UTC (Sat)
by nix (subscriber, #2304)
[Link] (1 responses)
Posted Mar 15, 2009 6:03 UTC (Sun)
by bojan (subscriber, #14302)
[Link]
Posted Mar 14, 2009 1:23 UTC (Sat)
by flewellyn (subscriber, #5047)
[Link]
If you were to write new data to the file and THEN call rename, a crash right afterwards might mean that the updates were not saved. But the only way you could lose the file's original data here is if you opened it with O_TRUNC, which is really stupid if you don't fsync() immediately after closing.
Posted Mar 17, 2009 7:12 UTC (Tue)
by jzbiciak (guest, #5246)
[Link]
That's a bit heavy for a barrier though. A barrier just needs to ensure ordering, not actually ensure the data is on the disk. Those are distinct needs.
For example, if I use mb(), I'm assured that other CPUs will see that every memory access before mb() completed before every memory access after mb(). That's it. The call to mb() doesn't ensure that the data gets written out of the cache to its final endpoint though. So, if I'm caching, say, a portion of the video display buffer, there's no guarantee I'll see the writes I made before the call to mb() appear on the screen.
Typically, though, all that's needed and desired is a mechanism to guarantee things happen in a particular order so that you move from one consistent state to the next. The atomic-replace-by-rename carries this sort of implicit barrier in many people's minds, it seems. Delaying the rename until the data actually gets allocated and committed is all this application requires. It doesn't actually require the data to be on the disk. In other words, fsync() is too big a hammer. It's like flushing the CPU cache to implement mb().
Is there an existing API that just says "keep these things in this order" without actually also spinning up the hard drive? With the move to more battery powered machines and media that wears out the more it's written to, it seems like a bad idea to ask developers to force the filesystem to do more writes.
Posted Mar 13, 2009 0:32 UTC (Fri)
by bojan (subscriber, #14302)
[Link] (2 responses)
As for performance, I'm not really sure why an implicit fsync that ext3 does would be faster than an explicit one done from the application, if they end up in exactly the same thing (i.e. both data and metadata being written to permanent storage). Unless this implicit fsync in ext3 is not actually the equivalent of fsync, but instead just something that works most of the time (i.e. is done in 5 second intervals, as per Ted's explanation).
Posted Mar 13, 2009 0:58 UTC (Fri)
by quotemstr (subscriber, #45331)
[Link] (1 responses)
Say you're updating a few hundred small files. (And before you tell me that's bad design: I disagree. A file system is meant to manage files.) If you were to fsync before renaming each one, the whole operation would proceed slowly. You'd need to wait for the disk to finish writing each file before moving on to the next, creating a very stop-and-go dynamic and slowing everything down.
On the other hand, if you write and rename all these files without an fsync, when the commit interval expires, the filesystem can pick up all these pending renames and flush all their data blocks at once. Then it can write all the rename records, at once, much improving the overall running time of the operation.
The whole thing is still safe because if the system dies at any point, each of the 200 configuration files will either refer to the complete old file or the complete new file, never some NULL-filled or zero-length strangelet.
Posted Mar 13, 2009 1:16 UTC (Fri)
by bojan (subscriber, #14302)
[Link]
I don't think that's bad design either. It is very useful to build an XML tree from many small files (e.g. gconf), instead of putting everything into one big one, which, if corrupted, will bring everything down.
> The whole thing is still safe because if the system dies at any point, each of the 200 configuration files will either refer to the complete old file or the complete new file, never some NULL-filled or zero-length strangelet.
I think that's the bit Ted was complaining about. It is unusual that changes to hundreds of configuration files would have to be done all at once. Users usually change a few things at a time (which would then be OK with fsync), so this must be some kind of automated thing doing it.
But, yeah, I understand what you're getting at in terms of performance of many fsync calls in a row.
Posted Mar 12, 2009 2:00 UTC (Thu)
by qg6te2 (guest, #52587)
[Link] (5 responses)
Posted Mar 12, 2009 5:53 UTC (Thu)
by quotemstr (subscriber, #45331)
[Link]
In practice, operation 1 has worked as described on ext2, ext3, and UFS with soft-updates, but fails on XFS and unpatched ext4. Operation 1 is perfectly sane: it's asking for atomicity without durability. KDE's configuration is a perfect candidate. Browser history is another. For a mail server or an interactive editor, of course, you'd want operation 2.
Some people suggest simply replacing operation 1 with operation 2. That's stupid. While operation 2 satisfies all the constraints of operation 1, it incurs a drastic and unnecessary performance penalty. By claiming operation 1 is simply operation 2 spelled incorrectly, you remove an important word from an application programmer's vocabulary. How else is he supposed to request atomicity without durability?
(And using a "real database" isn't a good enough answer: then you've just punted the same problem to a far heavier system, and for no good reason.)
The last patch mentioned in the article seems to make operation 1 work correctly, and that's good enough for me. Still, people need to realize that the filesystem is a database, albeit not a relational one, and that we can use database terminology to describe it.
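To make the two operations concrete, here is a minimal sketch in C, assuming "operation 1" means open-write-close-rename and "operation 2" means the same sequence with an fsync() before the close. The helper names replace_file() and write_all(), and the ".new" suffix for the temporary file, are invented for illustration; nothing here is taken from KDE or any other application mentioned in this thread.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Write all of buf to fd, continuing after short writes. Returns 0 or -1. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0)
            return -1;
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/*
 * Replace `path` with `data` by writing "<path>.new" and renaming it.
 * durable == 0: operation 1 (no fsync; atomic replacement across a crash
 *               depends on the filesystem, POSIX does not promise it).
 * durable != 0: operation 2 (fsync the new file before the rename).
 * Returns 0 on success, -1 on error.
 */
int replace_file(const char *path, const char *data, int durable)
{
    char tmp[4096];
    int fd, n;

    assert(path != NULL && data != NULL);

    n = snprintf(tmp, sizeof tmp, "%s.new", path);
    if (n < 0 || (size_t)n >= sizeof tmp)
        return -1;

    fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    if (write_all(fd, data, strlen(data)) < 0 ||
        (durable && fsync(fd) < 0)) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    if (close(fd) < 0 || rename(tmp, path) < 0) {
        unlink(tmp);
        return -1;
    }
    return 0;
}
```

The only difference between the two paths is the fsync() call, which is exactly where the performance cost being argued about comes from.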
Posted Mar 12, 2009 19:19 UTC (Thu)
by SLi (subscriber, #53131)
[Link] (3 responses)
The problem is, then you cannot talk about performance. Disks are slow,
While it's in a sense unfortunate that in ext4 this happening is more
The solution of applications fsync()ing their critical data is not only
Posted Mar 13, 2009 5:19 UTC (Fri)
by qg6te2 (guest, #52587)
[Link] (2 responses)
An appeal can be made to have better written applications, or more practically, an acceptance can be made that in the real world apps are never perfect. A file system needs to deal with that (no matter what is guaranteed by POSIX) and provide a reasonable trade-off between speed and safety.
In the case of ext3, whether by side effect or design, this trade-off is at a good point. Mounting with the "sync" option sacrifices too much speed, while in the current version of ext4 the trade-off is too aggressively in the direction of speed. Not everybody can afford a UPS, nor should a UPS be required to have a disk with sane contents after a crash.
Posted Mar 13, 2009 13:17 UTC (Fri)
by jwarnica (subscriber, #27492)
[Link]
General purpose distros assume that you have what, a gig or two of memory. Not everyone can afford memory, either. And there are special case systems which would never have that kind of memory. So if you have a shitty computer, you run either older versions, or specially targeted distros. And if you are building an embedded system, you make choices appropriately.
In 2009, if you choose to have a crippled system that doesn't have a UPS, then choose your filesystem carefully.
Posted Mar 13, 2009 16:43 UTC (Fri)
by SLi (subscriber, #53131)
[Link]
And for the case when it's not sane, there's f(data)sync().
Posted Mar 12, 2009 8:46 UTC (Thu)
by mjthayer (guest, #39183)
[Link] (5 responses)
Posted Mar 12, 2009 11:03 UTC (Thu)
by eru (subscriber, #2753)
[Link] (4 responses)
I believe this is more or less how a log-structured file system works:
http://en.wikipedia.org/wiki/Log-structured_file_system.
For some reason the idea is not in very common use.
Posted Mar 12, 2009 11:11 UTC (Thu)
by mjthayer (guest, #39183)
[Link] (1 responses)
Posted Mar 13, 2009 8:54 UTC (Fri)
by mjthayer (guest, #39183)
[Link]
Posted Mar 13, 2009 0:23 UTC (Fri)
by nix (subscriber, #2304)
[Link] (1 responses)
Posted Mar 13, 2009 10:18 UTC (Fri)
by job (guest, #670)
[Link]
Posted Mar 12, 2009 14:03 UTC (Thu)
by ricwheeler (subscriber, #4980)
[Link] (11 responses)
Applications that need to ensure data integrity should take specific steps, including:
* use fsync() when you hit state that you would like to survive a crash
It is pretty trivial to get data loss in any file system if you misconfigure and use sloppy assumptions.
If you have a boat load of apps which fail, you can easily configure your box (write cache disabled, nodelalloc for ext4, etc) to take the safe (and slow!) path.
Posted Mar 12, 2009 16:06 UTC (Thu)
by JoeBuck (subscriber, #2330)
[Link] (3 responses)
Posted Mar 12, 2009 18:05 UTC (Thu)
by cpeterso (guest, #305)
[Link] (1 responses)
Posted Mar 19, 2009 2:19 UTC (Thu)
by xoddam (subscriber, #2322)
[Link]
Posted Mar 12, 2009 20:27 UTC (Thu)
by ricwheeler (subscriber, #4980)
[Link]
You clearly don't want to blindly call fsync or use SYNC mode for normal operation.
Most applications have reasonable points where an fsync would make sense. If I remember correctly, firefox went a bit over the top trying to keep its internal database crash resistant.
For apps that really care about performance and data integrity both, you can try to batch operations - just like databases batch multiple transactions into a single commit.
The file system equivalent: when writing a bunch of files, you can write them all without fsync, then go back and reopen/fsync them as a batch. Try it; it will give you close to non-fsynced performance and a clear sense of when data is safely on disk.
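The batching idea can be sketched roughly as follows. This is an illustration only: the function name write_batch_then_sync() is made up, error handling is minimal, and how much faster this is than syncing each file as it is written depends on the filesystem and was not measured here.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/*
 * Phase 1: write every file without syncing.
 * Phase 2: reopen each file and fsync() it.
 * Returns 0 on success, -1 on the first error.
 */
int write_batch_then_sync(const char *const *paths,
                          const char *const *contents, size_t count)
{
    size_t i;

    assert(paths != NULL && contents != NULL);

    /* Phase 1: plain buffered writes, no fsync yet. */
    for (i = 0; i < count; i++) {
        size_t len = strlen(contents[i]);
        int fd = open(paths[i], O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return -1;
        if (write(fd, contents[i], len) != (ssize_t)len) {
            close(fd);
            return -1;
        }
        if (close(fd) < 0)
            return -1;
    }

    /* Phase 2: reopen (without truncating) and force each file out. */
    for (i = 0; i < count; i++) {
        int fd = open(paths[i], O_WRONLY);
        if (fd < 0)
            return -1;
        if (fsync(fd) < 0) {
            close(fd);
            return -1;
        }
        if (close(fd) < 0)
            return -1;
    }
    return 0;
}
```

Once the function returns 0, every file in the batch has been passed through fsync(), which gives the "clear sense of when data is on disk" described above.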
Posted Mar 12, 2009 16:15 UTC (Thu)
by iabervon (subscriber, #722)
[Link] (5 responses)
So the sensible thing to do is to treat rename as dependent on the data write. Now, any program that truncates a file and then writes to it will tend to lead to 0-length files in a system crash, but that also tends to lead to 0-length files in an application crash or in a race with other code, or at least be a case where it's okay and expected to not find any particular file contents. If your desktop software actually erases all of your files and then hopes to be able to write new contents into them before anybody notices, that is an application bug, but using ext3 won't change the fact that it's failure-prone even aside from filesystem robustness.
Posted Mar 13, 2009 0:11 UTC (Fri)
by giraffedata (guest, #1954)
[Link] (4 responses)
It doesn't talk about system crashes (it wouldn't be practical to specify what a system does when it's broken), but it heavily implies crash-related function. It also does not specify that data will have been written after fsync -- POSIX is more abstract than that. The POSIX user doesn't know what a cache is; he doesn't know there's a disk drive holding his files. In POSIX, write() writes to a file. It doesn't schedule a write for later, it writes it immediately. But it allows (by implication) that certain kinds of system failures can cause previously written data to disappear from a file. It then goes on to introduce the concept of "stable storage" -- fsync() causes previously written data to be stored that way. fsync() isn't about specific I/O operations; what it does is harden previously written data so that these certain kinds of system failures can't destroy it.
POSIX is, incidentally, notoriously silent on just how stable stable is, leaving it up to the designer's imagination which system failures it hardens against. And there is a great spectrum of stability. For example, I know of no implementation where fsync hardens data against a disk drive head crash. I do know of implementations where it doesn't harden it against a data center power outage.
Posted Mar 13, 2009 2:14 UTC (Fri)
by iabervon (subscriber, #722)
[Link] (3 responses)
That is, you can think of "stable storage" as a process that reads the filesystem sometimes, and, after a crash, repopulates it with what it read last, and that fsync will only return after one of these reads has taken place following your call. You don't know what "stable storage" read, and it can have all of the same sorts of race conditions and time skew that any other concurrent process can. If the filesystem matches some such snapshot, it's the user or application's carelessness if anything is lost; if the filesystem doesn't match any such snapshot, it's crash-related filesystem damage.
Posted Mar 13, 2009 2:51 UTC (Fri)
by quotemstr (subscriber, #45331)
[Link] (2 responses)
Posted Mar 13, 2009 15:26 UTC (Fri)
by iabervon (subscriber, #722)
[Link]
Posted Mar 13, 2009 16:04 UTC (Fri)
by giraffedata (guest, #1954)
[Link]
You mean undesirable. It's obviously acceptable because you and most of your peers accept it every day. Even ext3 comes back after a crash with the filesystem in a state it was not in at any instant before the crash. The article points out that it does so to a lesser degree than some other filesystem types because of the 5 second flush interval instead of the more normal 30 (I think) and because two particular kinds of updates are serialized with respect to each other.
And since you said "system" instead of "filesystem", you have to admit that gigabytes of state are different after every reboot. All the processes have lost their stack variables, for instance. Knowing this, applications write their main memory to files occasionally. Knowing that even that data isn't perfectly stable, some of them also fsync now and then. Knowing that even that isn't perfectly stable, some go further and take backups and such.
It's all a matter of where you draw the line -- what you're willing to trade.
Posted Mar 13, 2009 0:25 UTC (Fri)
by nix (subscriber, #2304)
[Link]
Posted Mar 12, 2009 18:19 UTC (Thu)
by anton (subscriber, #25547)
[Link] (3 responses)
IMO the real solution is to keep the applications the same, and fix
the file system; we need to fix just one file system, and can relegate
all the others that don't give the guarantees to special-purpose
niches where data integrity is unimportant.
What guarantee should the file system give? A good one would be
this: If the application leaves consistent data if it is terminated
unexpectedly without a system crash (e.g. with SIGKILL), the data
should also be consistent in case of a system crash (although possibly
old without fsync()). One way to give this guarantee is to implement in-order
semantics.
I would welcome an article about the consistency guarantees that
Btrfs gives (maybe in a comparison with other file systems). Judging from the lack of
documentation of the guarantees (at least in prominent places), there
seems to be little interest from file system developers in this area
yet, but an article focusing on that topic may improve that state of
affairs.
Concerning the subject of my comment: Whenever someone mentions XFS, someone else
reports a story about data loss, and that's why he's no longer using
XFS. It seems that ext4 aspires to the same ideals as XFS: high
performance, large data handling capabilities, and it does not care
much for the user's data in the case of a crash. I guess ext4 will
then play a similar role among Linux users as XFS has.
Posted Mar 12, 2009 20:27 UTC (Thu)
by droundy (subscriber, #4559)
[Link]
Posted Mar 13, 2009 3:31 UTC (Fri)
by dgc (subscriber, #6611)
[Link] (1 responses)
Indeed, the tricks being played to close the reported holes in ext4
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-...
Concerning the subject title, ext4 has been replicating XFS features
Posted Mar 16, 2009 11:01 UTC (Mon)
by nye (guest, #51576)
[Link]
Don't forget the "files not modified in months are now inexplicably filled with nulls" problems that it had :P.
Posted Mar 13, 2009 2:52 UTC (Fri)
by bojan (subscriber, #14302)
[Link] (6 responses)
http://flamingspork.com/talks/2007/06/eat_my_data.odp
Interesting.
Posted Mar 13, 2009 5:46 UTC (Fri)
by qg6te2 (guest, #52587)
[Link] (5 responses)
Posted Mar 13, 2009 6:08 UTC (Fri)
by bojan (subscriber, #14302)
[Link] (4 responses)
It's not even the rule for ext3. You can easily switch to writeback and get:
> Data ordering is not preserved - data may be written into the main file system after its metadata has been committed to the journal. This is rumoured to be the highest-throughput option. It guarantees internal file system integrity, however it can allow old data to appear in files after a crash and journal recovery.
There must be dozens of other file systems in all sorts of POSIX compatible OSes that don't behave that way (i.e. data=ordered). So, fixing one file system isn't going to be a good enough solution, I think.
What's wrong with applying correct idioms in applications, the way emacs (and vim?) do?
Posted Mar 13, 2009 6:57 UTC (Fri)
by qg6te2 (guest, #52587)
[Link] (3 responses)
A simple "write data to disk" operation would have unnecessary complexity, as the slides show (a collection of #ifdefs and run-time ifs). This is insane. The operating system (and hence by extension, the underlying filesystem) is supposed to abstract things away, not make things harder.
A sane filesystem should have the previous version of a file available intact, no matter when the crash occurred. To put it another way, why replicate the "safe save" functionality in each app when it can be done once in the filesystem ?
Posted Mar 13, 2009 10:41 UTC (Fri)
by bojan (subscriber, #14302)
[Link] (2 responses)
Unfortunately, that's the status of POSIX right now. And the complexity can be put into a library for everyone to share.
> A sane filesystem should have the previous version of a file available intact, no matter when the crash occurred. To put it another way, why replicate the "safe save" functionality in each app when it can be done once in the filesystem ?
Because the reality is that right now POSIX doesn't demand it, so your app is bound to bump into a file system here and there that requires exactly that. An app written safely will work with both types of file system semantics. The opposite is not true.
Posted Mar 15, 2009 13:53 UTC (Sun)
by kleptog (subscriber, #1183)
[Link] (1 responses)
1. I care that my application works correctly in the face of crashes on Linux on the default filesystem, in which case the above fixes will do it.
2. I care that my application works correctly in the face of crashes on any POSIX compliant OS, in which case you need to fix the app.
Unfortunately I come across a lot of code where the writer didn't even consider the problem, leaving bogus state even if you just kill the application at the wrong moment. I sincerely hope this brouhaha will at least cause more people to pay attention to the issue.
Posted Mar 15, 2009 18:08 UTC (Sun)
by skybrian (guest, #365)
[Link]
Also, what the most carefully written apps do isn't particularly relevant to what the filesystem should do. The choice for filesystem writers is:
a) implement just the bare standards, and most people won't use your filesystem because their apps lose data, even if it's faster.
b) implement nicer semantics so that people will actually prefer your filesystem over others. Decreased data loss after a system crash, even for poorly written apps, is such a feature.
It's the same tradeoff that exists for people who write web browsers. When the standards are too weak to achieve compatibility with most apps, you have to go beyond them. You need both good performance and good compatibility.
Without this patch, ext4 would not be competitive with ext3.
Posted Mar 16, 2009 9:30 UTC (Mon)
by tarvin (guest, #4412)
[Link] (1 responses)
/proc/sys/vm/dirty_writeback_centisecs
I suppose there is a slight error in the article. Or is Fedora 10 nonstandard regarding this?
Posted Mar 16, 2009 13:55 UTC (Mon)
by corbet (editor, #1)
[Link]
Posted Sep 9, 2009 22:30 UTC (Wed)
by Richard_J_Neill (subscriber, #23093)
[Link] (1 responses)
But in all other cases, if the disk is idle, surely the OS should flush as soon as it possibly can? What benefit occurs from waiting 30 seconds (to have more efficient writes) if the disk isn't running flat-out at this instant?
Posted Sep 11, 2009 15:48 UTC (Fri)
by nix (subscriber, #2304)
[Link]
So, with ext3 we should avoid fsync because it can cause seconds of delay for the whole system (because of data ordering constraints), but with ext4 we should fsync because otherwise data are not saved. Hmm.
ext4 and data loss
You can do that via a binary database, that is grown in chunks, and rarely truncated.
Linux, meet the Registry
> Binary is not necessarily evil as people seem to think.
Linux, meet the Registry
The binary/text distinction is not illusory; it is a cognitive issue. Limiting file contents to printable characters (not just byte values since you can use multi-byte characters) makes it possible for people to edit them. Text files do not usually contain just random characters; they contain readable words that can be understood and documented rather easily.
Illusory?
Linux, meet the Registry
ext4 and data loss
expect it to be written to disk.
attitude of the kernel developers being "apps call write(), the kernel
will put it on disk when it's efficient to do so" and "linux is not a
real-time OS". Now Ted is calling such applications "badly written"? B.S.
when their OS crashes. If your operating system crashes, you lose all
guarantees that it worked. Such is life. Either use an OS that doesn't
crash, or run filesystems in real-time modes that write data to disk as
soon as possible after the app does the file write, and live with the
performance loss.
Or... stay with ext3?
It seemed to work fine
... which is of course your second option. Sorry, not having enough coffee these days.
It seemed to work fine
ext4 and data loss
I read the bug discussion but don't fully understand what's going on here.
Assume the code is
fd = open("file.new", O_TRUNC|O_WRONLY|O_CREAT, 0644);
write(fd, "hi", 2);
close(fd);
rename("file.new", "file");
Are we saying that in ext4, the rename can happen many tens-of-seconds before the data "hi" is actually allocated and written?
That's concerning. I'd expect that if I do that sequence of commands, the write would happen before the rename, and a crash would lead to:
(1) Nothing gets changed (ie. the old contents are still there),
if it's been less than 30 seconds
(2) The new contents are there, if the crash happens after 30 seconds.
Anything else might be POSIX-correct but that's going to break a lot of assumptions in existing code, I think.
Or am I misunderstanding things here?
ext4 and data loss
cause the blocks to be aggressively flushed if the file is closed and was
originally opened via O_TRUNC, or if the file is renamed on top of another
one.
that's really hard. Knowing what a nightmare it is to get rename() right,
I can understand that doing it lazily might not be anyone's cup of tea.)
ext4 and data loss
With ext4's delayed allocation, the metadata changes can be journalled without writing out the blocks. So in case of a crash, the metadata changes (that were journalled) get replayed, but the data changes don't.
This is so broken. How can anyone think this is a good idea? Or an "upgrade" from ext3?
ext4 and data loss
If you shut down an XFS filesystem improperly, when it comes back up it may claim that recent files exist and even have the expected size -- but when you try to read them you might get zero blocks instead of real data. I believe JFS does the same thing.
Is it any wonder, then, that XFS and JFS are seldom used despite their otherwise-wonderful characteristics?
ext4 and data loss
That sounds like a circular argument: distros don't have XFS or JFS experts because nobody cares about them anymore, and nobody cares about them because distros don't have experts. But the code to all these filesystems is open and has been there for a long while; why do distros have ext3 experts to begin with?
ext4 and data loss
filesystem, often with umount-or-reboot-pleeze following it? Because your
early userspace is too deficient to fsck / before mounting it?
ext4 and data loss
It even has special behaviour (messages and exit codes) to tell you when
you have to reboot because it just modified a mounted filesystem.
the only way Unix systems have traditionally been able to fsck /. Now
Linux has early userspace, there is no excuse for it at all other than
back-compatibility with people who don't have an initramfs or initrd (how
many of them are there? Not many, I'd wager).
ext4 and data loss
POSIX is a set of bare minimum requirements, not a bible for a usable system. It's perfectly legitimate to give guarantees beyond the ones POSIX dictates. A working atomic rename -- file data and all -- is one such constraint that adds to the usefulness and reliability of the system as a whole.
That's all very well, but such a guarantee has never in fact been made. (If you can find something in the ext3 documentation that makes such a promise, I will eat my words.)
Well first, that's the way it's worked in practice for years, documentation be damned. Second, these semantics are implied by the description of data=ordered.
ext4 and data loss
Second, these semantics are implied by the description of data=ordered.
You could be right: I always thought of data=ordered as promising 'no garbage blocks in files that were enlarged just before a crash' but it could be taken as promising more.
ext4 and data loss
I don't think Emacs is wrong here, actually. In an interactive editor, I want durability and atomicity. I'm simply pointing out that sometimes it's appropriate to want atomicity without durability, and under those circumstances, using rename without fsync is the right thing to do.
ext4 and data loss
That's from the filesystem's perspective.
ext4 and data loss
open-write-close-rename. The filesystem should ensure that the entire operation happens atomically, which means flushing the file-to-be-renamed's data blocks before the rename record is written.
ext4 and data loss
of order. I've never read any code, no matter how old, that took any
measures to allow for this.
ext4 and data loss
post-crash state, so this is merely a QoI, but an important one.
to define the behaviour of the system under unspecified circumstances in
which the power was cut, which were meant to include things like
earthquakes.)
I agree. An explicit barrier interface would be nice. Right now, however, rename-onto-an-existing-file almost always expresses the intent to create such a barrier, and the filesystem should respect that intent. In practice, it's nearly always worked that way. UFS with soft-updates guarantees data blocks are flushed before metadata ones. ZFS goes well beyond that and guarantees the relative ordering of every write. And the vast majority of the time, on ext3, an atomic rename without an fsync has the same effect as it does on these other filesystems.
ext4 and data loss
rename provides a very elegant way to detect the barrier the application developer almost certainly meant to write, but couldn't.
ext4 and data loss
combine it with an fsync() of both the source and target directories,
without which you are risking data loss?
ext4 and data loss
> if there's a crash. That's guaranteed by rename() semantics.
decides to write out the new directory metadata for /A immediately, but delay writing /B until an
hour from now? (for performance, don't-cha-know) And then the machine crashes. So now you're
left with no file at all.
ext4 and data loss
While I admit not having looked, I'll bet three cookies that's perfectly allowed by POSIX.
You know what else is also allowed by POSIX?
Come on. Adhering to POSIX is no excuse for a poor implementation! Even Windows adheres to POSIX, and you'd have to be loony to claim it's a good Unix. Look: the bare minimum durability requirements that POSIX specifies are just not sufficient for a good and reliable system. rename must introduce a write barrier with respect to the data blocks for the file involved or we will lose. Not only will you not get every programmer and his dog to insert a gratuitous fsync in the write sequence, but doing so would actually be harmful to system performance.
ext4 and data loss
rename must introduce a write barrier with respect to the data blocks for the file involved or we will lose.
But this is exactly the behaviour that ext4 isn't currently implementing (although it will be, by default).
ext4 and data loss
It is the D in ACID. As you mentioned yourself, rename requires only A from ACID - and that is exactly what you get.
That's my whole point: sometimes you want atomicity without durability. rename without fsync is how you express that. Except on certain recent filesystems, it's always worked that way. ext4 not putting a write barrier before rename is a regression.
But the main performance problem is bad applications that gratuitously write hundreds of small files to the file system.
And why, pray tell, is writing files to a filesystem a bad thing? Writing plenty of small files is a perfectly legitimate use of the filesystem. If a filesystem buckles in that scenario, it's the fault of the filesystem, not the application. Blaming the application is blaming the victim.
ext4 and data loss
Just because something worked one way in one mode of one file system...
There's plenty of precedent. The original Unix filesystem worked that way. UFS works that way with soft-updates. ZFS works that way. There are plenty of decent filesystems that will provide atomic replace with rename.
...you get it on ext4, even without Ted's most recent patches (i.e. you get the empty file).
Not from the perspective of the whole operation you don't. You set out trying to replace the contents of the file called /foo/bar, atomically. If /foo/bar ends up being a zero-length file, the intended operation wasn't atomic. That's like saying you don't need any synchronization for a linked list because the individual pointer modifications are atomic. Atomic replacement of a file without forcing an immediate disk sync is something a decent filesystem should provide. Creating a write barrier on rename is an elegant way to do that.
ext4 and data loss
mind someone else's data showing up in your partially-synced files after
reboot. Oh, wait, that's a security hole.
ext4 and data loss
Data-before-rename isn't just an fsync when rename is called. That's one way of implementing a barrier, but far from the best. Far better would be to keep track of all outstanding rename requests, and flush the data blocks for the renamed file before the rename record is written out. The actual write can happen far in the future, and these writes can be coalesced.
ext4 and data loss
"... POSIX never really made any such guarantee"
ext4 and data loss
Perhaps the POSIX standard should be rewritten then. The overall philosophy of Unix is to abstract away mundane things such as how data is stored on disk. The user has a moral right to expect that as little data as possible was wiped out if a machine crashes (especially if caused by an OS fault and not hardware).
Applications which want to be sure that their files have been committed to the media can use the fsync() or fdatasync() system calls
I vehemently disagree with this. It will simply cause everybody to use fsync() all the time as a blunt but simple solution to the "state of disk" problem. Which in turn will lead to lower performance, until it is taken as "common knowledge" that calls to fsync() are hints rather than real requests. Which would of course make fsync() useless.
/proc/sys/vm/dirty_expire_centiseconds
/proc/sys/vm/dirty_writeback_centiseconds
Perhaps the above two settings can be managed automatically as a way of going around the fsync() issue. For example, the more data there is waiting to be dumped to disk, the higher the risk of loss, and hence the shorter the disk commit intervals should be. This will of course reduce the effectiveness of delayed allocation, but performance without safety is not performance at all, especially if the user has to regenerate the lost data.
The fundamental problem is that there are two similar but different operations an application developer can request:
1. open(A)-write(A,data)-close(A)-rename(A,B): replace the contents of B with data, atomically. I don't care when or even if you make the change, but whenever you get around to it, make sure either the old or the new version is in place.
2. open(A)-write(A,data)-fsync(A)-close(A)-rename(A,B): replace the contents of B with data, and do it now.
ext4 and data loss
want the behavior you describe.
slower than you think because your system has been caching for years.
likely than in ext3 (and it's exactly that, it's still very possible in
ext3), applications relying on that not happening are broken even in
ext3-land, because it does happen (if your system crashes, which shouldn't
happen very often - get a UPS and hardware that does not need binary
drivers).
the best solution - it's virtually also the only solution, if you want to
combine any guarantee about data integrity with any performance that isn't
from 1995.
ext4 and data loss
crash all the time (in fact I can't remember when it last did, must have
been in something like 2005). If it gives a speedup measured in tens of
percents, it's the only sane thing to do.
ext4 and data loss
Why not just have a journal (metadata and data lumped together) with fixed blocks for data which does not yet have its own blocks?
ext4 and data loss
non-zero cost because they lead to fragmentation hell in short order. You
don't always want to access your files in the same order in which you
wrote them; they should be clustered differently. LFSes make that
distinctly nontrivial to do.
ext4 and data loss
* when using rename, you need to fsync() both the source and target directories
* make sure that barriers are enabled if not using a battery backed storage device or disable the write cache on your disk
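The directory-sync step from the list above might look like the following in C. This is a sketch under stated assumptions: sync_dir() and rename_and_sync() are hypothetical helper names, the caller passes the directory paths explicitly rather than having them derived from the file names, and the file's own data is assumed to have been fsync()ed already.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Open a directory and fsync() it so changes to its entries are flushed. */
int sync_dir(const char *dir)
{
    int fd, rc;

    assert(dir != NULL);

    fd = open(dir, O_RDONLY | O_DIRECTORY);
    if (fd < 0)
        return -1;
    rc = fsync(fd);
    if (close(fd) < 0)
        rc = -1;
    return rc;
}

/*
 * Rename src to dst, then sync the target and source directories.
 * Does NOT sync the file contents; do that before calling this.
 * Returns 0 on success, -1 on error.
 */
int rename_and_sync(const char *src, const char *src_dir,
                    const char *dst, const char *dst_dir)
{
    if (rename(src, dst) < 0)
        return -1;
    if (sync_dir(dst_dir) < 0)
        return -1;
    return sync_dir(src_dir);
}
```

When source and target live in the same directory, the second sync_dir() call is redundant but harmless.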
Try doing that on an ext3 system, and the performance of your system will go down significantly, to the point where users reject it. The fsync calls will cost you, big time. Firefox tried doing this and the Linux users nearly killed them.
ext4 and data loss
ext3's slow fsync()
ext4 and data loss
I'm not sure that POSIX even specifies that fsync() or fdatasync() will be particularly useful in a system crash; it does specify that your data will have been written when the system call returns, but that doesn't mean that the system crash won't completely or even selectively destroy your filesystem.
ext4 and data loss
Beyond POSIX, I think that what users of a modern enterprise-quality *nix OS writing to a good-reliability filesystem expect is that operations which POSIX says are atomic with respect to other processes are usually atomic with respect to processes after a crash (mostly of the unexpected halt variety).
In an ideal world, that would be exactly what you'd see: after a cold restart, the system would come up in some state the system was in at a time close to the crash, not some made-up non-existent state the filesystem cobbles together from bits of wreckage. Most filesystems weaken this guarantee somewhat, but leaving NULL-filled and zero-length files that never actually existed on the running system is just unacceptable.
fsync() forces the other processes to see the operation having happened
Huh? fsync has nothing to do with what other processes see. fsync only forces a write to stable storage; it has no effect on the filesystem as seen from a running system. In your terminology, it just forces the conceptual "filesystem" process to take a snapshot at that instant.
ext4 and data loss
In an ideal world, that would be exactly what you'd see: after a cold restart, the system would come up in some state the system was in at a time close to the crash, not some made-up non-existent state the filesystem cobbles together from bits of wreckage.
The model works if you include the fact that, in a system crash, unintended things are, by definition, happening. Any failure of the filesystem to make up a possible state afterwards appears as fallout from the crash. Maybe some memory corruption changed your file descriptors, and your successful writes and successful close were some other file (but the subsequent rename found the original names). Maybe something managed to write zeros over your file lengths. It's not a matter of standards how often undefined behavior leads to noticeable problems, but it is a matter of quality.
>fsync has nothing to do with what other processes see. fsync only forces a write to stable storage; it has no effect on the filesystem as seen from a running system. In your terminology, it just forces the conceptual "filesystem" process to take a snapshot at that instant.
That's what I meant to say: it makes the "filesystem" process see everything that had already happened. (And, by extension, processes that run after the system restarts, looking at the filesystem recovered from stable storage)
ext4 and data loss
hardware. People's desktops, and consumer systems more generally, cannot.
The problem has nothing to do with delayed allocation, nor with the
commit interval. It has to do with the classic mistake
of writing metadata without writing the corresponding data. A
file system can easily delay allocation of file data for a minute and
still preserve the data during crashes: It just needs to write the
metadata for the new file after the data; and of course the rename
metadata and the corresponding deletion of the old file data should be
written even later. And of course the file system needs to ensure
with barriers that this is written in the right order to disk.
ext4 is the new XFS
>The real solution to this problem is to fix the applications which are
>expecting the filesystem to provide more guarantees than it really is.
Why should it be "the real solution" to change thousands of
applications to deal with crash-vulnerable file systems? Even if the
application authors all agreed with this idea, how would they know
that their applications are not expecting more than the file system
guarantees?
>Bringing the applications back into line with what the system is
>really providing is a better solution than trying to fix things up at
>other levels.
That's just wrong. But more importantly, it won't happen. So better
bring the system in line with what the applications are expecting; for
now, ext3 looks like the good-enough solution (despite Linux doing the
wrong thing (no barriers) by default), and hopefully we will have file
systems that actually give data consistency guarantees in the future.
ext4 is the new XFS
in this scenario have been fixed. XFS is now much more careful to
correctly order data and metadata updates and so the "XFS ate my files"
problems have pretty much disappeared.
would appear to be copied from XFS. e.g. the flush-after-truncate
trick went into XFS back in June 2006:
without paying attention to the fixes that had been made to those
features in the past couple of years. Hence ext4 introduced the bugs
that everyone (incorrectly) continues to flame XFS for. Now ext4 is
replicating the XFS fixes to said bugs. ext4 is still going to be
playing catchup for some time.... ;)
ext4 is the new XFS
>in this scenario have been fixed. XFS is now much more careful to
>correctly order data and metadata updates and so the "XFS ate my files"
>problems have pretty much disappeared.
An interesting link
On slide 84 the following idealistic assertion is made:
An interesting link
Perhaps the applications are buggy to people living in ivory towers. The ext3 ordered mode should be the rule for how a filesystem should behave, not the exception. Practicality suggests that fixing the underlying filesystem is more time- and cost-efficient than fixing 100,000 apps.
An interesting link
What's wrong with applying correct idioms in applications, the way emacs (and vim?) do?
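For readers wondering what such an idiom looks like, here is a minimal sketch in Python of the usual write-to-temporary, fsync, rename sequence. It is not taken from emacs or vim; the function name atomic_write and the ".tmp-" prefix are made up for illustration, and the directory fsync at the end is an extra step that some applications skip.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace `path` with `data` so that a reader after a crash should
    find either the complete old contents or the complete new ones."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory so that the final
    # rename never crosses a filesystem boundary.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # empty Python's userspace buffer
            os.fsync(tmp.fileno())  # ask the kernel to write the data out
        # Only after the data has been synced is the old name replaced.
        os.replace(tmp_path, path)
    except BaseException:
        try:
            os.unlink(tmp_path)     # do not leave a stray temporary behind
        except FileNotFoundError:
            pass
        raise
    # Assumption: syncing the directory makes the rename itself durable.
    # Some platforms refuse fsync on a directory, so failure is tolerated.
    try:
        dir_fd = os.open(directory, os.O_RDONLY)
    except OSError:
        return
    try:
        os.fsync(dir_fd)
    except OSError:
        pass
    finally:
        os.close(dir_fd)
```

A call such as atomic_write("settings.conf", b"key=value\n") would replace the file in one step. Note that this sketch does not preserve the ownership or permissions of the file it replaces, which a real editor would have to handle.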
Name of sysctl paths
and
/proc/sys/vm/dirty_expire_centisecs
There was a slight error in the article, yes. Sorry for any confusion...
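As a small illustration, the /proc/sys/vm/dirty_expire_centisecs file named above holds a plain integer in hundredths of a second. The sketch below reads such a file and converts the value to seconds; the helper name read_centisecs is hypothetical, and the default path only exists on Linux.

```python
def read_centisecs(path: str = "/proc/sys/vm/dirty_expire_centisecs") -> float:
    """Read an integer number of centiseconds from `path` and return seconds."""
    with open(path) as f:
        return int(f.read().strip()) / 100.0
```

Passing a different path lets the same helper read any other sysctl file that stores its value in centiseconds.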
Why wait, if the disk is idle?
themselves down a bit to save power.