ext4 and data loss

Posted Mar 12, 2009 2:52 UTC (Thu) by bojan (subscriber, #14302)
In reply to: ext4 and data loss by aigarius
Parent article: ext4 and data loss

But this would then mean that we _should_ write applications that work with extX file system only, instead of POSIX, which is exactly what seem to have (unfortunately) happened. If an extX user changes the filesystem to non-extX (of which there are many), then these apps may break again. I don't think that's a very good deal.

ext4 and data loss

Posted Mar 12, 2009 8:21 UTC (Thu) by quotemstr (subscriber, #45331) [Link] (31 responses)

That's rubbish. How many applications do you use that will work on a bare-bones POSIX system? It's perfectly legitimate to rely on facilities that aren't in POSIX.

POSIX is a set of bare minimum requirements, not a bible for a usable system. It's perfectly legitimate to give guarantees beyond the ones POSIX dictates. A working atomic rename -- file data and all --- is one such constraint that adds to the usefulness and reliability of the system as a whole.

Applications that rename() without fsync() are *not* broken. They're merely requesting transaction atomicity without transaction durability, which is a perfectly sane thing to do in many circumstances. Teaching application developers to just fsync() after every rename() is *harmful*, dammit, both to system performance and to their understanding of how the filesystem works.

ext4 and data loss

Posted Mar 12, 2009 11:48 UTC (Thu) by epa (subscriber, #39769) [Link] (2 responses)

POSIX is a set of bare minimum requirements, not a bible for a usable system. It's perfectly legitimate to give guarantees beyond the ones POSIX dictates. A working atomic rename -- file data and all --- is one such constraint that adds to the usefulness and reliability of the system as a whole.

That's all very well, but such a guarantee has never in fact been made. (If you can find something in the ext3 documentation that makes such a promise, I will eat my words.)

ext4 and data loss

Posted Mar 12, 2009 14:43 UTC (Thu) by quotemstr (subscriber, #45331) [Link] (1 responses)

Well first, that's the way it's worked in practice for years, documentation be damned. Second, these semantics are implied by the description of data=ordered.

ext4 and data loss

Posted Mar 12, 2009 15:01 UTC (Thu) by epa (subscriber, #39769) [Link]

Second, these semantics are implied by the description of data=ordered.

You could be right: I always thought of data=ordered as promising 'no garbage blocks in files that were enlarged just before a crash' but it could be taken as promising more.

ext4 and data loss

Posted Mar 12, 2009 20:35 UTC (Thu) by bojan (subscriber, #14302) [Link] (27 responses)

I think emacs programmers may disagree with you on this.

The question still remains the same. If an application that worked on ext3 is placed into an environment that is not ext3, will it still work OK?

PS. Apps that rely on the ext3 behaviour can always demand they run only on ext3, of course ;-)

ext4 and data loss

Posted Mar 12, 2009 20:40 UTC (Thu) by quotemstr (subscriber, #45331) [Link] (26 responses)

I don't think Emacs is wrong here, actually. In an interactive editor, I want durability and atomicity. I'm simply pointing out that sometimes it's appropriate to want atomicity without durability, and under those circumstances, using rename without fsync is the right thing to do.

ext4 and data loss

Posted Mar 13, 2009 0:06 UTC (Fri) by bojan (subscriber, #14302) [Link] (25 responses)

I was under the impressions that you do get atomicity. The zero length file (i.e. the one that has not been made durable yet, because fsync was not called) gets renamed to the other file just fine. No?

ext4 and data loss

Posted Mar 13, 2009 0:16 UTC (Fri) by quotemstr (subscriber, #45331) [Link] (24 responses)

That's from the filesystem's perspective.

From the application's perspective, the entire sequence of "atomically replace the content of file A" failed -- file A was left in an indeterminate state. The application has no way of stating that it wants that replacement to occur in the future, but be atomic, except to use open-write-close-rename. The filesystem should ensure that the entire operation happens atomically, which means flushing the file-to-be-renamed's data blocks before the rename record is written.

What the application obviously meant to happen is for the filesystem to commit both the data blocks and the rename as some point in the future, but to always do it in that order. Atomic rename without that guarantee is far less useful, and explicit syncing all the time will kill performance.

These semantics are safe and useful! They don't impact performance much because the applications that need the fastest block allocated -- databases and such -- already turn off as much caching as possible and do that work internally.

Atomic-in-the-future commits may go beyond a narrow reading of POSIX, but that's not a bad thing. Are you saying that we cannot improve on POSIX?

ext4 and data loss

Posted Mar 13, 2009 0:27 UTC (Fri) by dlang (guest, #313) [Link] (20 responses)

what's needed is the ability for the code to insert a barrier, saying 'I want everything before this point done before you do anything after this point'

anything else is guesswork by the OS.

ext4 and data loss

Posted Mar 13, 2009 0:46 UTC (Fri) by nix (subscriber, #2304) [Link] (2 responses)

I didn't think POSIX filesystem operations were allowed to be executed out
of order. I've never read any code, no matter how old, that took any
measures to allow for this.

ext4 and data loss

Posted Mar 13, 2009 0:49 UTC (Fri) by quotemstr (subscriber, #45331) [Link] (1 responses)

They're only out of order from the point of view of a system recovering from a crash. Otherwise, there'd be an even larger furor. :-) I think the way rename is used, it was always intended to have barrier semantics, and sane filesystems should respect that intent.

ext4 and data loss

Posted Mar 13, 2009 7:58 UTC (Fri) by nix (subscriber, #2304) [Link]

Strongly seconded. I suppose nothing is ever really *guaranteed* about
post-crash state, so this is merely a QoI, but an important one.

(Memories of the Algol 68 standard, I think it was, going to some lengths
to define the behaviour of the system under unspecified circumstances in
which the power was cut, which were meant to include things like
earthquakes.)

ext4 and data loss

Posted Mar 13, 2009 0:47 UTC (Fri) by quotemstr (subscriber, #45331) [Link] (16 responses)

I agree. An explicit barrier interface would be nice. Right now, however, rename-onto-an-existing-file almost always expresses the intent to create such a barrier, and the filesystem should respect that intent. In practice, it's nearly always worked that way. UFS with soft-updates guarantees data blocks are flushed before metadata ones. ZFS goes well beyond that and guarantees the relative ordering of every write. And the vast majority of the time, on ext3, an atomic rename without an fsync has the same effect as it does on these other filesystems.

Other filesystems work like ext4 does, yes. Consider XFS, which has a much smaller user base than it should, given its quality. Why is that the case? It has a reputation for data loss --- and for good reason. IMHO, it's ignoring an implied barriers created by atomic renames!

Forcing a commit of data before rename-onto-an-existing-file not only allows applications running today to work correctly, but creating an implied barrier on rename provides a very elegant way to detect the barrier the application developer almost certainly meant to write, but couldn't.

ext4 and data loss

Posted Mar 13, 2009 4:12 UTC (Fri) by flewellyn (subscriber, #5047) [Link] (15 responses)

POSIX defines an explicit barrier. It's called fsync().

ext4 and data loss

Posted Mar 13, 2009 7:57 UTC (Fri) by nix (subscriber, #2304) [Link] (13 responses)

And how often have you seen applications that do cross-directory rename()s
combine it with an fsync() of both the source and target directories,
without which you are risking data loss?

I've never seen anyone do it. Even coreutils 7.1 doesn't do it.

ext4 and data loss

Posted Mar 13, 2009 8:29 UTC (Fri) by flewellyn (subscriber, #5047) [Link] (12 responses)

If you just rename(), then the file will continue to exist at either the old or the new location, even if there's a crash. That's guaranteed by rename() semantics. You can't cross filesystems with it, either, so there's no I/O of the actual data.

ext4 and data loss

Posted Mar 13, 2009 14:50 UTC (Fri) by foom (subscriber, #14868) [Link] (11 responses)

> If you just rename(), then the file will continue to exist at either the old or the new location, even
> if there's a crash. That's guaranteed by rename() semantics.

Is it? If you rename from /A/file to /B/file (both on the same filesystem), what happens if the OS
decides to write out the new directory metadata for /A immediately, but delay writing /B until an
hour from now? (for performance, don't-cha-know) And then the machine crashes. So now you're
left with no file at all.

While I admit not having looked, I'll bet three cookies that's perfectly allowed by POSIX.

ext4 and data loss

Posted Mar 13, 2009 15:06 UTC (Fri) by quotemstr (subscriber, #45331) [Link] (9 responses)

While I admit not having looked, I'll bet three cookies that's perfectly allowed by POSIX.

You know what else is also allowed by POSIX?

Rejecting filenames longer than 14 characters, or that begin with a hyphen
Refusing to create more than 8 hard links to a file
Not having job control
Copying a process's entire address space on fork
Making all IO synchronous

Come on. Adhering to POSIX is no excuse for a poor implementation! Even Windows adheres to POSIX, and you'd have to be loony to claim it's a good Unix. Look: the bare minimum durability requirements that POSIX specifies are just not sufficient for a good and reliable system. rename must introduce a write barrier with respect to the data blocks for the file involved or we will lose. Not only will you not get every programmer and his dog to insert a gratuitous fsync in the write sequence, but doing so would actually be harmful to system performance.

ext4 and data loss

Posted Mar 13, 2009 18:05 UTC (Fri) by nix (subscriber, #2304) [Link]

rename must introduce a write barrier with respect to the data blocks for the file involved or we will lose.

But this is exactly the behaviour that ext4 isn't currently implementing (although it will be, by default).

Perhaps we're in vociferous agreement, I don't know.

ext4 and data loss

Posted Mar 13, 2009 22:54 UTC (Fri) by bojan (subscriber, #14302) [Link] (7 responses)

> Not only will you not get every programmer and his dog to insert a gratuitous fsync in the write sequence, but doing so would actually be harmful to system performance.

fsync is not gratuitous. It is the D in ACID. As you mentioned yourself, rename requires only A form ACID - and that is exactly what you get.

But, Ted being a pragmatic man, reverted this to the old behaviour, simply because he knows there is a lot of broken software out there.

The fact that good applications that never lose data are already using the correct behaviour is case in point that this is how all applications should do it.

Performance implications of this approach are different than that of the old approach from ext3. In some cases ext4 will be faster. In others, it won't. But the main performance problem is bad applications that gratuitously write hundreds of small files to the file system. This is what is causing the real performance problem and should be fixed.

XFS received a lot of criticism, for what seem to be application problems. I wonder how many people lost files they were editing in emacs on that file system. I would venture a guess, not many.

ext4 and data loss

Posted Mar 13, 2009 23:10 UTC (Fri) by quotemstr (subscriber, #45331) [Link] (6 responses)

It is the D in ACID. As you mentioned yourself, rename requires only A form ACID - and that is exactly what you get.

That's my whole point: sometimes you want atomicity without durability. rename without fsync is how you express that. Except on certain recent filesystems, it's always worked that way. ext4 not putting a write barrier before rename is a regression.

But the main performance problem is bad applications that gratuitously write hundreds of small files to the file system.

And why, pray tell, is writing files to a filesystem a bad thing? Writing plenty of small files is a perfectly legitimate use of the filesystem. If a filesystem buckles in that scenario, it's the fault of the filesystem, not the application. Blaming the application is blaming the victim.

ext4 and data loss

Posted Mar 13, 2009 23:46 UTC (Fri) by bojan (subscriber, #14302) [Link] (5 responses)

> That's my whole point: sometimes you want atomicity without durability. rename without fsync is how you express that. Except on certain recent filesystems, it's always worked that way. ext4 not putting a write barrier before rename is a regression.

Just because something worked one way in one mode of one file system, doesn't mean it is the only way it can work, nor that applications should rely on it. If you want atomicity without durability, you get it on ext4, even without Ted's most recent patches (i.e. you get the empty file). If you want durability as well, you call fsync.

> And why, pray tell, is writing files to a filesystem a bad thing?

Writing out files that have _not_ changed is a bad thing. Or are you telling me that KDE changes all of its configuration files every few minutes?

BTW, the only reason fsync is slow on ext3, is because it does sync of all files. That's something that must be fixed, because it's nonsense.

ext4 and data loss

Posted Mar 14, 2009 1:58 UTC (Sat) by quotemstr (subscriber, #45331) [Link] (2 responses)

Just because something worked one way in one mode of one file system...

There's plenty of precedent. The original Unix filesystem worked that way. UFS works that way with soft-updates. ZFS works that way. There are plenty of decent filesystems that will provide atomic replace with rename.

...you get it on ext4, even without Ted's most recent patches (i.e. you get the empty file).

Not from the perspective of the whole operation you don't. You set out trying to replace the contents of the file called /foo/bar, atomically. If /foo/bar ends up being a zero-length file, the intended operation wasn't atomic. That's like saying you don't need any synchronization for a linked list because the individual pointer modifications are atomic. Atomic replacement of a file without forcing an immediate disk sync is something a decent filesystem should provide. Creating a write barrier on rename is an elegant way to do that.

ext4 and data loss

Posted Mar 15, 2009 6:01 UTC (Sun) by bojan (subscriber, #14302) [Link] (1 responses)

> Creating a write barrier on rename is an elegant way to do that.

Except that rename(s), as specified, never actually guarantees that.

ext4 and data loss

Posted Mar 15, 2009 6:04 UTC (Sun) by bojan (subscriber, #14302) [Link]

That should have been rename(2), of course.

ext4 and data loss

Posted Mar 14, 2009 12:53 UTC (Sat) by nix (subscriber, #2304) [Link] (1 responses)

Sure. Fixing fsync() being sync() on ext3 is easy, as long as you don't
mind someone else's data showing up in your partially-synced files after
reboot. Oh, wait, that's a security hole.

ext4 and data loss

Posted Mar 15, 2009 6:03 UTC (Sun) by bojan (subscriber, #14302) [Link]

Actually, in ordered mode it should be made a no-op by default. The fact that it locks the machine up is a major regression.

ext4 and data loss

Posted Mar 14, 2009 1:23 UTC (Sat) by flewellyn (subscriber, #5047) [Link]

No, because rename() is only changing the metadata. The data of the file itself has not been changed by that call.

If you were to write new data to the file and THEN call rename, a crash right afterwards might mean that the updates were not saved. But the only way you could lose the file's original data here is if you opened it with O_TRUNC, which is really stupid if you don't fsync() immediately after closing.

ext4 and data loss

Posted Mar 17, 2009 7:12 UTC (Tue) by jzbiciak (guest, #5246) [Link]

That's a bit heavy for a barrier though. A barrier just needs to ensure ordering, not actually ensure the data is on the disk. Those are distinct needs.

For example, if I use mb(), I'm assured that other CPUs will see that every memory access before mb() completed before every memory access after mb(). That's it. The call to mb() doesn't ensure that the data gets written out of the cache to its final endpoint though. So, if I'm caching, say, a portion of the video display buffer, there's no guarantee I'll see the writes I made before the call to mb() appear on the screen. Typically, though, all that's needed and desired is a mechanism to guarantee things happen in a particular order so that you move from one consistent state to the next.

The atomic-replace-by-rename carries this sort of implicit barrier in many peoples' minds, it seems. Delaying the rename until the data actually gets allocated and committed is all this application requires. It doesn't actually require the data to be on the disk.

In other words, fsync() is too big a hammer. It's like flushing the CPU cache to implement mb().

Is there an existing API that just says "keep these things in this order" without actually also spinning up the hard drive? With the move to more battery powered machines and media that wears out the more it's written to, it seems like a bad idea to ask developers to force the filesystem to do more writes.

ext4 and data loss

Posted Mar 13, 2009 0:32 UTC (Fri) by bojan (subscriber, #14302) [Link] (2 responses)

What I'm trying to say is that there already are file systems out there that work the way ext4 works. And, it seems to me from this bug report that there are applications out there that already figured out the only real way of making things both durable and atomic using POSIX calls.

As for performance, I'm not really sure why an implicit fsync that ext3 does would be faster than an explicit one done from the application, if they end up in exactly the same thing (i.e. both data and metadata being written to permanent storage). Unless this implicit fsync in ext3 is not actually the equivalent of fsync, but instead just something that works most of the time (i.e. is done in 5 second intervals, as per Ted's explanation).

ext4 and data loss

Posted Mar 13, 2009 0:58 UTC (Fri) by quotemstr (subscriber, #45331) [Link] (1 responses)

Data-before-rename isn't just an fsync when rename is called. That's one way of implement a barrier, but far from the best. Far better would be to keep track of all outstanding rename requests, and flush the data blocks for the renamed file before the rename record is written out. The actual write can happen far in the future, and these writes can be coalesced.

Say you're updating a few hundred small files. (And before you tell me that's bad design: I disagree. A file system is meant to manage files.) If you were to fsync before renaming each one, the whole operation would proceed slowly. You'd need to wait for the disk to finish writing each file before moving on to the next, creating a very stop-and-go dynamic and slowing everything down.

On the other hand, if you write and rename all these files without an fsync, when the commit interval expires, the filesystem can pick up all these pending renames and flush all their data blocks at once. Then it can write all the rename records, at once, much improving the overall running time of the operation.

The whole thing is still safe because if the system dies at any point, each of the 200 configuration files will either refer to the complete old file or the complete new file, never some NULL-filled or zero-length strangelet.

ext4 and data loss

Posted Mar 13, 2009 1:16 UTC (Fri) by bojan (subscriber, #14302) [Link]

> And before you tell me that's bad design: I disagree. A file system is meant to manage files.

I don't think that's bad design either. It is very useful to build an XML tree from many small files (e.g. gconf), instead of putting everything into one big one, which, if corrupted, will bring everything down.

> The whole thing is still safe because if the system dies at any point, each of the 200 configuration files will either refer to the complete old file or the complete new file, never some NULL-filled or zero-length strangelet.

I think that's the bit Ted was complaining about. It is unusual that changes to hundreds of configuration files would have to be done all at once. Users usually change a few things at a time (which would then be OK with fsync), so this must be some kind of automated thing doing it.

But, yeah, I understand what you're getting at in terms of performance of many fsync calls in a row.