Solving the ext3 latency problem

Posted Apr 14, 2009 18:40 UTC (Tue) by mrshiny (subscriber, #4266)
Parent article: Solving the ext3 latency problem

I don't understand this:

> There is one other important change needed to get a truly quick fsync() with ext3, though: the filesystem must be mounted in data=writeback mode.

Is this because the changes to fsync are disabled in data=ordered, or just because the performance gains are small compared to the overhead of data=ordered?

I'm curious because if fsync is slow, application developers won't use it, even if it's fast on some systems. It will be years before application developers start using it "properly" again.

Solving the ext3 latency problem

Posted Apr 14, 2009 20:48 UTC (Tue) by elanthis (guest, #6227)

data=ordered is the existing slow behavior.

Solving the ext3 latency problem

Posted Apr 16, 2009 5:22 UTC (Thu) by butlerm (subscriber, #13312)

"data=ordered" provides nearly ideal recovery semantics, but it is stronger
than necessary to provide reasonable recovery behavior in most cases. A
strict interpretation of data=ordered means committing dirty data to disk
before any metadata updates. That means that calling fsync on any file
with dirty buffers is equivalent in cost to calling fsync on every file
with dirty buffers in the filesystem.
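
One way to observe the cost described above is to time fsync() on a small file while other files on the same filesystem have dirty data. The following is only a minimal Python sketch, not a benchmark: the function name timed_fsync and the default payload size are assumptions made for illustration, and the numbers will vary with the filesystem, its mount options, and the kernel.

```python
import os
import time


def timed_fsync(path, payload=b"x" * 4096):
    """Append payload to path and return the seconds spent inside fsync().

    Hypothetical helper for illustration; it measures one call only.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, payload)      # dirty a small amount of data
        start = time.monotonic()
        os.fsync(fd)               # ask the kernel to push it to stable storage
        return time.monotonic() - start
    finally:
        os.close(fd)
```

Running this once on an idle filesystem and once while another process is writing a large file to the same filesystem should give a rough feel for how much unrelated dirty data contributes to the delay.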

Since data=ordered tends to interfere with getting real work done without
stalling, the question is what kinds of relaxations can be made without
imperiling the integrity of your filesystem. "data=writeback" is the
no-holds-barred, assume your system is never going to crash, tough luck for
any recently touched files (but you probably won't have to spend hours
waiting for fsck) sort of preference.

Fortunately, there is a lot of room for reasonable, safer relaxations
between data=ordered and data=writeback. data=guarded is one such option:
it allows preliminary metadata commits for unrelated files to proceed
with a smaller file size corresponding to the file data that has actually
been written to disk. That works really well as long as you are not trying
to replace an existing file. If you are doing rename replacements, the same
problem comes back to haunt you in a way that data=guarded doesn't solve.
(Rename undo would...)
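
For readers unfamiliar with the "rename replacement" pattern mentioned above, here is a minimal Python sketch of how an application typically does it when it does not want to rely on the filesystem's ordering mode. The function name replace_file and the ".tmp" suffix are assumptions for illustration, not anything taken from the article or the patches under discussion.

```python
import os


def replace_file(path, data):
    """Replace the contents of path with data using write, fsync, rename."""
    tmp = path + ".tmp"                 # hypothetical temporary file name
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()                       # push Python's buffer to the kernel
        os.fsync(f.fileno())            # force the new data to disk first
    os.replace(tmp, path)               # atomically swap the name over
    # Also sync the directory so the rename itself is durable.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

The fsync() before the rename is exactly the call whose latency this thread is about: leaving it out makes the replacement fast, but then it is up to the filesystem whether the new data reaches disk before the rename does.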

Solving the ext3 latency problem

Posted Apr 16, 2009 9:42 UTC (Thu) by nye (guest, #51576)

However, in the case of file replacement via rename or truncate, ext3 (and ext4) will now flush the data to disk before the associated metadata anyway, even using data=writeback, so data=guarded does indeed solve that problem.

Solving the ext3 latency problem

Posted Apr 17, 2009 14:03 UTC (Fri) by anton (subscriber, #25547)

> A strict interpretation of data=ordered means committing dirty data to disk before any metadata updates.

I'm not sure I agree, but anyway, if it behaves that way, that's fine with me. I like my data not only on disk, but also internally consistent.

> "data=writeback" is the no holds barred assume your system is never going to crash [...] sort of preference.

But if I assume my system is never going to crash, why would I be using fsync()? And why should a file system that works based on that assumption do anything when the application calls fsync()?

> Fortunately, there is a lot of room for reasonable, safer relaxations between data=ordered and data=writeback.

I would actually prefer to see something stricter than data=ordered. Something that gives me the guarantee that the state after a crash corresponds to some logical state of the file system before the crash.

Until I get that, I'll just go for data=ordered and hope that the Linux developers don't break it like they did with data=journal.

Solving the ext3 latency problem

Posted Nov 10, 2009 12:00 UTC (Tue) by schabi (guest, #14079)

> I would actually prefer to see something stricter than data=ordered. Something that gives me the guarantee that the state after a crash corresponds to some logical state of the file system before the crash.

You always have the option to mount with "data=journal" - this is the safest and slowest mode with ext3. And don't forget that RAID5 / RAID6 will break all barrier / journal semantics for all filesystems.
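
To check which journaling mode a filesystem is actually using, one can look for the data= option in its mount options. The following Python sketch parses text in the format of /proc/mounts; the function name data_mode is an assumption for illustration. Note that when no data= option is listed, the filesystem's default is in effect, and whether the default is printed at all depends on the kernel.

```python
def data_mode(mounts_text, mount_point):
    """Return the data= option listed for mount_point, or None if absent.

    mounts_text is expected to be in /proc/mounts format:
    device mount_point fstype options dump pass
    """
    mode = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[1] != mount_point:
            continue
        mode = None                      # a later entry overrides an earlier one
        for option in fields[3].split(","):
            if option.startswith("data="):
                mode = option[len("data="):]
    return mode
```

On a Linux system this could be called as data_mode(open("/proc/mounts").read(), "/").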