Solving the ext3 latency problem

Posted Apr 16, 2009 22:23 UTC (Thu) by chad.netzer (subscriber, #4257)
In reply to: Solving the ext3 latency problem by dtlin
Parent article: Solving the ext3 latency problem

Is there a quick explanation as to why? (Or a link? I'll google for it in any case)

BTW, Documentation/filesystems/ext4.txt in current linux repo seems to contradict your statement. I can understand how delayed allocation can affect the situation (since certain data need never be written to the disk at all, even if metadata changes), but for allocated data, how does the ext4 situation differ from ext3 writeback mode?

http://lwn.net/Articles/203915/

Solving the ext3 latency problem

Posted Apr 17, 2009 6:52 UTC (Fri) by bojan (subscriber, #14302) [Link]

Yeah, confusing, isn't it? Relevant part of the docs, diffed:

 * writeback mode
-In data=writeback mode, ext3 does not journal data at all.  This mode provides
+In data=writeback mode, ext4 does not journal data at all.  This mode provides
 a similar level of journaling as that of XFS, JFS, and ReiserFS in its default
 mode - metadata journaling.  A crash+recovery can cause incorrect data to
 appear in files which were written shortly before the crash.  This mode will
-typically provide the best ext3 performance.
+typically provide the best ext4 performance.

It would be really good if Ted could comment if the above was simply copied from ext3 docs or if it is really still true for ext4 in writeback mode as well.

Solving the ext3 latency problem

Posted Apr 18, 2009 16:14 UTC (Sat) by sbergman27 (guest, #10767) [Link] (3 responses)

From Ted, in a relatively recent interview, speaking on *ext4* performance tips:

"If you don't need the security guarantees of what happens after a crash that are provided by data=ordered, try using the data=writeback mount option."

http://www.linux-mag.com/id/7272/2/
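
To see which data= mode a mounted ext3 or ext4 filesystem is actually using, one option is to read the mount options from /proc/mounts. The following is a minimal sketch, not anything from the interview: the function names are hypothetical, and whether the default mode is listed at all depends on the kernel version, so a missing data= entry only means "filesystem default".

```python
def data_mode(mount_options):
    """Return the value of the data= option, or None if it is not listed."""
    for opt in mount_options.split(","):
        key, sep, value = opt.partition("=")
        if key == "data" and sep:
            return value
    return None


def ext_data_modes(mounts_text):
    """Map mount point -> data= mode for ext3/ext4 lines in /proc/mounts format."""
    modes = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields are: device, mount point, fs type, options, dump, pass.
        if len(fields) >= 4 and fields[2] in ("ext3", "ext4"):
            modes[fields[1]] = data_mode(fields[3])
    return modes
```

On a Linux system, passing the contents of /proc/mounts to ext_data_modes() would give one entry per ext3/ext4 mount point.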

Solving the ext3 latency problem

Posted Apr 18, 2009 23:22 UTC (Sat) by bojan (subscriber, #14302) [Link] (2 responses)

Compare that to this comment:

Fundamentally, the problem is caused by data=ordered mode. This problem can be avoided by mounting the filesystem using data=writeback or by using a filesystem that supports delayed allocation such as ext4. This is because if you have a small sqlite database which you are fsync()'ing, and in another process you are writing a large 2 megabyte file, the 2 megabyte file won't be allocated right away, and so the fsync operation will not force the dirty blocks of that 2 megabyte file to disk; since the blocks haven't been allocated yet, there is no security issue to worry about with the previous contents of newly allocated blocks if the system were to crash at that point.

Contradictory, isn't it?

http://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-length-file-problem/

Solving the ext3 latency problem

Posted Apr 19, 2009 0:59 UTC (Sun) by sbergman27 (guest, #10767) [Link]

I guess.

On a related note, if he thinks that writeback is good enough for ext3 because, after all, nobody runs Linux with multiple users... then is writeback also destined to be the default for ext4? Or is the idea to destabilize the thus far rock solid ext3 enough to make ext4 look better by comparison?

Solving the ext3 latency problem

Posted Apr 19, 2009 2:56 UTC (Sun) by sitaram (guest, #5959) [Link]

Actually, maybe not. This is not about security; as far as security goes ext4 is the same. What he is saying (or what I understand him to be saying) is that in ext4, due to delayed allocation, the performance issue with data=ordered is alleviated.

So I'd see this as "delayed allocation makes ordered almost as efficient as writeback", not "...makes writeback as secure as ordered"

Solving the ext3 latency problem

Posted Apr 19, 2009 4:27 UTC (Sun) by tytso (subscriber, #9993) [Link] (5 responses)

OK, so there are multiple issues when people talk about "safety" and data=writeback. First of all, for ext3. In the security dimension, with data=writeback, after a crash where the filesystem was not cleanly unmounted, files that were written right before the crash might contain uninitialized data. This data could be from another user, although on a single-user system, this is obviously not very likely. How seriously you take this really depends on how paranoid you are. Even on a single-user system, this data might contain information that you don't want to send out publicly, and if you don't notice that a file contains something other than what you thought before you send it to someone else, there is a potential for a security exposure. Obviously, though, on a single-user system this is much less of an issue than on a timesharing system.

In the data loss department, if you have an application that didn't use fsync(), and the system crashes, with data=writeback there is a chance of data loss. In 2.6.30, Linus accepted patches which will cause an implied flush operation when a heuristic detects an application trying to replace an existing file via the replace-via-truncate and replace-via-rename patterns. This largely reduces the problems for non-fsync-using applications. It doesn't solve the problem for a freshly written file, but the system could just as easily have crashed five seconds earlier.
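
For applications that would rather not rely on the heuristic, the replace-via-rename pattern can be made explicit: write the new contents to a temporary file, fsync() it, and only then rename it over the old file. A minimal sketch, assuming a POSIX system; the function name replace_file_durably is hypothetical, and the temporary file gets mkstemp's default permissions rather than those of the file it replaces.

```python
import os
import tempfile


def replace_file_durably(path, data):
    """Replace path with data (bytes) using write-temp, fsync, rename."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory so the rename
    # stays within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # force the new contents to disk first
        os.replace(tmp_path, path)  # atomic rename over the old file
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Also fsync the directory so the rename itself reaches disk.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

After a crash, a reader of the file should then see either the complete old contents or the complete new contents, regardless of the data= mode in use.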

OK, so how does ext4 change things? By default ext4 on modern kernels (ignoring the technology preview on RHEL 5 and Fedora 10) performs delayed allocation. This means that the data blocks are not allocated right away when you write the file, but only when they are forced out, either explicitly via fsync(), or via the page writeback algorithms in the VM, which will tend to push things out after 30-45 seconds (ignoring laptop mode) and perhaps sooner if the system is short on memory.

In the security dimension, what this means is that even in data=writeback mode, in general on a crash the file will be truncated or zero-length instead of containing uninitialized data. In ext4 with delayed allocation and data=writeback, there *is* a very tiny race condition: if a transaction closes right after the pdflush daemon allocates the filesystem block but before it has a chance to trigger the page writeback, you might end up with uninitialized garbage. This chance is very small, but it is non-zero. In this case, ext4 data=ordered will force the write to disk, so it is technically safer in the security dimension, although this race is very hard to exploit, and it is very rarely hit in practice. (This is also why the difference in overhead between data=ordered and data=writeback is much smaller for ext4, thanks to delayed allocation; the two are not the same, however!)

In the safety against applications that don't use fsync department, as of 2.6.30, ext4 will always do an implied allocation and flush for data=ordered and data=writeback. So there is no real difference here between data=ordered and data=writeback.

The bottom line is that while there is some performance benefit in going with data=writeback with ext4, the differences between data=ordered and data=writeback are much smaller with ext4, in both the cost and benefit dimensions.

Chris Mason is also working on a data=guarded mode, which will cause files to be truncated (much like delayed allocation) on a crash with ext3. I will look into porting this mode into ext4, if it proves to be enough of a performance advantage for ext4 over data=ordered, and yet providing a tiny bit more safety than data=writeback. It's not clear to me that it will be worth it for ext4, however.

I hope this helps answer the questions about ext3 versus ext4, and data=ordered versus data=writeback.

Regards,

Ted.

Solving the ext3 latency problem

Posted Apr 19, 2009 8:07 UTC (Sun) by sitaram (guest, #5959) [Link]

Thank you...

I'm one of those people for whom the security aspect is far more important (*) than data loss -- data loss can happen for so many other reasons that one should have a good, reliable, backup regime anyway, so one more reason doesn't bother me.

So ext3: people with my mindset should stick with data=ordered. (I don't see guarded as being too useful for ext3 -- we'll probably have switched to ext4 by the time guarded becomes mainstream).

Ext4: I think I'll stick with ordered here too. If the overhead has been much reduced by delayed alloc, it correspondingly reduces the main advantage of writeback too :-) I'd rather err on the side of security when the difference is minor.

Although collectively we like choice, and we *need* choice, when it comes to actual usage, we have to rationally reduce the many choices available into one and say "*this* is what we will use"!

Thanks once again for jumping in and helping with that!

Sitaram

(*) My home desktop is used by my kids also, for instance -- so it *is* a multi-user machine in the old traditional sense. The work machine runs email and office apps as one user, and my web browser and IRC as another user (simultaneously), so -- while both users are still me -- it too is multi user in the sense of wanting to keep two disparate sets of files separate.

Solving the ext3 latency problem

Posted Apr 19, 2009 22:35 UTC (Sun) by bojan (subscriber, #14302) [Link]

Thank you kindly for your detailed reply.

Solving the ext3 latency problem

Posted Apr 19, 2009 22:50 UTC (Sun) by bojan (subscriber, #14302) [Link] (2 responses)

> Chris Mason is also working on a data=guarded mode, which will cause files to be truncated (much like delayed allocation) on a crash with ext3. I will look into porting this mode into ext4, if it proves to be enough of a performance advantage for ext4 over data=ordered, and yet providing a tiny bit more safety than data=writeback. It's not clear to me that it will be worth it for ext4, however.

Unless there is a significant performance penalty by updating metadata only after the data has been written, instead of having another mode, this is probably how writeback mode should work.

Solving the ext3 latency problem

Posted Apr 20, 2009 0:29 UTC (Mon) by tytso (subscriber, #9993) [Link] (1 responses)

> Unless there is a significant performance penalty by updating metadata only after the data has been written, instead of having another mode, this is probably how writeback mode should work.

(Note that data=guarded is only deferring the update of i_size, and not any other form of metadata.)

We'll have to benchmark it and see. It does mean that i_size gets updated more, and so that means that the inode has to get updated as blocks are staged out to disk, so that means some extra writes to the journal and inode table. I don't think it should be noticeable, at least for most workloads, since it should be lost in the noise of the data block I/O, but it is extra seeks and extra writes.
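
To make the "i_size gets updated more" point concrete, here is a toy model, not kernel code. It assumes 4096-byte blocks and that the on-disk i_size may only advance over the contiguous run of blocks whose data has already reached disk; the function names and that rule are illustrative assumptions, not taken from the data=guarded patches.

```python
BLOCK_SIZE = 4096  # assumed block size for the toy model


def guarded_isize_updates(completion_order):
    """Return the on-disk i_size after each block write completes.

    completion_order lists block indices in the order their data reaches
    disk. The on-disk i_size only advances over the contiguous prefix of
    blocks already written, so it never covers a block with missing data.
    """
    written = set()
    isize = 0
    history = []
    for block in completion_order:
        written.add(block)
        # Advance i_size across every block that is now safely on disk.
        while isize // BLOCK_SIZE in written:
            isize += BLOCK_SIZE
        history.append(isize)
    return history


def count_inode_updates(history):
    """Count how many times the on-disk i_size actually changed."""
    updates = 0
    previous = 0
    for size in history:
        if size != previous:
            updates += 1
            previous = size
    return updates
```

In this model, four blocks completing in order cost four i_size updates, while a single update at the end would cost one; those extra inode and journal writes are the overhead that would need benchmarking.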

Solving the ext3 latency problem

Posted Apr 20, 2009 3:43 UTC (Mon) by bojan (subscriber, #14302) [Link]

Thanks for the explanation.

I guess if i_size could be updated just once, when all the blocks are pushed out, then this would be even less of a problem. But, then again, I have no idea how this actually works inside the code, so this suggestion is probably naive.