Solving the ext3 latency problem
The problem, in short, is this: the ext3 filesystem, when running in the default data=ordered mode, can exhibit lengthy stalls when a process calls fsync() to flush data to disk. This issue most famously manifested itself as the much-lamented Firefox system-freeze problem, but it goes well beyond Firefox: any time there is reasonably heavy I/O going on, an fsync() call can bring everything to a halt for several seconds, and stalls on the order of minutes have been reported. This behavior has tended to discourage the use of fsync() in applications, and it makes the Linux desktop less fun to use. It is clearly worth fixing, but nobody did so for years.
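To make the problem concrete, here is a minimal sketch, not taken from the article or the patches under discussion, of how the latency can be observed: a small C program that times a single fsync() call (link with -lrt on older glibc). Run it while something else is writing heavily to the same ext3 filesystem in data=ordered mode and it will report stalls like those described above; the file name is just an example.

    /* Time one fsync() on a small file; run alongside a heavy writer on the
     * same filesystem to observe the stall.  Illustrative sketch only. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    int main(void)
    {
        const char buf[] = "small update\n";
        struct timespec t0, t1;
        int fd = open("testfile", O_WRONLY | O_CREAT | O_APPEND, 0644);

        if (fd < 0) {
            perror("open");
            return 1;
        }
        if (write(fd, buf, strlen(buf)) < 0)
            perror("write");

        clock_gettime(CLOCK_MONOTONIC, &t0);
        if (fsync(fd))                  /* the call that can stall for seconds */
            perror("fsync");
        clock_gettime(CLOCK_MONOTONIC, &t1);

        printf("fsync took %.3f seconds\n",
               (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
        close(fd);
        return 0;
    }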
When Ted Ts'o looked into the problem, he noticed an obvious culprit: data sent to the disk via fsync() is put at the back of the I/O scheduler's queue, behind all other outstanding requests. If processes on the system are writing a lot of data, that queue can be quite long, so it takes a long time for fsync() to complete. While that is happening, other parts of the filesystem can stall, eventually bringing much of the system to a halt.
The first fix was to mark I/O requests generated by fsync() with the WRITE_SYNC operation bit, identifying them as synchronous requests. The CFQ I/O scheduler tries to run synchronous requests (which generally have a process waiting for the results) ahead of asynchronous ones (where nobody is waiting for completion). Normally, reads are considered to be synchronous, while writes are not. Once the fsync()-related requests were marked synchronous, they were able to jump ahead of normal I/O; that makes fsync() much faster, at the expense of slowing down the I/O-intensive tasks in the system. Just about everybody involved considers this to be a good tradeoff. (It is amusing to note that this change is conceptually similar to the I/O priority patch posted by Arjan van de Ven some time ago; some ideas take a while to reach acceptance.)
Block subsystem maintainer Jens Axboe disliked the change, stating that it would cause severe performance regressions for some workloads. Linus made it clear, though, that the patch was probably going to go in and that, if the CFQ I/O scheduler could not handle it, there would soon be a change to a different default scheduler. Jens probably would have looked further into the problem in any case, but the extra motivation supplied by Linus is unlikely to have slowed the process down.
The problem, as it turns out, is that WRITE_SYNC actually does two things: putting the request onto the higher-priority synchronous queue, and unplugging the queue. "Plugging" is the technique used by the block layer to issue requests to the underlying disk driver in bursts. Between bursts, the queue is "plugged," causing requests to accumulate there. This accumulation gives the I/O scheduler an opportunity to merge adjacent requests and issue them in some sort of reasonable order. Judicious use of plugging improves block subsystem performance significantly.
Unplugging the queue for a synchronous request can make sense in some situations; if somebody is waiting for the operation, chances are they will not be adding any adjacent requests to the queue, so there is no point in waiting any longer. As it happens, though, fsync() is not one of those situations. Instead, a call to fsync() will usually generate a whole series of synchronous requests, and the chances of those requests being adjacent to each other are fairly good. So unplugging the queue after each synchronous request is likely to make performance worse. Upon identifying this problem, Jens posted a series of patches to fix it. One of them adds a new WRITE_SYNC_PLUG operation which queues a synchronous write without unplugging the queue. That allows an operation like fsync() to create a series of requests, then unplug the queue once at the end.
While he was at it, Jens fixed a couple of related issues. One was that the block subsystem could still, in some situations, run synchronous requests behind asynchronous ones. The code here is a bit tricky, since it may be desirable to let a few asynchronous requests through occasionally to keep them from being starved entirely. Jens changed the balance to ensure that synchronous requests get through in a timely manner.
Beyond that, the CFQ scheduler uses "anticipatory" logic with synchronous requests; upon executing one such request, it will stall the queue to see if an adjacent request shows up. The idea is that the disk head will be ideally positioned to satisfy that request, so the best performance is obtained by not moving it away immediately. This logic can work well for synchronous reads, but it's not helpful when dealing with write operations generated by fsync(). So now there's a new internal flag that prevents anticipation when WRITE_SYNC_PLUG operations are executed.
Linus liked the changes:
It turns out that there's a little more, though. Linus noticed that he was still getting stalls, even if they were much shorter than before, and he wondered why:
The obvious conclusion is that something else was still going on. Linus's hypothesis was that the volume of requests already queued for the drive was large enough to cause stalls even when the synchronous requests go to the front of the queue. With a default configuration, a single request can contain up to 512KB of data; stack up a couple dozen or so of those, and it will take the drive a while to work through them. Linus experimented with setting the maximum request size (controlled by /sys/block/<drive>/queue/max_sectors_kb) to 64KB, and reported that things worked a lot better. As of this writing, though, the default has not been changed; Linus suggested that it might make more sense to cap the total amount of outstanding data, rather than the size of any individual request. More experimentation is called for.
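Readers who want to repeat Linus's experiment can simply write to that sysfs file; the sketch below does it from C, with the caveats that the device name (sda) is only an example, root privileges are needed, and the setting lasts only until the next boot.

    /* Cap the maximum request size for one drive at 64KB, as in Linus's
     * experiment.  Equivalent to: echo 64 > /sys/block/sda/queue/max_sectors_kb
     * The device name is an assumption; adjust it for the drive being tuned. */
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/sys/block/sda/queue/max_sectors_kb", "w");

        if (!f) {
            perror("max_sectors_kb");
            return 1;
        }
        fprintf(f, "64\n");
        return fclose(f) ? 1 : 0;
    }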
There is one other important change needed to get a truly quick fsync() with ext3, though: the filesystem must be mounted in data=writeback mode. That mode eliminates the requirement that data blocks be flushed to disk ahead of the associated metadata; in data=ordered mode, the extra data which must be written out first guarantees that fsync() will always be slower. Switching to data=writeback eliminates those writes, but, in the process, it also turns off the feature which made ext3 look more robust than ext4. Ted Ts'o has mitigated that problem somewhat, though, by adding to ext3 the same safeguards he put into ext4: in some situations (such as when a new file is renamed on top of an existing file), data will be forced out ahead of the metadata. As a result, data loss resulting from a system crash should be less of a problem.
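The rename case mentioned above corresponds to the pattern many applications already use to replace a file's contents safely: write the new contents to a temporary file, fsync() it, then rename() it over the old file. A rough sketch follows; the file names and helper function are illustrative, not taken from any particular application.

    /* Sketch of the replace-via-rename pattern: write a temporary file,
     * force it to disk, then atomically rename it over the old file. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static int replace_file(const char *path, const char *tmp,
                            const char *data, size_t len)
    {
        int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);

        if (fd < 0)
            return -1;
        if (write(fd, data, len) != (ssize_t)len || fsync(fd)) {
            close(fd);
            unlink(tmp);
            return -1;
        }
        close(fd);
        return rename(tmp, path);   /* atomically replaces the old file */
    }

    int main(void)
    {
        const char *cfg = "option = value\n";

        if (replace_file("config", "config.tmp", cfg, strlen(cfg)))
            perror("replace_file");
        return 0;
    }

With a fast fsync() there is little reason for applications to skip that step; the safeguards described above exist for the many programs which do skip it.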
Sidebar: data=guarded
Another alternative to data=ordered may be the data=guarded mode proposed by Chris Mason. This mode delays file-size updates until the corresponding data has been written, preventing information-disclosure problems. It is a very new patch, though, which won't be ready for 2.6.30.
So Ted suggested that, maybe, data=writeback should be made the default. There was some resistance to this idea; not everybody thinks that ext3, at this stage of its life, should see a big option change like that. Linus, however, was unswayed by the arguments. He merged a patch which creates a configuration option for the default ext3 data mode, and set it to "writeback." That will cause ext3 mounts to silently switch to data=writeback mode with 2.6.30 kernels. Says Linus:
It's worth noting that this default will not change anything if (1) the data mode is explicitly specified when the filesystem is mounted, or (2) a different mode has been wired into the filesystem with tune2fs. It will also be ineffective if distributors change it back to "ordered" when configuring their kernels. Some distributors, at least, may well decide that they do not wish to push that kind of change to their users. We will not see the answer to that question for some months yet.
Index entries for this article
Kernel: Filesystems/ext3
Posted Apr 14, 2009 16:50 UTC (Tue)
by PO8 (guest, #41661)
[Link]
The argument that "it's a single-user system, so who cares" seems to me to be crazy talk. Like many Linux users, I run Apache + database instances that allow anonymous users anywhere on the net to access some of my dynamically-updated files. I may be mistaken, but it looks to me like the current data=writeback mode gives an increased opportunity to disclose things like my database's (foolishly) unencrypted password, or my personal email, to the whole web after a crash. Not OK, regardless of the shorter sync times.
Posted Apr 14, 2009 16:56 UTC (Tue)
by nye (guest, #51576)
[Link] (3 responses)
IIUC, I suppose there could be things like *new* files being created with zero size after a crash, rather than not being created at all, which doesn't seem like the end of the world. I admit I haven't actually thought this through very much at all yet, so that could be nonsense for both behaviours. :P
What other reliability/data integrity implications would this have?
Posted Apr 14, 2009 18:09 UTC (Tue)
by corbet (editor, #1)
[Link] (2 responses)
data=ordered forces all data to go out before the metadata is written; in practice, that forces data to be written within five seconds. data=guarded, as I understand it, delays the writing of certain metadata (the file size in particular) until the data has been written. The timing is looser, but it still keeps random junk from showing up within a file after a crash.
Posted Apr 14, 2009 18:51 UTC (Tue)
by masoncl (subscriber, #47138)
[Link]
It still does the old style data=ordered in cases where the file size isn't enough protection (like filling holes).
Posted Apr 16, 2009 15:28 UTC (Thu)
by sandeen (guest, #42852)
[Link]
The rename & truncate hacks will help flush some data, but if you, say, untar a kernel tree and crash, those hacks don't come into play.
data=writeback would wind up garbage in the files, data=guarded would wind up with 0-length or shortened files with no garbage, and data=ordered would likely have more files intact due to the journal transaction commits causing more flushing along the way... and no garbage.
Posted Apr 14, 2009 17:14 UTC (Tue)
by MisterIO (guest, #36192)
[Link] (2 responses)
Posted Apr 14, 2009 20:46 UTC (Tue)
by elanthis (guest, #6227)
[Link] (1 responses)
Posted Apr 16, 2009 19:35 UTC (Thu)
by jordanb (guest, #45668)
[Link]
Posted Apr 14, 2009 17:25 UTC (Tue)
by bronson (subscriber, #4806)
[Link] (1 responses)
It's a little disturbing to see data=writeback becoming the default. Is the performance gain really so great that it's worth being less secure by default?
Posted Apr 15, 2009 9:34 UTC (Wed)
by edschofield (guest, #39993)
[Link]
We have _severe_ latency issues with ext3 and our RAID arrays, sometimes causing our servers to appear to freeze for tens of seconds during disk writes. A safer 'writeback' mode that eliminates these latencies will be a huge win for us.
Posted Apr 14, 2009 17:25 UTC (Tue)
by spot (guest, #15640)
[Link]
Posted Apr 14, 2009 17:27 UTC (Tue)
by yusufg (guest, #407)
[Link]
Posted Apr 14, 2009 17:55 UTC (Tue)
by mgb (guest, #3226)
[Link] (14 responses)
We run a bunch of Linux mail servers, and we ain't the only ones.
Maybe somebody should tell Linus how Linux is being used.
Posted Apr 14, 2009 18:11 UTC (Tue)
by malor (guest, #2973)
[Link] (5 responses)
Posted Apr 14, 2009 20:10 UTC (Tue)
by drag (guest, #31333)
[Link] (4 responses)
That is, they are logged onto the machine and are doing something on it: programming, editing, web browsing, etc.
So you're dealing with computers with multiple monitors and multiple users logged in at once, or LTSP, or people who still sell shell accounts. All of those are fairly rare compared to personal desktops, embedded systems, or most server systems.
Posted Apr 14, 2009 23:24 UTC (Tue)
by ktanzer (guest, #6073)
[Link] (1 responses)
On a more general note, Linux has inherited rich multi-user capabilities from Unix, and I hate to see that atrophy over time. As one example, it is very easy to find information on how to configure popular programs such as Firefox, KDE, OpenOffice, etc. for a single user, but often maddeningly hard to determine how to configure on a system-wide basis. The fact that FF can't run multiple instances from a single profile is not technically a multi-user issue, but also drives me up the wall...
Posted Apr 15, 2009 1:25 UTC (Wed)
by drag (guest, #31333)
[Link]
I don't know exactly why, but I have a feeling that multiuser systems will become increasingly important in the future.
Something will come along... like the acceptance of IPv6 and the decline of the "Personal Computer"-inflicted client-server relationship, and the internet will return to its P2P roots (for reasons of scalability, robustness, and expense). If something like that were to happen, and people realized that mobile computers could become essentially disposable if they turned into little more than terminals for the 'big' computer at home or clusters at work... then multiuser systems could become commonplace again.
Weirder things have happened.
Posted Apr 15, 2009 4:35 UTC (Wed)
by malor (guest, #2973)
[Link]
It would be bad if a bug in the mail server gave access to, say, deleted .htaccess files, or part of a SQL database.
All Unix systems are inherently multiuser, and sabotaging inter-account security features is deliberately cutting away one layer of the net that can catch you if a bug exposes an attack vector.
Posted Apr 15, 2009 5:55 UTC (Wed)
by hawk (subscriber, #3195)
[Link]
Isn't the point that as soon as there are multiple users (no matter if they are logged in using a system account or accessing the system through some other means, e.g. HTTP, authenticated in some way or even anonymously), there is a chance that one user's data (or data "belonging to the system") could leak into a file which will be accessible by another user?
So in the case of the system crashing, a file publicly available on the web, or visible to some logged-in user(s), might end up containing anything that has previously been deallocated if that file was being modified, be it by the site administrator or by a random user on a site where you can, for instance, upload an image to include in your content.
I would think this definitely falls within the "common usage"-realm for Linux systems and that whoever made the argument may not have really thought it through.
(Or I'm just not understanding in what scenarios something like this could actually happen.)
Posted Apr 14, 2009 18:16 UTC (Tue)
by smoogen (subscriber, #97)
[Link] (1 responses)
Also I think it was the editor not Linus who said that.
Posted Apr 17, 2009 18:21 UTC (Fri)
by giraffedata (guest, #1954)
[Link]
I was going to say the opposite. I think you're making a statement about the number of Linux kernel images running, but I don't think that's a useful measure of prevalence as it relates to the cost of assuming a system is single-user.
On the contrary, I believe the great majority of Linux is multiuser servers and the personal computers and appliances you mention are a blip. I'm looking at the amount of filesystem access that happens.
When I say "multiuser" I'm considering a user to be a person, not a uid.
Incidentally, routers and switches (from your list) are multiuser systems. Consequently, there is a security issue in sending data to the wrong user.
Posted Apr 15, 2009 3:53 UTC (Wed)
by eru (subscriber, #2753)
[Link] (4 responses)
The big company I'm working for actually has most of its interactive Linux users on multi-user servers: This is because everyone is "of course" supplied with a Windows PC, but Linux is preferred for software development for several products, so the developers access Linux servers with X11 emulator or VNC running on the PC. This also makes it easier to maintain a consistent development environment for the users. Some people do have Linux workstations, but these are a minority.
I don't know how typical this kind of use is, but I suspect it is common in technology companies needing a Linux development environment for some users but not wanting, or not being able, to go all the way to Linux desktops.
Posted Apr 15, 2009 16:05 UTC (Wed)
by chema (subscriber, #32636)
[Link]
Our development environment is a mixture of Windows desktop PCs (running some development tools + "corporate" applications) and Linux servers (providing: ssh + X11 fwd + http + samba + ...).
It used to be HP-UX/Solaris <-> Windows but we happily migrated to linux a year ago.
Posted Apr 15, 2009 20:18 UTC (Wed)
by PhracturedBlue (subscriber, #4193)
[Link] (2 responses)
Posted Apr 15, 2009 20:29 UTC (Wed)
by mgb (guest, #3226)
[Link]
So yes, we use ext3 for mail servers and web servers etc. All of which are multi-remote-user.
Posted Apr 16, 2009 4:29 UTC (Thu)
by eru (subscriber, #2753)
[Link]
Yes and no: the home directories of the users are normally mounted via NFS (the NFS servers are not always Linux: NetApps and Solaris boxes are also used), but the Linux servers (usually RHEL) to which people log in use ext3. Because of the various local shared directories, multiuser issues in ext3 are still relevant.
Posted Apr 16, 2009 0:46 UTC (Thu)
by hazelsct (guest, #3659)
[Link]
Another "multi-user" use case on my "single-user" laptop is having a separate account for downloaded software (government contracts sometimes require such things), for which I don't want to take the risk of polluting the rest of my system. I install in wine in this separate account, and share (minimal) data as necessary. Security matters.
[OT: it's astounding to me how well wine runs a *lot* of Windoze software these days!!]
Posted Apr 14, 2009 18:40 UTC (Tue)
by mrshiny (subscriber, #4266)
[Link] (5 responses)
Posted Apr 14, 2009 20:48 UTC (Tue)
by elanthis (guest, #6227)
[Link]
Posted Apr 16, 2009 5:22 UTC (Thu)
by butlerm (subscriber, #13312)
[Link] (3 responses)
data=ordered is stricter than necessary to provide reasonable recovery behavior in most cases. A strict interpretation of data=ordered means committing dirty data to disk before any meta data updates. That means that calling fsync on any file with dirty buffers is equivalent in cost to calling fsync on every file with dirty buffers in the filesystem.
Since data=ordered tends to interfere with getting real work done without stalling, the question is what kinds of relaxations can be made without imperiling the integrity of your filesystem. "data=writeback" is the no holds barred, assume your system is never going to crash, tough luck for any recently touched files, but you probably won't have to spend hours waiting for fsck sort of preference.
Fortunately, there is a lot of room for reasonable, safer relaxations between data=ordered and data=writeback. data=guarded is one such option that allows preliminary meta data commits for unrelated files to proceed with a smaller file size corresponding to the file data that has actually been written to disk. That works really well as long as you are not trying to replace an existing file. If you are doing rename replacements, the same problem comes back to haunt you in a way that data=guarded doesn't solve. (Rename undo would...)
Posted Apr 16, 2009 9:42 UTC (Thu)
by nye (guest, #51576)
[Link]
Posted Apr 17, 2009 14:03 UTC (Fri)
by anton (subscriber, #25547)
[Link] (1 responses)
Until I get that, I'll just go for data=ordered and hope that the
Linux developers don't break it like they did with data=journal.
Posted Nov 10, 2009 12:00 UTC (Tue)
by schabi (guest, #14079)
[Link]
Posted Apr 16, 2009 15:18 UTC (Thu)
by sandeen (guest, #42852)
[Link] (17 responses)
Actually, you won't have to wait that long for some. As long as data=writeback introduces a security hole by exposing other people's data on a crash[1], Fedora will not be shipping this way. Rawhide has already turned
CONFIG_EXT3_DEFAULTS_TO_ORDERED=y
on. Shipping any other default would be irresponsible.
Posted Apr 16, 2009 19:15 UTC (Thu)
by chad.netzer (subscriber, #4257)
[Link] (13 responses)
Posted Apr 16, 2009 21:15 UTC (Thu)
by dtlin (subscriber, #36537)
[Link] (12 responses)
Posted Apr 16, 2009 22:23 UTC (Thu)
by chad.netzer (subscriber, #4257)
[Link] (11 responses)
BTW, Documentation/filesystems/ext4.txt in current linux repo seems to contradict your statement. I can understand how delayed allocation can affect the situation (since certain data need never be written to the disk at all, even if metadata changes), but for allocated data, how does the ext4 situation differ from ext3 writeback mode?
http://lwn.net/Articles/203915/
Posted Apr 17, 2009 6:52 UTC (Fri)
by bojan (subscriber, #14302)
[Link]
Yeah, confusing, isn't it? Relevant part of the docs, diffed:

     * writeback mode
    -In data=writeback mode, ext3 does not journal data at all. This mode provides
    +In data=writeback mode, ext4 does not journal data at all. This mode provides
     a similar level of journaling as that of XFS, JFS, and ReiserFS in its default
     mode - metadata journaling. A crash+recovery can cause incorrect data to
     appear in files which were written shortly before the crash. This mode will
    -typically provide the best ext3 performance.
    +typically provide the best ext4 performance.

It would be really good if Ted could comment on whether the above was simply copied from the ext3 docs or whether it is really still true for ext4 in writeback mode as well.
Posted Apr 18, 2009 16:14 UTC (Sat)
by sbergman27 (guest, #10767)
[Link] (3 responses)
"If you dont need the security guarantees of what happens after a crash that are provided by data=ordered, try using the data=writeback mount option."
Posted Apr 18, 2009 23:22 UTC (Sat)
by bojan (subscriber, #14302)
[Link] (2 responses)
Compare that to this comment: Contradictory, isn't it?
Posted Apr 19, 2009 0:59 UTC (Sun)
by sbergman27 (guest, #10767)
[Link]
On a related note, if he thinks that writeback is good enough for ext3 because, after all, nobody runs Linux with multiple users... then is writeback also destined to be the default for ext4? Or is the idea to destabilize the thus far rock solid ext3 enough to make ext4 look better by comparison?
Posted Apr 19, 2009 2:56 UTC (Sun)
by sitaram (guest, #5959)
[Link]
So I'd see this as "delayed allocation makes ordered almost as efficient as writeback", not "...makes writeback as secure as ordered"
Posted Apr 19, 2009 4:27 UTC (Sun)
by tytso (subscriber, #9993)
[Link] (5 responses)
In the data loss department: if you have an application that didn't use fsync() and the system crashes, with data=writeback there is the chance of data loss. In 2.6.30, Linus accepted patches which will cause an implied flush operation when a heuristic detects an application trying to replace an existing file via the replace-via-truncate or replace-via-rename patterns. This largely reduces the problems for non-fsync-using applications. It doesn't solve the problem for a freshly written file, but the system could just as easily have crashed five seconds earlier.
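(For readers unfamiliar with the two patterns named above, here is a bare-bones illustration; it is an editorial sketch, not code from the comment or from the kernel patches, and the file names are examples. Both patterns, when used without fsync(), risked leaving an empty or truncated file after a crash, which is what the 2.6.30 heuristic addresses.)

    /* Illustrative only: the two file-replacement patterns the heuristic
     * detects. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* replace-via-truncate: rewrite the existing file in place */
    static void replace_via_truncate(const char *path, const char *data)
    {
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

        if (fd < 0)
            return;
        if (write(fd, data, strlen(data)) < 0)
            perror("write");
        close(fd);                  /* note: no fsync() before closing */
    }

    /* replace-via-rename: write a new file, then rename it over the old one */
    static void replace_via_rename(const char *path, const char *data)
    {
        char tmp[256];
        int fd;

        snprintf(tmp, sizeof(tmp), "%s.tmp", path);
        fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return;
        if (write(fd, data, strlen(data)) < 0)
            perror("write");
        close(fd);
        rename(tmp, path);          /* again, no explicit fsync() */
    }

    int main(void)
    {
        replace_via_truncate("settings.conf", "mode = writeback\n");
        replace_via_rename("settings.conf", "mode = ordered\n");
        return 0;
    }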
OK, so how does ext4 change things? By default, ext4 on modern kernels (ignoring the technology preview in RHEL 5 and Fedora 10) performs delayed allocation. This means that data blocks are not allocated right away when you write a file, but only when they are forced out, either explicitly via fsync() or via the page writeback algorithms in the VM, which will tend to push things out after 30-45 seconds (ignoring laptop mode), and perhaps sooner if the system is short on memory.
In the security dimension, what this means is that even in data=writeback mode, on a crash the file will in general be truncated or zero-length instead of containing uninitialized data. In ext4 with delayed allocation and data=writeback, there *is* a very tiny race condition: if a transaction closes right after the pdflush daemon allocates the filesystem blocks but before it has a chance to trigger the page writeback, you might end up with uninitialized garbage. This chance is very small, but it is non-zero. In this case, ext4 data=ordered will force the write to disk, so it is technically safer in the security dimension, although this race is very hard to exploit and very rarely gets hit in practice. (This is also why the overhead of data=ordered versus data=writeback is much less for ext4, thanks to delayed allocation --- the difference between the two is not the same, however!)
In the safety against applications that don't use fsync department, as of 2.6.30, ext4 will always do an implied allocation and flush for data=ordered and data=writeback. So there is no real difference here between data=ordered and data=writeback.
The bottom line is that while there is some performance benefit in going with data=writeback with ext4, the differences between data=ordered and data=writeback are much smaller with ext4, in both the cost and benefit dimensions.
Chris Mason is also working on a data=guarded mode, which will cause files to be truncated (much as with delayed allocation) on a crash with ext3. I will look into porting this mode to ext4 if it proves to give enough of a performance advantage over data=ordered while providing a bit more safety than data=writeback. It's not clear to me that it will be worth it for ext4, however.
I hope this helps answer the questions about ext3 versus ext4, and data=ordered versus data=writeback.
Regards,
Ted.
Posted Apr 19, 2009 8:07 UTC (Sun)
by sitaram (guest, #5959)
[Link]
I'm one of those people for whom the security aspect is far more important (*) than data loss -- data loss can happen for so many other reasons that one should have a good, reliable, backup regime anyway, so one more reason doesn't bother me.
So ext3: people with my mindset should stick with data=ordered. (I don't see guarded as being too useful for ext3 -- we'll probably have switched to ext4 by the time guarded becomes mainstream).
Ext4: I think I'll stick with ordered here too. If the overhead has been much reduced by delayed alloc, it correspondingly reduces the main advantage of writeback too :-) I'd rather err on the side of security when the difference is minor.
Although collectively we like choice, and we *need* choice, when it comes to actual usage, we have to rationally reduce the many choices available into one and say "*this* is what we will use"!
Thanks once again for jumping in and helping with that!
Sitaram
(*) My home desktop is used by my kids also, for instance -- so it *is* a multi-user machine in the old traditional sense. The work machine runs email and office apps as one user, and my web browser and IRC as another user (simultaneously), so -- while both users are still me -- it too is multi user in the sense of wanting to keep two disparate sets of files separate.
Posted Apr 19, 2009 22:35 UTC (Sun)
by bojan (subscriber, #14302)
[Link]
Posted Apr 19, 2009 22:50 UTC (Sun)
by bojan (subscriber, #14302)
[Link] (2 responses)
Unless there is a significant performance penalty by updating metadata only after the data has been written, instead of having another mode, this is probably how writeback mode should work.
Posted Apr 20, 2009 0:29 UTC (Mon)
by tytso (subscriber, #9993)
[Link] (1 responses)
> Unless there is a significant performance penalty by updating metadata only after the data has been written, instead of having another mode, this is probably how writeback mode should work.

(Note that data=guarded is only deferring the update of i_size, and not any other form of metadata.) We'll have to benchmark it and see. It does mean that i_size gets updated more, and so that means that the inode has to get updated as blocks are staged out to disk, so that means some extra writes to the journal and inode table. I don't think it should be noticeable, at least for most workloads, since it should be lost in the noise of the data block I/O, but it is extra seeks and extra writes.
Posted Apr 20, 2009 3:43 UTC (Mon)
by bojan (subscriber, #14302)
[Link]
I guess if i_size could be updated just once, when all the blocks are pushed out, then this would be even less of a problem. But, then again, I have no idea how this actually works inside the code, so this suggestion is probably naive.
Posted Aug 5, 2009 1:08 UTC (Wed)
by mdkul (guest, #35333)
[Link] (2 responses)
My tune2fs -l /dev/sda1 says:

Default mount options:    (none)

So what does it default to if it is none? I *don't* have CONFIG_EXT3_DEFAULTS_TO_ORDERED=y set, and my documentation for 2.6.31-rc4 (Documentation/filesystems/ext3.txt) says data=ordered is the default.
Thanks in advance.
Posted Aug 5, 2009 14:16 UTC (Wed)
by ABCD (subscriber, #53650)
[Link] (1 responses)
Posted Aug 5, 2009 16:52 UTC (Wed)
by mdkul (guest, #35333)
[Link]
Posted Apr 25, 2009 10:31 UTC (Sat)
by bluss (subscriber, #47454)
[Link] (1 responses)
Posted Apr 26, 2009 7:11 UTC (Sun)
by dpotapov (guest, #46495)
[Link]
Solving the ext3 latency problem
data=guarded sounds rather interesting, but I'm not sure I understand how it differs from data=ordered. In what situations could the result be different?
Solving the ext3 latency problem
nowadays it is a blip in the data compared to routers, switches, music boxes, GPS systems, laptops, desktops, workstations that run Linux.
Multiuser quite important still: remote users on Windows PC:s
Yes, but are you using EXT3 as the filesystem on those machines? In many (most?) multi-user systems, you're likely to have a big fileserver serving files via NFS or equivalent to various servers.
Solving the ext3 latency problem
I don't understand this:
There is one other important change needed to get a truly quick fsync() with ext3, though: the filesystem must be mounted in data=writeback mode.
Is this because the changes to fsync are disabled in data=ordered, or just because the performance gains are small compared to the overhead of data=ordered?
I'm curious because if fsync is slow application developers won't use it, even if on some systems it's fast. It will be years before application developers start using it "properly" again.
Solving the ext3 latency problem
> A strict interpretation of data=ordered means committing dirty data to disk before any meta data updates.

I'm not sure I agree, but anyway, if it behaves that way, that's fine with me. I like my data not only on disk, but also internally consistent.

> "data=writeback" is the no holds barred assume your system is never going to crash [...] sort of preference.

But if I assume my system is never going to crash, why would I be using fsync()? And why should a file system that works based on that assumption do anything when the application calls fsync()?

> Fortunately, there is a lot of room for reasonable, safer relaxations between data=ordered and data=writeback.

I would actually prefer to see something stricter than data=ordered. Something that gives me the guarantee that the state after a crash corresponds to some logical state of the file system before the crash.
Solving the ext3 latency problem
> I would actually prefer to see something stricter than data=ordered. Something that gives me the guarantee that the state after a crash corresponds to some logical state of the file system before the crash.
You always have the option to mount with "data=journal" - this is the safest and slowest mode with ext3.
And don't forget that RAID5 / RAID6 will break all barrier / journal semantics for all filesystems.
Solving the ext3 latency problem
>> push that kind of change to their users. We will not see the answer to
>> that question for some months yet.
Solving the ext3 latency problem
ext3's does.
Solving the ext3 latency problem
Fundamentally, the problem is caused by data=ordered mode. It can be avoided by mounting the filesystem with data=writeback or by using a filesystem that supports delayed allocation, such as ext4. This is because if you have a small sqlite database which you are fsync()ing, and in another process you are writing a large 2 megabyte file, the 2 megabyte file won't be allocated right away, and so the fsync() operation will not force the dirty blocks of that 2 megabyte file to disk; since the blocks haven't been allocated yet, there is no security issue to worry about with the previous contents of newly allocated blocks if the system were to crash at that point.