Solving the ext3 latency problem
The problem, in short, is this: the ext3 filesystem, when running in the default data=ordered mode, can exhibit lengthy stalls when a process calls fsync() to flush data to disk. This issue most famously manifested itself as the much-lamented Firefox system-freeze problem, but it goes well beyond Firefox: any time there is reasonably heavy I/O going on, an fsync() call can bring everything to a halt for several seconds, and stalls on the order of minutes have been reported. This behavior has tended to discourage the use of fsync() in applications, and it makes the Linux desktop less fun to use. It is clearly worth fixing, but nobody did so for years.
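To make the problem concrete, here is a minimal sketch, not taken from the article or the patches under discussion, of how the latency can be observed: a small C program that times a single fsync() call (link with -lrt on older glibc). Run it while something else is writing heavily to the same ext3 filesystem in data=ordered mode and it will report stalls like those described above; the file name is just an example.

    /* Time one fsync() on a small file; run alongside a heavy writer on the
     * same filesystem to observe the stall.  Illustrative sketch only. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    int main(void)
    {
        const char buf[] = "small update\n";
        struct timespec t0, t1;
        int fd = open("testfile", O_WRONLY | O_CREAT | O_APPEND, 0644);

        if (fd < 0) {
            perror("open");
            return 1;
        }
        if (write(fd, buf, strlen(buf)) < 0)
            perror("write");

        clock_gettime(CLOCK_MONOTONIC, &t0);
        if (fsync(fd))                  /* the call that can stall for seconds */
            perror("fsync");
        clock_gettime(CLOCK_MONOTONIC, &t1);

        printf("fsync took %.3f seconds\n",
               (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
        close(fd);
        return 0;
    }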
When Ted Ts'o looked into the problem, he noticed an obvious culprit: data sent to the disk via fsync() is put at the back of the I/O scheduler's queue, behind all other outstanding requests. If processes on the system are writing a lot of data, that queue can be quite long, so it takes a long time for fsync() to complete. While that is happening, other parts of the filesystem can stall, eventually bringing much of the system to a halt.
The first fix was to mark I/O requests generated by fsync() with the WRITE_SYNC operation bit, identifying them as synchronous requests. The CFQ I/O scheduler tries to run synchronous requests (which generally have a process waiting for the results) ahead of asynchronous ones (where nobody is waiting for completion). Normally, reads are considered to be synchronous, while writes are not. Once the fsync()-related requests were marked synchronous, they were able to jump ahead of normal I/O; that makes fsync() much faster, at the expense of slowing down the I/O-intensive tasks in the system. Just about everybody involved considers this to be a good tradeoff. (It is amusing to note that this change is conceptually similar to the I/O priority patch posted by Arjan van de Ven some time ago; some ideas take a while to reach acceptance.)
Block subsystem maintainer Jens Axboe disliked the change, stating that it would cause severe performance regressions for some workloads. Linus made it clear, though, that the patch was probably going to go in and that, if the CFQ I/O scheduler could not handle it, there would soon be a change to a different default scheduler. Jens probably would have looked further into the problem in any case, but the extra motivation supplied by Linus is unlikely to have slowed the process down.
The problem, as it turns out, is that WRITE_SYNC actually does two things: putting the request onto the higher-priority synchronous queue, and unplugging the queue. "Plugging" is the technique used by the block layer to issue requests to the underlying disk driver in bursts. Between bursts, the queue is "plugged," causing requests to accumulate there. This accumulation gives the I/O scheduler an opportunity to merge adjacent requests and issue them in some sort of reasonable order. Judicious use of plugging improves block subsystem performance significantly.
Unplugging the queue for a synchronous request can make sense in some situations; if somebody is waiting for the operation, chances are they will not be adding any adjacent requests to the queue, so there is no point in waiting any longer. As it happens, though, fsync() is not one of those situations. Instead, a call to fsync() will usually generate a whole series of synchronous requests, and the chances of those requests being adjacent to each other are fairly good. So unplugging the queue after each synchronous request is likely to make performance worse. Upon identifying this problem, Jens posted a series of patches to fix it. One of them adds a new WRITE_SYNC_PLUG operation which queues a synchronous write without unplugging the queue. That allows an operation like fsync() to create a series of requests, then unplug the queue once at the end.
While he was at it, Jens fixed a couple of related issues. One was that the block subsystem could still, in some situations, run synchronous requests behind asynchronous ones. The code here is a bit tricky, since it may be desirable to let a few asynchronous requests through occasionally to keep them from being starved entirely. Jens changed the balance to ensure that synchronous requests get through in a timely manner.
Beyond that, the CFQ scheduler uses "anticipatory" logic with synchronous requests; upon executing one such request, it will stall the queue to see if an adjacent request shows up. The idea is that the disk head will be ideally positioned to satisfy that request, so the best performance is obtained by not moving it away immediately. This logic can work well for synchronous reads, but it's not helpful when dealing with write operations generated by fsync(). So now there's a new internal flag that prevents anticipation when WRITE_SYNC_PLUG operations are executed.
Linus liked the changes:
It turns out that there's a little more, though. Linus noticed that he was still getting stalls, even if they were much shorter than before, and he wondered why:
The obvious conclusion is that something else was still going on. Linus's hypothesis was that the volume of requests already queued for the drive was large enough to cause stalls even when the synchronous requests go to the front of the queue. With a default configuration, a single request can contain up to 512KB of data; stack up a couple dozen or so of those, and it will take the drive a while to work through them. Linus experimented with setting the maximum request size (controlled by /sys/block/<drive>/queue/max_sectors_kb) to 64KB, and reported that things worked a lot better. As of this writing, though, the default has not been changed; Linus suggested that it might make more sense to cap the total amount of outstanding data, rather than the size of any individual request. More experimentation is called for.
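Readers who want to repeat Linus's experiment can simply write to that sysfs file; the sketch below does it from C, with the caveats that the device name (sda) is only an example, root privileges are needed, and the setting lasts only until the next boot.

    /* Cap the maximum request size for one drive at 64KB, as in Linus's
     * experiment.  Equivalent to: echo 64 > /sys/block/sda/queue/max_sectors_kb
     * The device name is an assumption; adjust it for the drive being tuned. */
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/sys/block/sda/queue/max_sectors_kb", "w");

        if (!f) {
            perror("max_sectors_kb");
            return 1;
        }
        fprintf(f, "64\n");
        return fclose(f) ? 1 : 0;
    }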
There is one other important change needed to get a truly quick fsync() with ext3, though: the filesystem must be mounted in data=writeback mode. That mode eliminates the requirement that data blocks be flushed to disk ahead of the associated metadata; in data=ordered mode, the extra data which must be written out first guarantees that fsync() will always be slower. Switching to data=writeback eliminates those writes, but, in the process, it also turns off the feature which made ext3 look more robust than ext4. Ted Ts'o has mitigated that problem somewhat, though, by adding to ext3 the same safeguards he put into ext4: in some situations (such as when a new file is renamed on top of an existing file), data will be forced out ahead of the metadata. As a result, data loss resulting from a system crash should be less of a problem.
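The rename case mentioned above corresponds to the pattern many applications already use to replace a file's contents safely: write the new contents to a temporary file, fsync() it, then rename() it over the old file. A rough sketch follows; the file names and helper function are illustrative, not taken from any particular application.

    /* Sketch of the replace-via-rename pattern: write a temporary file,
     * force it to disk, then atomically rename it over the old file. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static int replace_file(const char *path, const char *tmp,
                            const char *data, size_t len)
    {
        int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);

        if (fd < 0)
            return -1;
        if (write(fd, data, len) != (ssize_t)len || fsync(fd)) {
            close(fd);
            unlink(tmp);
            return -1;
        }
        close(fd);
        return rename(tmp, path);   /* atomically replaces the old file */
    }

    int main(void)
    {
        const char *cfg = "option = value\n";

        if (replace_file("config", "config.tmp", cfg, strlen(cfg)))
            perror("replace_file");
        return 0;
    }

With a fast fsync() there is little reason for applications to skip that step; the safeguards described above exist for the many programs which do skip it.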
Sidebar: data=guarded
Another alternative to data=ordered may be the data=guarded mode proposed by Chris Mason. This mode delays file-size updates until the corresponding data has been written, preventing information-disclosure problems. It is a very new patch, though, which won't be ready for 2.6.30.
So Ted suggested that, maybe, data=writeback should be made the default. There was some resistance to this idea; not everybody thinks that ext3, at this stage of its life, should see a big option change like that. Linus, however, was unswayed by the arguments. He merged a patch which creates a configuration option for the default ext3 data mode, and set it to "writeback." That will cause ext3 mounts to silently switch to data=writeback mode with 2.6.30 kernels. Says Linus:
It's worth noting that this default will not change anything if (1) the data mode is explicitly specified when the filesystem is mounted, or (2) a different mode has been wired into the filesystem with tune2fs. It will also be ineffective if distributors change it back to "ordered" when configuring their kernels. Some distributors, at least, may well decide that they do not wish to push that kind of change to their users. We will not see the answer to that question for some months yet.
Index entries for this article
Kernel: Filesystems/ext3
Posted Apr 14, 2009 16:50 UTC (Tue)
by PO8 (guest, #41661)
[Link]
The argument that "it's a single-user system, so who cares" seems to me to be crazy talk. Like many Linux users, I run Apache + database instances that allow anonymous users anywhere on the net to access some of my dynamically-updated files. I may be mistaken, but it looks to me like the current data=writeback mode gives an increased opportunity to disclose things like my database's (foolishly) unencrypted password, or my personal email, to the whole web after a crash. Not OK, regardless of the shorter sync times.
Posted Apr 14, 2009 16:56 UTC (Tue)
by nye (guest, #51576)
[Link] (3 responses)
IIUC, I suppose there could be things like *new* files being created with zero size after a crash, rather than not being created at all, which doesn't seem like the end of the world. I admit I haven't actually thought this through very much at all yet, so that could be nonsense for both behaviours. :P
What other reliability/data integrity implications would this have?
Posted Apr 14, 2009 18:09 UTC (Tue)
by corbet (editor, #1)
[Link] (2 responses)
data=ordered forces all data to go out before the metadata is written; in practice, that forces data to be written within five seconds. data=guarded, as I understand it, delays the writing of certain metadata (the file size in particular) until the data has been written. The timing is looser, but it still keeps random junk from showing up within a file after a crash.
Posted Apr 14, 2009 18:51 UTC (Tue)
by masoncl (subscriber, #47138)
[Link]
It still does the old style data=ordered in cases where the file size isn't enough protection (like filling holes).
Posted Apr 16, 2009 15:28 UTC (Thu)
by sandeen (guest, #42852)
[Link]
The rename & truncate hacks will help flush some data, but if you, say, untar a kernel tree and crash, those hacks don't come into play.
data=writeback would wind up garbage in the files, data=guarded would wind up with 0-length or shortened files with no garbage, and data=ordered would likely have more files intact due to the journal transaction commits causing more flushing along the way... and no garbage.
Posted Apr 14, 2009 17:14 UTC (Tue)
by MisterIO (guest, #36192)
[Link] (2 responses)
Posted Apr 14, 2009 20:46 UTC (Tue)
by elanthis (guest, #6227)
[Link] (1 responses)
Posted Apr 16, 2009 19:35 UTC (Thu)
by jordanb (guest, #45668)
[Link]
Posted Apr 14, 2009 17:25 UTC (Tue)
by bronson (subscriber, #4806)
[Link] (1 responses)
It's a little disturbing to see data=writeback becoming the default. Is the performance gain really so great that it's worth being less secure by default?
Posted Apr 15, 2009 9:34 UTC (Wed)
by edschofield (guest, #39993)
[Link]
We have _severe_ latency issues with ext3 and our RAID arrays, sometimes causing our servers to appear to freeze for tens of seconds during disk writes. A safer 'writeback' mode that eliminates these latencies will be a huge win for us.
Posted Apr 14, 2009 17:25 UTC (Tue)
by spot (guest, #15640)
[Link]
Posted Apr 14, 2009 17:27 UTC (Tue)
by yusufg (guest, #407)
[Link]
Posted Apr 14, 2009 17:55 UTC (Tue)
by mgb (guest, #3226)
[Link] (14 responses)
We run a bunch of Linux mail servers, and we ain't the only ones.
Maybe somebody should tell Linus how Linux is being used.
Posted Apr 14, 2009 18:11 UTC (Tue)
by malor (guest, #2973)
[Link] (5 responses)
Posted Apr 14, 2009 20:10 UTC (Tue)
by drag (guest, #31333)
[Link] (4 responses)
That is, they are logged onto the machine and are doing something on it: programming, editing, web browsing, etc.
So you're dealing with computers with multiple monitors and multiple users logged in at once, or LTSP, or people who still sell shell accounts. All of those are fairly rare compared to personal desktops, embedded systems, or most server systems.
Posted Apr 14, 2009 23:24 UTC (Tue)
by ktanzer (guest, #6073)
[Link] (1 responses)
On a more general note, Linux has inherited rich multi-user capabilities from Unix, and I hate to see that atrophy over time. As one example, it is very easy to find information on how to configure popular programs such as Firefox, KDE, OpenOffice, etc. for a single user, but often maddeningly hard to determine how to configure on a system-wide basis. The fact that FF can't run multiple instances from a single profile is not technically a multi-user issue, but also drives me up the wall...
Posted Apr 15, 2009 1:25 UTC (Wed)
by drag (guest, #31333)
[Link]
I don't know exactly why, but I have a feeling that multiuser systems will become increasingly important in the future.
Something will come along... like the acceptance of IPv6 and the decline of the "Personal Computer"-inflicted client-server relationship, and the internet will return to its P2P roots (for reasons of scalability, robustness, and expense). If something like that were to happen, and people realized that mobile computers could become essentially disposable if they turned into little more than terminals for the 'big' computer at home or clusters at work... then multiuser systems could become commonplace again.
Weirder things have happened.
Posted Apr 15, 2009 4:35 UTC (Wed)
by malor (guest, #2973)
[Link]
It would be bad if a bug in the mail server gave access to, say, deleted .htaccess files, or part of a SQL database.
All Unix systems are inherently multiuser, and sabotaging inter-account security features is deliberately cutting away one layer of the net that can catch you if a bug exposes an attack vector.
Posted Apr 15, 2009 5:55 UTC (Wed)
by hawk (subscriber, #3195)
[Link]
Isn't the point that as soon as there are multiple users (no matter if they are logged in using a system account or accessing the system through some other means, e.g. HTTP, authenticated in some way or even anonymously), there is a chance that one user's data (or data "belonging to the system") could leak into a file which will be accessible by another user?
So in the case of the system crashing, a file publicly available on the web, or visible to some logged-in user(s), might end up containing anything that has previously been deallocated if that file was being modified, be it by the site administrator or by a random user on a site where you can, for instance, upload an image to include in your content.
I would think this definitely falls within the "common usage"-realm for Linux systems and that whoever made the argument may not have really thought it through.
(Or I'm just not understanding in what scenarios something like this could actually happen.)
Posted Apr 14, 2009 18:16 UTC (Tue)
by smoogen (subscriber, #97)
[Link] (1 responses)
Also I think it was the editor not Linus who said that.
Posted Apr 17, 2009 18:21 UTC (Fri)
by giraffedata (guest, #1954)
[Link]
I was going to say the opposite. I think you're making a statement about the number of Linux kernel images running, but I don't think that's a useful measure of prevalence as it relates to the cost of assuming a system is single-user.
On the contrary, I believe the great majority of Linux is multiuser servers and the personal computers and appliances you mention are a blip. I'm looking at the amount of filesystem access that happens.
When I say "multiuser" I'm considering a user to be a person, not a uid.
Incidentally, routers and switches (from your list) are multiuser systems. Consequently, there is a security issue in sending data to the wrong user.
Posted Apr 15, 2009 3:53 UTC (Wed)
by eru (subscriber, #2753)
[Link] (4 responses)
The big company I'm working for actually has most of its interactive Linux users on multi-user servers: This is because everyone is "of course" supplied with a Windows PC, but Linux is preferred for software development for several products, so the developers access Linux servers with X11 emulator or VNC running on the PC. This also makes it easier to maintain a consistent development environment for the users. Some people do have Linux workstations, but these are a minority.
I don't know how typical this kind of use is, but I suspect it is common in technology companies needing a Linux development environment for some users but not wanting, or not being able, to go all the way to Linux desktops.
Posted Apr 15, 2009 16:05 UTC (Wed)
by chema (subscriber, #32636)
[Link]
Our development environment is a mixture of Windows desktop PCs (running some development tools + "corporate" applications) and Linux servers (providing: ssh + X11 fwd + http + samba + ...).
It used to be HP-UX/Solaris <-> Windows but we happily migrated to linux a year ago.
Posted Apr 15, 2009 20:18 UTC (Wed)
by PhracturedBlue (subscriber, #4193)
[Link] (2 responses)
Posted Apr 15, 2009 20:29 UTC (Wed)
by mgb (guest, #3226)
[Link]
So yes, we use ext3 for mail servers and web servers etc. All of which are multi-remote-user.
Posted Apr 16, 2009 4:29 UTC (Thu)
by eru (subscriber, #2753)
[Link]
Yes and no: the home directories of the users are normally mounted via NFS (the NFS servers are not always Linux: NetApps and Solaris boxes are also used), but the Linux servers (usually RHEL) to which people log in use ext3. Because of the various local shared directories, multiuser issues in ext3 are still relevant.
Posted Apr 16, 2009 0:46 UTC (Thu)
by hazelsct (guest, #3659)
[Link]
Another "multi-user" use case on my "single-user" laptop is having a separate account for downloaded software (government contracts sometimes require such things), for which I don't want to take the risk of polluting the rest of my system. I install in wine in this separate account, and share (minimal) data as necessary. Security matters.
[OT: it's astounding to me how well wine runs a *lot* of Windoze software these days!!]
Posted Apr 14, 2009 18:40 UTC (Tue)
by mrshiny (subscriber, #4266)
[Link] (5 responses)
Posted Apr 14, 2009 20:48 UTC (Tue)
by elanthis (guest, #6227)
[Link]
Posted Apr 16, 2009 5:22 UTC (Thu)
by butlerm (subscriber, #13312)
[Link] (3 responses)
data=ordered is stricter than necessary to provide reasonable recovery behavior in most cases. A strict interpretation of data=ordered means committing dirty data to disk before any meta data updates. That means that calling fsync on any file with dirty buffers is equivalent in cost to calling fsync on every file with dirty buffers in the filesystem.
Since data=ordered tends to interfere with getting real work done without stalling, the question is what kinds of relaxations can be made without imperiling the integrity of your filesystem. "data=writeback" is the no holds barred, assume your system is never going to crash, tough luck for any recently touched files, but you probably won't have to spend hours waiting for fsck sort of preference.
Fortunately, there is a lot of room for reasonable, safer relaxations between data=ordered and data=writeback. data=guarded is one such option that allows preliminary meta data commits for unrelated files to proceed with a smaller file size corresponding to the file data that has actually been written to disk. That works really well as long as you are not trying to replace an existing file. If you are doing rename replacements, the same problem comes back to haunt you in a way that data=guarded doesn't solve. (Rename undo would...)
Posted Apr 16, 2009 9:42 UTC (Thu)
by nye (guest, #51576)
[Link]
Posted Apr 17, 2009 14:03 UTC (Fri)
by anton (subscriber, #25547)
[Link] (1 responses)
Until I get that, I'll just go for data=ordered and hope that the
Linux developers don't break it like they did with data=journal.
Posted Nov 10, 2009 12:00 UTC (Tue)
by schabi (guest, #14079)
[Link]
Posted Apr 16, 2009 15:18 UTC (Thu)
by sandeen (guest, #42852)
[Link] (17 responses)
Actually, you won't have to wait that long for some. As long as data=writeback introduces a security hole by exposing other people's data on a crash[1], Fedora will not be shipping this way. Rawhide has already turned
CONFIG_EXT3_DEFAULTS_TO_ORDERED=y
on. Shipping any other default would be irresponsible.
Posted Apr 16, 2009 19:15 UTC (Thu)
by chad.netzer (subscriber, #4257)
[Link] (13 responses)
Posted Apr 16, 2009 21:15 UTC (Thu)
by dtlin (subscriber, #36537)
[Link] (12 responses)
Posted Apr 16, 2009 22:23 UTC (Thu)
by chad.netzer (subscriber, #4257)
[Link] (11 responses)
BTW, Documentation/filesystems/ext4.txt in current linux repo seems to contradict your statement. I can understand how delayed allocation can affect the situation (since certain data need never be written to the disk at all, even if metadata changes), but for allocated data, how does the ext4 situation differ from ext3 writeback mode?
http://lwn.net/Articles/203915/
Posted Apr 17, 2009 6:52 UTC (Fri)
by bojan (subscriber, #14302)
[Link]
Yeah, confusing, isn't it? Relevant part of the docs, diffed:

     * writeback mode
    -In data=writeback mode, ext3 does not journal data at all. This mode provides
    +In data=writeback mode, ext4 does not journal data at all. This mode provides
     a similar level of journaling as that of XFS, JFS, and ReiserFS in its default
     mode - metadata journaling. A crash+recovery can cause incorrect data to
     appear in files which were written shortly before the crash. This mode will
    -typically provide the best ext3 performance.
    +typically provide the best ext4 performance.

It would be really good if Ted could comment on whether the above was simply copied from the ext3 docs or whether it is really still true for ext4 in writeback mode as well.
Posted Apr 18, 2009 16:14 UTC (Sat)
by sbergman27 (guest, #10767)
[Link] (3 responses)
"If you dont need the security guarantees of what happens after a crash that are provided by data=ordered, try using the data=writeback mount option."
Posted Apr 18, 2009 23:22 UTC (Sat)
by bojan (subscriber, #14302)
[Link] (2 responses)
Compare that to this comment: Contradictory, isn't it?
Posted Apr 19, 2009 0:59 UTC (Sun)
by sbergman27 (guest, #10767)
[Link]
On a related note, if he thinks that writeback is good enough for ext3 because, after all, nobody runs Linux with multiple users... then is writeback also destined to be the default for ext4? Or is the idea to destabilize the thus far rock solid ext3 enough to make ext4 look better by comparison?
Posted Apr 19, 2009 2:56 UTC (Sun)
by sitaram (guest, #5959)
[Link]
So I'd see this as "delayed allocation makes ordered almost as efficient as writeback", not "...makes writeback as secure as ordered"
Posted Apr 19, 2009 4:27 UTC (Sun)
by tytso (subscriber, #9993)
[Link] (5 responses)
In the data loss department: if you have an application that didn't use fsync() and the system crashes, with data=writeback there is the chance of data loss. In 2.6.30, Linus accepted patches which will cause an implied flush operation when a heuristic detects an application trying to replace an existing file via the replace-via-truncate or replace-via-rename patterns. This largely reduces the problems for non-fsync-using applications. It doesn't solve the problem for a freshly written file, but the system could just as easily have crashed five seconds earlier.
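(For readers unfamiliar with the two patterns named above, here is a bare-bones illustration; it is an editorial sketch, not code from the comment or from the kernel patches, and the file names are examples. Both patterns, when used without fsync(), risked leaving an empty or truncated file after a crash, which is what the 2.6.30 heuristic addresses.)

    /* Illustrative only: the two file-replacement patterns the heuristic
     * detects. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* replace-via-truncate: rewrite the existing file in place */
    static void replace_via_truncate(const char *path, const char *data)
    {
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

        if (fd < 0)
            return;
        if (write(fd, data, strlen(data)) < 0)
            perror("write");
        close(fd);                  /* note: no fsync() before closing */
    }

    /* replace-via-rename: write a new file, then rename it over the old one */
    static void replace_via_rename(const char *path, const char *data)
    {
        char tmp[256];
        int fd;

        snprintf(tmp, sizeof(tmp), "%s.tmp", path);
        fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return;
        if (write(fd, data, strlen(data)) < 0)
            perror("write");
        close(fd);
        rename(tmp, path);          /* again, no explicit fsync() */
    }

    int main(void)
    {
        replace_via_truncate("settings.conf", "mode = writeback\n");
        replace_via_rename("settings.conf", "mode = ordered\n");
        return 0;
    }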
OK, so how does ext4 change things? By default, ext4 on modern kernels (ignoring the technology preview in RHEL 5 and Fedora 10) performs delayed allocation. This means that data blocks are not allocated right away when you write a file, but only when they are forced out, either explicitly via fsync() or via the page writeback algorithms in the VM, which will tend to push things out after 30-45 seconds (ignoring laptop mode), and perhaps sooner if the system is short on memory.
In the security dimension, what this means is that even in data=writeback mode, on a crash the file will in general be truncated or zero-length instead of containing uninitialized data. In ext4 with delayed allocation and data=writeback, there *is* a very tiny race condition: if a transaction closes right after the pdflush daemon allocates the filesystem blocks but before it has a chance to trigger the page writeback, you might end up with uninitialized garbage. This chance is very small, but it is non-zero. In this case, ext4 data=ordered will force the write to disk, so it is technically safer in the security dimension, although this race is very hard to exploit and very rarely gets hit in practice. (This is also why the overhead of data=ordered versus data=writeback is much less for ext4, thanks to delayed allocation --- the difference between the two is not the same, however!)
In the safety against applications that don't use fsync department, as of 2.6.30, ext4 will always do an implied allocation and flush for data=ordered and data=writeback. So there is no real difference here between data=ordered and data=writeback.
The bottom line is that while there is some performance benefit in going with data=writeback with ext4, the differences between data=ordered and data=writeback are much smaller with ext4, in both the cost and benefit dimensions.
Chris Mason is also working on a data=guarded mode, which will cause files to be truncated (much as with delayed allocation) on a crash with ext3. I will look into porting this mode to ext4 if it proves to give enough of a performance advantage over data=ordered while providing a bit more safety than data=writeback. It's not clear to me that it will be worth it for ext4, however.
I hope this helps answer the questions about ext3 versus ext4, and data=ordered versus data=writeback.
Regards,
Ted.
Posted Apr 19, 2009 8:07 UTC (Sun)
by sitaram (guest, #5959)
[Link]
I'm one of those people for whom the security aspect is far more important (*) than data loss -- data loss can happen for so many other reasons that one should have a good, reliable, backup regime anyway, so one more reason doesn't bother me.
So ext3: people with my mindset should stick with data=ordered. (I don't see guarded as being too useful for ext3 -- we'll probably have switched to ext4 by the time guarded becomes mainstream).
Ext4: I think I'll stick with ordered here too. If the overhead has been much reduced by delayed alloc, it correspondingly reduces the main advantage of writeback too :-) I'd rather err on the side of security when the difference is minor.
Although collectively we like choice, and we *need* choice, when it comes to actual usage, we have to rationally reduce the many choices available into one and say "*this* is what we will use"!
Thanks once again for jumping in and helping with that!
Sitaram
(*) My home desktop is used by my kids also, for instance -- so it *is* a multi-user machine in the old traditional sense. The work machine runs email and office apps as one user, and my web browser and IRC as another user (simultaneously), so -- while both users are still me -- it too is multi user in the sense of wanting to keep two disparate sets of files separate.
Posted Apr 19, 2009 22:35 UTC (Sun)
by bojan (subscriber, #14302)
[Link]
Posted Apr 19, 2009 22:50 UTC (Sun)
by bojan (subscriber, #14302)
[Link] (2 responses)
Unless there is a significant performance penalty by updating metadata only after the data has been written, instead of having another mode, this is probably how writeback mode should work.
Posted Apr 20, 2009 0:29 UTC (Mon)
by tytso (subscriber, #9993)
[Link] (1 responses)
> Unless there is a significant performance penalty by updating metadata only after the data has been written, instead of having another mode, this is probably how writeback mode should work.

(Note that data=guarded is only deferring the update of i_size, and not any other form of metadata.) We'll have to benchmark it and see. It does mean that i_size gets updated more, and so that means that the inode has to get updated as blocks are staged out to disk, so that means some extra writes to the journal and inode table. I don't think it should be noticeable, at least for most workloads, since it should be lost in the noise of the data block I/O, but it is extra seeks and extra writes.
Posted Apr 20, 2009 3:43 UTC (Mon)
by bojan (subscriber, #14302)
[Link]
I guess if i_size could be updated just once, when all the blocks are pushed out, then this would be even less of a problem. But, then again, I have no idea how this actually works inside the code, so this suggestion is probably naive.
Posted Aug 5, 2009 1:08 UTC (Wed)
by mdkul (guest, #35333)
[Link] (2 responses)
My tune2fs -l /dev/sda1 says:

Default mount options:    (none)

So what does it default to if it is none? I *don't* have CONFIG_EXT3_DEFAULTS_TO_ORDERED=y set, and my documentation for 2.6.31-rc4 (Documentation/filesystems/ext3.txt) says data=ordered is the default.
Thanks in advance.
Posted Aug 5, 2009 14:16 UTC (Wed)
by ABCD (subscriber, #53650)
[Link] (1 responses)
Posted Aug 5, 2009 16:52 UTC (Wed)
by mdkul (guest, #35333)
[Link]
Posted Apr 25, 2009 10:31 UTC (Sat)
by bluss (subscriber, #47454)
[Link] (1 responses)
Posted Apr 26, 2009 7:11 UTC (Sun)
by dpotapov (guest, #46495)
[Link]
Solving the ext3 latency problem
data=guarded sounds rather interesting, but I'm not sure I understand how it differs from data=ordered. In what situations could the result be different?
Solving the ext3 latency problem
nowadays it is a blip in the data compared to routers, switches, music boxes, GPS systems, laptops, desktops, workstations that run Linux.
Multiuser quite important still: remote users on Windows PC:s
Yes, but are you using EXT3 as the filesystem on those machines? In many (most?) multi-user systems, you're likely to have a big fileserver serving files via NFS or equivalent to various servers.
Solving the ext3 latency problem
I don't understand this:
There is one other important change needed to get a truly quick fsync() with ext3, though: the filesystem must be mounted in data=writeback mode.
Is this because the changes to fsync are disabled in data=ordered, or just because the performance gains are small compared to the overhead of data=ordered?
I'm curious because if fsync is slow application developers won't use it, even if on some systems it's fast. It will be years before application developers start using it "properly" again.
Solving the ext3 latency problem
> A strict interpretation of data=ordered means committing dirty data to disk before any meta data updates.

I'm not sure I agree, but anyway, if it behaves that way, that's fine with me. I like my data not only on disk, but also internally consistent.

> "data=writeback" is the no holds barred assume your system is never going to crash [...] sort of preference.

But if I assume my system is never going to crash, why would I be using fsync()? And why should a file system that works based on that assumption do anything when the application calls fsync()?

> Fortunately, there is a lot of room for reasonable, safer relaxations between data=ordered and data=writeback.

I would actually prefer to see something stricter than data=ordered. Something that gives me the guarantee that the state after a crash corresponds to some logical state of the file system before the crash.
Solving the ext3 latency problem
> I would actually prefer to see something stricter than data=ordered. Something that gives me the guarantee that the state after a crash corresponds to some logical state of the file system before the crash.
You always have the option to mount with "data=journal" - this is the safest and slowest mode with ext3.
And don't forget that RAID5 / RAID6 will break all barrier / journal semantics for all filesystems.
Solving the ext3 latency problem
>> push that kind of change to their users. We will not see the answer to
>> that question for some months yet.
Solving the ext3 latency problem
ext3's does.
Solving the ext3 latency problem
Fundamentally, the problem is caused by data=ordered mode. It can be avoided by mounting the filesystem with data=writeback or by using a filesystem that supports delayed allocation, such as ext4. This is because if you have a small sqlite database which you are fsync()ing, and in another process you are writing a large 2 megabyte file, the 2 megabyte file won't be allocated right away, and so the fsync() operation will not force the dirty blocks of that 2 megabyte file to disk; since the blocks haven't been allocated yet, there is no security issue to worry about with the previous contents of newly allocated blocks if the system were to crash at that point.