JLS2009: A Btrfs update
The Btrfs filesystem was merged for the 2.6.29 kernel, mostly as a way to encourage wider testing and development; it is certainly not meant for production use at this time. That said, there are people doing serious work on top of Btrfs, and it is getting to the point of being stable enough for daring users. Current Btrfs includes an all-caps warning in the Kconfig file stating that the disk format has not yet been stabilized; Chris Mason is planning to remove that warning, perhaps for the 2.6.33 release. Btrfs, in other words, is progressing quickly.
One relatively recent addition is full use of zlib compression. Online resizing and defragmentation are coming along nicely. There has also been some work aimed at making synchronous I/O operations work well.
Defragmentation in Btrfs is easy: any specific file can be defragmented by
simply reading it and writing it back. Since Btrfs is a copy-on-write
filesystem, this rewrite will create a new copy of the file's data which
will be as contiguous as the filesystem is able to make it. This approach
can also be used to control the layout of files on the filesystem. As an
experiment, Chris took a bunch of boot-tracing data from a Moblin system
and analyzed it to figure out which files were accessed, and in which
order. He then rewrote the files in question to put them all in the same
part of the disk. The result was a halving of the I/O time during boot,
resulting in a faster system initialization and smiles all around.
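The rewrite trick described above can be sketched in a few lines. This is only an illustrative sketch, not the tool Chris used (current btrfs-progs also offer a `btrfs filesystem defragment` command); the function name is invented here, and it assumes the file fits in memory and is not being written by anyone else at the time:

```python
import os

def rewrite_in_place(path):
    """Illustrative sketch: rewrite a file so that a copy-on-write
    filesystem such as Btrfs allocates a fresh, hopefully more
    contiguous, copy of its data blocks."""
    with open(path, "rb") as f:
        data = f.read()          # read the (possibly fragmented) contents
    with open(path, "wb") as f:
        f.write(data)            # COW allocates new extents for the rewrite
        f.flush()
        os.fsync(f.fileno())     # force the new copy out to disk
```

Reproducing the Moblin experiment would amount to looping `rewrite_in_place()` over the traced files in their boot-time access order.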
Performance of synchronous operations has been an important issue over the last year. On filesystems like ext3, an fsync() call will flush out a lot of data which is not related to the actual file involved; that adds a significant performance penalty for fsync() use and discourages careful programming. Btrfs has improved the situation by creating an entirely separate Btree on each filesystem which is used for synchronous I/O operations. That tree is managed identically to, but separately from, the regular filesystem tree. When an fsync() call comes along, Btrfs can use this tree to only force out operations for the specific file involved. That gives a major performance win over ext3 and ext4.
A further improvement would be the ability to write a set of files, then flush them all out in a single operation. Btrfs could do that, but there's no way in POSIX to tell the kernel to flush multiple files at once. Fixing that is likely to involve a new system call.
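Absent such a system call, the portable approximation is to write all the files first and then fsync() each one in turn; a minimal sketch (the function name is invented for illustration):

```python
import os

def flush_files(paths):
    """Flush a set of already-written files: fsync() each one in turn.
    Each call may wait on the disk separately - exactly the cost that a
    hypothetical 'flush this whole set' system call could amortize."""
    fds = [os.open(p, os.O_WRONLY) for p in paths]
    try:
        for fd in fds:
            os.fsync(fd)   # one synchronous round trip per file
    finally:
        for fd in fds:
            os.close(fd)
```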
Btrfs provides a number of features which are also available via the device mapper and MD subsystems; some people have wondered if this duplication of features makes sense. But there are some good reasons for it; Chris gave a couple of examples:
- Doing snapshots at the device mapper/LVM layer involves making a lot
more copies of the relevant data. Chris ran an experiment where he
created a 400MB file, created a bunch of snapshots, then overwrote the
file. Btrfs is able to just write the new version, while allowing all
of the snapshots to share the old copy. LVM, instead, copies the data
once for each snapshot. So this test, which ran in less than two
seconds on Btrfs, took about ten minutes with LVM.
- Anybody who has had to replace a drive in a RAID array knows that the rebuild process can be long and painful. While all of that data is being copied, the array runs slowly and does not provide the usual protections. The advantage of running RAID within Btrfs is that the filesystem knows which blocks contain useful data and which do not. So, while an MD-based RAID array must copy an entire drive's worth of data, Btrfs can get by without copying unused blocks.
So what does the future hold? Chris says that the 2.6.32 kernel will
include a version of Btrfs which is stable enough for early adopters to
play with. In 2.6.33, with any luck, the filesystem will have RAID5 and
RAID6 support. Things will then stabilize further for 2.6.34. Chris was
typically cagey when talking about production use, though, pointing out
that it always takes a number of years to develop complete confidence in a
new filesystem. So, while those of us with curiosity, courage, and good
backups could maybe be making regular use of Btrfs within a year,
widespread adoption is likely to be rather farther away than that.
Index entries for this article:
- Kernel: Btrfs
- Kernel: Filesystems/Btrfs
Posted Oct 29, 2009 14:03 UTC (Thu)
by droundy (subscriber, #4559)
Isn't this something that could be achieved in an ordinary RAID if
filesystems supported the TRIM feature that has been touted for SSDs?
Does anyone know if this is being worked on?
Posted Oct 29, 2009 20:19 UTC (Thu)
by bronson (subscriber, #4806)
Regular hard disks are basically big platters of bits. They don't have any allocation tracking. Because implementing a generic (filesystem-agnostic) trim would require adding another software layer and allocating space to bitmaps, I think it's unlikely the benefits would be worth the complexity.
But who knows! It's a little hard to predict the future of storage right now.
Posted Oct 29, 2009 22:30 UTC (Thu)
by filteredperception (guest, #5692)
https://www.redhat.com/archives/dm-devel/2009-October/msg...
Posted Oct 29, 2009 18:26 UTC (Thu)
by Yorick (guest, #19241)
Being able to specify dependencies between different changes - don't write this to disk until that change has been committed - would in many cases make more sense to an application that wants to avoid scrambling the user's files but doesn't really care whether a particular update has taken place if the system crashes. A barrier would do; full transactions would be wonderful. Nothing really needs to be written to disk as long as the change is eventually done in good order.
Of course we want fast fsync() as well for those servers that are required to send us promises that they've taken care of our data, but far from all applications are like that.
Posted Oct 30, 2009 13:53 UTC (Fri)
by mosfet (guest, #45339)
I know one: CouchDB. Hype aside, its extremely robust append-only storage philosophy was enough to attract attention from the Google Chrome developers. At its heart, this algorithm relies on a fast and secure (as in not cheating) fsync().
CouchDB also has a nice user base; it is installed and run by default on Ubuntu 9.10.
I guess other database systems also profit from a better fsync().
Posted Oct 30, 2009 16:07 UTC (Fri)
by iq-0 (subscriber, #36655)
Luckily that doesn't include particularly common programs like 'vim' or 'firefox', that many people use regularly ;-)
Posted Oct 30, 2009 18:24 UTC (Fri)
by anton (subscriber, #25547)
Posted Oct 30, 2009 21:09 UTC (Fri)
by nix (subscriber, #2304)
(not relevant for me, battery-backed RAID arrays now hold absolutely
Posted Oct 31, 2009 22:01 UTC (Sat)
by anton (subscriber, #25547)
Many people don't want to wait for slow
fsync()s; but if you only want to continue working after the fsync()
has finished, just configure your system to stay with the slow
fsync()s; fine with me.
BTW, your battery-backed RAID arrays will not help you when the
kernel crashes, and the file system decides that it should empty or
zero the files you have worked on when doing the fsck or journal
replay.
Posted Nov 1, 2009 7:51 UTC (Sun)
by Cato (guest, #7643)
Posted Nov 1, 2009 19:55 UTC (Sun)
by anton (subscriber, #25547)
The former default ext3 behaviour (data=ordered) should also be ok
for simple cases such as this (i.e., no overwriting of existing blocks
involved). Unfortunately, Ted Ts'o, the current maintainer of ext3,
wants to degrade ext3's default functionality to the lowest common
denominator (i.e., at least as bad as UFS), with better functionality
available through mount options; will this work out any better than
the non-default data=journal?
Posted Nov 1, 2009 13:13 UTC (Sun)
by nix (subscriber, #2304)
(And I do tend to assume that the journalling layer doesn't have lethal
If fsync() is slow, the problem is that fsync() is slow; the solution is
FWIW I use KDE4 and ext4 and have turned barriers off (battery-backed RAID
(Of course my system doesn't crash often either.)
Posted Nov 1, 2009 20:01 UTC (Sun)
by anton (subscriber, #25547)
Posted Nov 1, 2009 20:32 UTC (Sun)
by nix (subscriber, #2304)
And I don't really like the idea of calling sync() every second (or every
Being able to fsync() the important stuff *without* forcing everything
Posted Nov 1, 2009 21:37 UTC (Sun)
by anton (subscriber, #25547)
For an editor that does not fsync(), that would mean that you lose
a few seconds of work (at worst the few seconds between autosaves plus
the few seconds that the file system delays writing).
For every application (including editors), it would mean that if
the developers ensure the consistency of the persistent data in case
of a process kill, they will also have ensured it in case of a system
crash or power outage. So they do not have to do extra work on
consistency against crashes, which would also be extremely impractical
to test.
It should not be too hard to turn Btrfs into a good file system.
Unfortunately, the Linux file systems seem to regress into the dark
ages (well, the 1980s) when it comes to data consistency (e.g., in the
defaults for ext3). And some things I have read from Chris Mason lead
me to believe that Btrfs will be no better.
As for the guarantees that fsync() gives, it gives no useful
guarantee. It's just a prayer to the file system, and most file
systems actually listen to this prayer in more or less the way you
expect; but some require more prayers than others, and some ignore the
prayer. I wonder why Ted Ts'o does not apologize for implementing
fsync() in a somewhat useful way instead of the fastest way that still
satisfies the letter of the POSIX specification.
Posted Nov 2, 2009 0:22 UTC (Mon)
by nix (subscriber, #2304)
In that case, I agree: editors should not fsync() their autosave state if
And that's why Ted's gone to some lengths to make fsync() fast in ext4:
Posted Nov 2, 2009 21:09 UTC (Mon)
by anton (subscriber, #25547)
As for Ted Ts'o, I would have preferred it if he went to some
lengths to make ext4 a good file system; then fsync() would not be
needed as much. Hmm, makes me wonder if he made fsync() fast because
ext4 is bad, or if he made ext4 bad in order to encourage use of
fsync().
Posted Nov 2, 2009 8:37 UTC (Mon)
by njs (subscriber, #40338)
[Citation needed] -- or in other words, if this is so possible, why are no modern filesystem experts working on it, AFAICT? How are you going to be efficient when the requirement you stated requires that arbitrary requests be handled in serial order, forcing you to wait for disk seek latencies?
> I wonder why Ted Ts'o does not apologize for implementing fsync() in a somewhat useful way instead of the fastest way that still satisfies the letter of the POSIX specification.
Err, why should he apologize for implementing things in a useful way?
Posted Nov 2, 2009 21:34 UTC (Mon)
by anton (subscriber, #25547)
Posted Nov 3, 2009 19:56 UTC (Tue)
by nix (subscriber, #2304)
I find it odd that one minute you're complaining that filesystems are
If you want the total guarantees you're aiming for, write an FS atop a
Posted Nov 5, 2009 13:49 UTC (Thu)
by anton (subscriber, #25547)
If crashes are as irrelevant as you claim, why should anybody use
fsync()? And why are you and Ted Ts'o agonizing over fsync() speed? Just
turn it into a noop, and it will be fast.
Posted Nov 5, 2009 18:35 UTC (Thu)
by nix (subscriber, #2304)
Posted Nov 5, 2009 18:44 UTC (Thu)
by nix (subscriber, #2304)
But when the system has just written out my magnum opus, by damn I want
With btrfs, this golden vision of fast fsync() even under high disk write
Again: I'm not interested in fsync() to prevent filesystem corruption
I hope that's clearer :)
Posted Nov 8, 2009 21:53 UTC (Sun)
by anton (subscriber, #25547)
OTOH, I care more about data consistency. If we want to combine
these two concerns, we get to some interesting design choices:
Committing the fsync()ed file before earlier writes to other files
would break the ordering guarantee that makes a file system good (of
course, we would only see this in the case of a crash between the time
of the fsync() and the next regular commit). If the file system wants
to preserve the write order, then fsync() pretty much becomes sync(),
i.e., the performance behaviour that you do not want.
One can argue that an application that uses fsync() knows what it
is doing, so it will do the fsync()s in an order that guarantees data
consistency for its data anyway.
Counterarguments: 1) The crash case probably has not been tested
extensively for this application, so it may have gotten the order of
fsync()s wrong and doing the fsync()s right away may compromise the
data consistency after all. 2) This application may interact with
others in a way that makes the ordering of its writes relative to the
others important; committing these writes in a different order opens a
data inconsistency window.
Depending on the write volume of the applications on the machine,
on the trust in the correctness of the fsync()s in all the
applications, and on the way the applications interact with the users,
the following are reasonable choices: 1) fsync() as sync (slowest); 2)
fsync() as out-of-order commit; 3) fsync() as noop.
BTW, I find your motivating example still unconvincing: If you edit
your magnum opus or your tax records, wouldn't you use an editor
that autosaves regularly? Ok, your editor does not fsync() the
autosaves, so with a bad file system you will lose the work, but on a
good file system you won't, so you will also use a good file system
for that, won't you? So it does not really matter for how long you
slaved away on the file, a crash will only lose very little data. Or,
if you work in a way that can lose everything, why were the tax records
after 2h59' not important enough to merit more precautions, but after
3h a fast fsync() is more important than anything else?
An example where a synchronous commit is really needed is a remote
"cvs commit" (and maybe similar operations in other version control
systems): once a file is committed on the remote machine, the file's
version number is updated on the local machine, so the remote commit
had better stay committed, even if the remote machine crashes in the
meantime. Of course, the problem here is that a cvs commit can easily
commit hundreds of files; if it fsync()s every one of them separately,
the cumulative waiting for the disk may be quite noticeable. Doing the
equivalent for all the files at once could be faster, but we have no
good way to tell that to the file system (AFAIK CVS works a file at a
time, so it wouldn't matter for CVS, but there may be other
applications where it does). Hmm, if there are few writes by other
applications at the same time, and all the fsync()s were done in the
end, then fsync()-as-sync could be faster than out-of-order fsync()s:
The first fsync() would commit all the files, and the other fsync()s
would just return immediately.
Posted Nov 2, 2009 8:45 UTC (Mon)
by njs (subscriber, #40338)
[1] (setq write-region-inhibit-fsync t)
Posted Nov 2, 2009 17:11 UTC (Mon)
by nix (subscriber, #2304)
(Did the force-fsync()-to-do-nothing patch ever get lumped into laptop_mode as people were suggesting? I don't have a laptop so I don't follow this sort of thing closely...)
Posted Nov 2, 2009 17:39 UTC (Mon)
by mjg59 (subscriber, #23239)
Posted Nov 2, 2009 19:11 UTC (Mon)
by foom (subscriber, #14868)
What difference does it make if it's implemented in the kernel or in an LD_PRELOAD library?
Posted Nov 2, 2009 19:17 UTC (Mon)
by mjg59 (subscriber, #23239)
Posted Nov 2, 2009 20:39 UTC (Mon)
by njs (subscriber, #40338)
It's really *annoying* that firefox/sqlite issue fsyncs when storing history information, but I actually find that history information valuable enough that I don't want it all blown away on every crash, and there's really no way to avoid that without fsync.
I would love to see an API that allowed sqlite to express its data integrity requirements without forcing the disk to spin up, but this is not simple: http://www.sqlite.org/atomiccommit.html
Posted Nov 2, 2009 21:58 UTC (Mon)
by anton (subscriber, #25547)
There is a different reason for syncing in such applications: A
remote user won't notice that the database server lost power or
crashed right after his transaction went through, so the database
should better ensure that the data is in permanent storage before
reporting completion to remote users.
As for the firefox history, a good file system would be a way to
avoid losing it completely, without requiring fsync().
Posted Nov 2, 2009 23:14 UTC (Mon)
by njs (subscriber, #40338)
Posted Nov 3, 2009 23:11 UTC (Tue)
by anton (subscriber, #25547)
Posted Nov 4, 2009 0:01 UTC (Wed)
by nix (subscriber, #2304)
Posted Nov 5, 2009 14:04 UTC (Thu)
by anton (subscriber, #25547)
What you write about these figures [citation needed] reminds me of
my experiences with copying
stuff to flash devices. However, no writing to an ext3 file
system was involved there, and I suspect that the problem is sitting at a
lower level than the msdos or vfat file system.
Posted Nov 5, 2009 18:08 UTC (Thu)
by nix (subscriber, #2304)
Posted Nov 4, 2009 8:40 UTC (Wed)
by njs (subscriber, #40338)
Posted Nov 5, 2009 14:09 UTC (Thu)
by nye (subscriber, #51576)
Nevertheless, it's generally a requirement for consistency in the face of application crashes (never mind system crashes or power cuts), unless you want to be dealing with full-blown transactional operations at the application level - which could be very little work if performed using facilities provided by the filesystem, but then wouldn't be portable.
Posted Nov 5, 2009 14:14 UTC (Thu)
by anton (subscriber, #25547)
Of course, for applications that overwrite stuff in place (usually
databases), it's not a good file system, and those applications need fsync() with it.
Posted Nov 8, 2009 2:36 UTC (Sun)
by butlerm (subscriber, #13312)
Most importantly a high performance filesystem needs to be able to sync the
Ext4 fixes these problems, but either requires an fsync or inserts one to
Posted Nov 8, 2009 22:04 UTC (Sun)
by anton (subscriber, #25547)
Posted Oct 30, 2009 0:51 UTC (Fri)
by jackb (guest, #41909)
The only way to make this work with filesystem RAID is to create 4 separate
Posted Oct 30, 2009 10:20 UTC (Fri)
by nix (subscriber, #2304)
Posted Oct 30, 2009 20:12 UTC (Fri)
by jackb (guest, #41909)
Posted Oct 30, 2009 21:15 UTC (Fri)
by nix (subscriber, #2304)
Posted Oct 30, 2009 16:11 UTC (Fri)
by giraffedata (guest, #1954)
I don't follow. Don't LVM snapshots of a volume share blocks?
Posted Nov 2, 2009 8:50 UTC (Mon)
by njs (subscriber, #40338)
Posted Nov 2, 2009 15:14 UTC (Mon)
by giraffedata (guest, #1954)
What I was hoping to get to is whether this difference is an inherent difference between doing snapshots in the filesystem vs in the logical volume. Apparently, it isn't, because LVM could use the same strategy if it wanted to.
Or maybe it's more important in LVM than Btrfs for the original block to stay with the base version?
Posted Nov 8, 2009 3:18 UTC (Sun)
by butlerm (subscriber, #13312)
The problem is something like that is probably much too complex for LVM,
However, if the version tracking segment is itself on a typical storage
Posted Nov 8, 2009 11:54 UTC (Sun)
by nix (subscriber, #2304)
Posted Oct 30, 2009 16:14 UTC (Fri)
by giraffedata (guest, #1954)
Well it doesn't have to be anything fancy like an fsync call with multiple file descriptors. It could be a new kind of fadvise() advice: "this file will be synchronized soon." Do that for every file in the set, then fsync them all one at a time.
Posted Nov 8, 2009 1:27 UTC (Sun)
by butlerm (subscriber, #13312)
For example suppose you want to do a write rename replace for a set of
If you are doing this with lots of files, a synchronous commit (or the
Posted Nov 8, 2009 1:35 UTC (Sun)
by giraffedata (guest, #1954)
All you need is fsync. Do it on each file in turn, after having done the fadvise on every file. The last fsync will complete at the same time as a single hypothetical "wait on all these files" would.
Posted Nov 8, 2009 2:52 UTC (Sun)
by butlerm (subscriber, #13312)
That is still somewhat problematic though, since sync_file_range will not
JLS2009: A Btrfs update
Fast fsync() is of course welcome, but only really needed by some applications. The ability to guarantee that a change has been committed to permanent storage (before replying to a network request, say) is nice, but even when optimised in the way the article suggests, likely to be unnecessarily expensive when such a guarantee isn't required.
JLS2009: A Btrfs update
Vim and firefox don't need fsync(), they just use it, because it's
apparently the most portable way of praying for consistency across
crashes. A good file system guarantees POSIX logical ordering
across crashes, and then applications like vim or firefox do not need
fsync(). If the computer is used only for such applications (not, e.g.,
remote transactional systems), then the sysadmin could use it in a
mode where fsync() is a noop, and it would be really fast. Let's hope
that Btrfs is headed that way.
JLS2009: A Btrfs update
thank you very much. I don't want the OS deciding to hold on to it for 30s
or 300s or whatever before flushing it back. It's not keeping the FS
consistent across crashes I care about: it's preserving *the stuff I just
saved* across crashes!
everything I care about at home and at work, bwahaha)
I don't mind losing a few seconds of my work on a crash, if I learn
about the crash right away (as mentioned, remote servers can be a
different issue). I do mind it if the file system loses an hour of my
work, as has
happened to me and as the ext4 author Ted Ts'o believes file
systems should behave.
JLS2009: A Btrfs update
Yes, data=journal should be ok, unless they introduce one of the file
system corruption bugs like one I read about (for data=journal) some
years ago. I guess this was not noticed during development because
it's a non-default mode, so it's tested by few (and typically those
people who do use such hopefully-safer, slower features don't run bleeding-edge
kernels).
JLS2009: A Btrfs update
course battery-backed disk storage doesn't help if in the absence of
fsync() the OS hasn't pushed any data anywhere near said storage yet!
data-eating bugs, or at least none that bite me. Should they do so, well,
that sort of rare disaster is what backups are for.)
to speed it up, not rip the calls out of things like your text editor. (FF
using fsync() for transient-but-bulky state like the awesome bar is nuts,
agreed.)
array, again) and have had not a single instance of sudden death by
zeroing. So it doesn't happen to everyone.
An fsync() that really synchronously writes to the disk is always
going to be slow, because it lets the program wait for the disk(s).
And with a good file system it's completely unnecessary for an
application like an editor; editors just call it as a workaround for
bad file systems.
JLS2009: A Btrfs update
second? I can't see any way in which you could get the guarantees fsync()
does for files you really care about without paying some kind of price for
it in latency for those files.
five, thank you ext3).
else to disk, like btrfs promises, seems very nice. Now my editor files
can be fsync()ed without also requiring me to wait for a few hundred Mb of
who-knows-what breadcrumb crud from FF to also be synced.
In a good file system, the state after recovery is the logical state
of the file system of some point in time (typically a few seconds)
before the crash. It's possible to implement that efficiently
(especially in a copy-on-write file system).
JLS2009: A Btrfs update
almost every keystroke plus a filesystem that preserves *some* consistent
state, but not necessarily the most recent one.
they're preserving it every keystroke or so (and the filesystem should not
destroy the contents of the autosave file: thankfully neither ext3 nor
ext4 do so, now that they recognize rename() as implying a
block-allocation ordering barrier). But I certainly don't agree that
editors shouldn't fsync() files *when you explicitly asked it to save
them*! No, I don't think it's acceptable to lose work, even a few seconds'
work, after I tell an editor 'save now dammit'. That's what 'save'
*means*.
because he wants people to actually *use* it.
Using fsync() does not prevent losing a bit of work when you press
save, because the system can crash between the time when you hit save
and when the application actually calls and completes fsync(). The only thing that
fsync() buys is that the save takes longer, and once it's finished and
the application lets you work again, you won't lose the few seconds.
That may be worth the cost for you, but I wonder why?
JLS2009: A Btrfs update
[Citation needed]
Yes, I have wanted to write this down for some time. Real soon now,
promised!
if this is so possible, why are no modern filesystem experts working
on it, AFAICT?
Maybe they are, or they consider it a solved problem and have moved on
to other challenges. As for those file system experts that we read
about on LWN (e.g., Ted Ts'o), they are not modern as far as data
consistency is concerned, instead they are regressing to the 1980s.
And they are so stuck in that mindset that they don't see the need for
something better. Probably something like: "Sonny, when we were
young, we did not need data consistency from the file system; and if
fsync() was good enough for us, it's certainly good enough for you!".
How are you going to be efficient when the requirement you
stated requires that arbitrary requests be handled in serial order,
It doesn't. All the changes between two commits can be written out in
arbitrary order, only the commit has to come after all these
writes.
Err, why should [Ted Ts'o] apologize for implementing
things in a useful way?
He has done so before.
JLS2009: A Btrfs update
for doing things right.
useless because problems occur if you don't fsync(), then the next moment
you're complaining that it's too slow, then the next moment you're
complaining about the precise opposite.
relational database. You *will* experience an enormous slowdown. This is
why all such filesystems (and there have been a few) have tanked: crashes
are rare enough that basically everyone is willing to trade off the chance
of a little rare corruption against a huge speedup all the time. (I can't
remember the time I last had massive filesystem corruption due to power
loss or system crashes. I've had filesystem corruption due to buggy drive
firmware, and filesystem corruption due to electrical storms... but
neither of these would be cured by your magic all-consistent filesystem,
because in both cases the drive wasn't writing what it was asked to write.
And *that* is more common than the sort of thing you're agonizing over. In
fact it seems to be getting more common all the time.)
I understood Ted Ts'o's apology as follows: he thinks that applications
should use fsync() in lots of places; by contributing to a better
file system where that is not as necessary, application
developers were not punished by the file system as he thinks they
should be, and he apologized for spoiling them in this way.
JLS2009: A Btrfs update
I find it odd that one minute you're complaining that filesystems are
useless because problems occur if you don't fsync(), then the next moment
you're complaining that it's too slow, then the next moment you're
complaining about the precise opposite.
Are you confusing me with someone else, are you trying to put up a
straw man, or was my position so hard to understand? Anyway, here it
is again:
Your relational database file system is a straw man; I hope you
had good fun beating it up.
JLS2009: A Btrfs update
like that. Sorry.
JLS2009: A Btrfs update
that are going on is the system chewing to itself, all you need is
consistency across crashes.
that to hit persistent storage right now! The fsync() should bypass all
other disk I/O as much as possible and hit the disk absolutely as fast as
it can: slowing to disk speeds is fine, we're talking human reaction time
here which is much slower: I don't care if writing out my tax records
takes five seconds 'cos I just spent three hours slaving over them, five
seconds is nothing. But waiting behind a vast number of unimportant writes
(which were all asynchronous until our fsync() forced them out because of
filesystem infelicities) is not fine: if we have to wait for minutes for
our stuff to get out, we may as well have done an asynchronous write.
load is possible. With ext*, it mostly isn't (you have to force earlier
stuff to the disk even if I don't give a damn about it and nobody ever
fsync()ed it), and in ext3 without data=writeback, fsync() is so slow when
contending with write loads that app developers were tempted to drop this
whole requirement and leave my magnum opus hanging about in transient
storage for many seconds. With ext4 at least fsync() doesn't stall my apps
merely because bloody firefox decided to drop another 500Mb hairball.
(that mostly doesn't happen, thanks to the journal, even if the power
suddenly goes out). I'm interested in saving *the contents of particular
files* that I just saved. If you're writing a book, and you save a
chapter, you care much more about preserving that chapter in case of power
fail than you care about some random FS corruption making off
with /usr/bin; fixing the latter is one reinstall away, but there's
nothing you can reinstall to get your data back.
Sure, if the only thing you care about in a file system is that
fsync()s complete quickly and still hit the disk, use a file system
that gives you that.
JLS2009: A Btrfs update
it assumes it knows better than applications - the applications should either change behaviour
themselves, or have an LD_PRELOADed library that makes fsync() behaviour conditional on battery
state.
JLS2009: A Btrfs update
intentionally to break fsync...
JLS2009: A Btrfs update
But I don't want fsync() to do nothing at all, because
there are lots of cases where a poorly-timed crash can cause you to
lose not 10 minutes of work, but your entire data store. This applies
to basically anything using a more complex data storage strategy than
"rewrite the entire data store every time", e.g. dbm, sqlite, and
databases generally.
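SQLite is a concrete case of this reliance on fsync(). A small sketch (the database path and table are invented for illustration; `PRAGMA synchronous` is SQLite's real knob controlling how aggressively it calls fsync() at commit):

```python
import os
import sqlite3
import tempfile

# Hypothetical key/value store; path and schema are made up for this example.
path = os.path.join(tempfile.mkdtemp(), "store.db")
conn = sqlite3.connect(path)

# synchronous=FULL asks SQLite to fsync() at every transaction commit.
# That fsync() is what lets its journal recovery guarantee that a commit
# either fully happened or fully didn't, even across a power failure.
conn.execute("PRAGMA synchronous=FULL")
conn.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT)")
conn.execute("INSERT INTO kv VALUES ('chapter-1', 'draft text')")
conn.commit()
conn.close()
```

If fsync() were silently turned into a no-op, SQLite's crash-recovery reasoning would no longer hold, which is exactly the "entire data store" risk described above.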
If these applications don't corrupt their storage when they crash on
their own or are killed, they won't corrupt it on a good file system
even on a system crash. So it's only on bad file systems where the
absence of fsync() would cause consistency problems. And how can you
be sure that the fsync()s called from these applications are
sufficient? Testing this stuff is pretty hard.
I think that ext3 with data=journal or data=ordered is pretty close to
a good file system for applications that don't overwrite files in
place (e.g., editors). But I would be more confident if some file
system developer actually made data consistency a design goal and gave
some explicit guarantees.
care at all about either read or write speed. The latency figures Linus
posted (from one process dd(1)ing and another writing tiny files and
fsync()ing them) are appalling. We're not talking a mere few seconds,
we're talking over a minute at times.
ext3 with data=ordered is fast enough in my experience (which includes
several multi-user servers).
the per-bdi writeback fix should solve. I saw it back in the days before
cheap USB hard drives, when I ran backups onto CD-RW...
Most applications don't even append, they just write a new file in one go (and
some then rename it, unlinking the old one). I think that ext3 data=ordered
is a good file system for these applications.
slow for a number of important use cases.
data of one file independent of all the pending data for every other open
file. That is the whole problem with ext3 - it doesn't do that, so an fsync
under competing write load is very slow.
make a rename replacement an atomic operation. That delay could be avoided
with some reasonable internal modifications (keeping the old inode around
until the new inode's data commits, and then undoing the rename if necessary
on journal recovery), but I am not aware of any filesystem that actually does
that. You have to call fsync to make your code portable anyway, but there
are a number of applications where that is too expensive.
I don't see that fsync() makes my code (or anyone else's) portable.
POSIX gives no useful guarantees on fsync(); different file systems
have different requirements for what you have to fsync() in order to
really commit a file. So use of fsync() is inherently non-portable.
encryption too. Right now I can take 4 hard drives and combine them into one
RAID block device, install LUKS on that block device and add file systems on
top of that.
encrypted disks and enter 4 passphrases every time the system boots.
Doing snapshots at the device mapper/LVM layer involves making a lot more copies of the relevant data. Chris ran an experiment where he created a 400MB file, created a bunch of snapshots, then overwrote the file. Btrfs is able to just write the new version, while allowing all of the snapshots to share the old copy. LVM, instead, copies the data once for each snapshot.
OK, I see. When you write to the base version, LVM copies the original data to a new block for each existing snapshot, then updates the original block. Btrfs instead writes the new data to a new block for the base version and leaves the snapshots pointing to the original block.
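The difference in write cost is easy to model. Here is a toy Python sketch (illustrative only, not real filesystem code; the function and scheme names are mine) counting the block writes one overwrite costs under each scheme:

```python
# Toy model: how many block writes a single overwrite costs when
# n_snapshots snapshots reference the block being overwritten.

def writes_per_overwrite(n_snapshots: int, scheme: str) -> int:
    if scheme == "cow":
        # Btrfs-style copy-on-write: write the new data to a fresh block;
        # every snapshot keeps pointing at the old block for free.
        return 1
    if scheme == "copy-before-write":
        # LVM-style: first copy the old block into each snapshot's
        # exception store, then overwrite the original in place.
        return n_snapshots + 1
    raise ValueError(f"unknown scheme: {scheme}")
```

With ten snapshots, the copy-before-write scheme turns one logical write into eleven physical ones, while copy-on-write still does just one, which matches the 400MB-file experiment described above.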
to make read-only snapshots of filesystems (hence the lawsuit). NetApp uses
the same scheme to make read-only snapshots of virtual block devices as well.
comparable in complexity to Btrfs itself. So for LVM to avoid the
copy-before-write problem, presumably it would have to use a scheme where the
physical locations of one or more versions of each block are stored in a
persistent segment somewhere.
device, every random write to something that a snapshot has been taken of
requires both a write to a new block on the disk and a write to the version
pointer entry. Short of locating the segment in NVRAM or a more reliable
than average flash device, that is a bit of a problem.
array dirty state, one write could be amplified to, what, six? (of course
you don't get a superblock update with every write unless writes are quite
rare... but often writes *are* rare.)
but there's no way in POSIX to tell the kernel to flush multiple files at once. Fixing that is likely to involve a new system call.
had the semantics of "wait on all the pseudo-synchronous fsync operations
that were just initiated". Otherwise the semantics wouldn't be fsync like
at all.
files. On many filesystems, the rename metadata operation can commit
before the data from the previous write commits, so the only safe way to do
this is to fsync the new version before calling rename. Otherwise, after a
crash you may get no usable version at all: not the old version, not the new
version, just a zero-length file.
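That fsync-before-rename dance can be packaged up once. A sketch in Python (the helper name `atomic_replace` is mine, not from any library; `os.replace` is the stdlib's atomic rename):

```python
import os
import tempfile

def atomic_replace(path: str, data: bytes) -> None:
    """Replace `path` with `data` so that a crash leaves either the old
    or the new contents, never a zero-length file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        os.write(fd, data)
        os.fsync(fd)              # data must be on disk before the rename
    finally:
        os.close(fd)
    os.replace(tmp, path)         # atomic rename over the old version
    dirfd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dirfd)           # make the rename itself durable
    finally:
        os.close(dirfd)
```

The temporary file lives in the same directory as the target so that the rename stays within one filesystem; the final directory fsync makes the rename itself survive a crash.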
equivalent) of the data for the whole group prior to the renames for the
whole group is the only efficient way to go. Short of that, you would need
to spawn a large number of threads, issue fsync-plus-rename operations in
each one, and wait for them all to finish.
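A minimal sketch of that thread-per-file workaround in Python (the helper names are made up for illustration):

```python
import os
import tempfile
import threading

def fsync_then_rename(tmp_path: str, final_path: str) -> None:
    fd = os.open(tmp_path, os.O_RDONLY)
    try:
        os.fsync(fd)              # each thread blocks here independently
    finally:
        os.close(fd)
    os.replace(tmp_path, final_path)

def commit_group(pairs) -> None:
    # One thread per file, so the synchronous fsync waits overlap
    # instead of happening strictly one after another.
    threads = [threading.Thread(target=fsync_then_rename, args=p)
               for p in pairs]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

This only overlaps the waiting; it still issues one fsync per file, which is exactly the inefficiency a batched interface would remove.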
There would also need to be a synchronous fadvise call or the equivalent
that had the semantics of "wait on all the pseudo-synchronous fsync
operations that were just initiated".
over serial fsyncs alone. I think you can more or less do the same thing now
on Linux with sync_file_range(...,SYNC_FILE_RANGE_WRITE). Without additional
flags, that schedules asynchronous writeout of the specified part of the
file. Then, when you are all done, call fsync on every fd in the list, as
you say.
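Roughly what that pattern looks like from Python, calling the real Linux sync_file_range(2) through ctypes (the flag value 2 comes from the kernel headers; the helper names `start_writeback` and `flush_files` are invented for this sketch):

```python
import ctypes
import os
import tempfile

SYNC_FILE_RANGE_WRITE = 2      # from the Linux <fcntl.h> headers

def start_writeback(fd: int) -> None:
    """Ask the kernel to begin asynchronous writeback of the whole file.
    Falls back to a no-op where sync_file_range() is unavailable."""
    try:
        libc = ctypes.CDLL(None, use_errno=True)
        sfr = libc.sync_file_range
        sfr.argtypes = [ctypes.c_int, ctypes.c_longlong,
                        ctypes.c_longlong, ctypes.c_uint]
        sfr(fd, 0, 0, SYNC_FILE_RANGE_WRITE)   # offset 0, nbytes 0 = whole file
    except (AttributeError, OSError):
        pass                    # non-Linux libc: the fsync below still works

def flush_files(fds) -> None:
    for fd in fds:
        start_writeback(fd)     # overlap writeout across all the files
    for fd in fds:
        os.fsync(fd)            # each wait then finds most data already gone
```

As the comment notes, this only batches the data writeout; each fsync may still trigger its own journal commit for the metadata.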
initiate writeout of the metadata, which could be significant. Depending on
the way the filesystem handles metadata you could have a very similar
problem, with a journal write and synchronous wait for every fsync. So
something like fadvise options that schedule data and/or metadata for
immediate writeout would be helpful there.