
Improving ext4: bigalloc, inline data, and metadata checksums

By Jonathan Corbet
November 29, 2011
It may be tempting to see ext4 as last year's filesystem. It is solid and reliable, but it is based on an old design; all the action is to be found in next-generation filesystems like Btrfs. But it may be a while until Btrfs earns the necessary level of confidence in the wider user community; meanwhile, ext4's growing user base has not lost its appetite for improvement. A few recently-posted patch sets show that the addition of new features to ext4 has not stopped, even as that filesystem settles in for a long period of stable deployments.

Bigalloc

In the early days of Linux, disk drives were still measured in megabytes and filesystems worked with blocks of 1KB to 4KB in size. As this article is being written, terabyte disk drives are not quite as cheap as they recently were, but the fact remains: disk drives have gotten a lot larger, as have the files stored on them. But the ext4 filesystem still deals in 4KB blocks of data. As a result, there are a lot of blocks to keep track of, the associated allocation bitmaps have grown, and the overhead of managing all those blocks is significant.

Raising the filesystem block size in the kernel is a dauntingly difficult task involving major changes to memory management, the page cache, and more. It is not something anybody expects to see happen anytime soon. But there is nothing preventing filesystem implementations from using larger blocks on disk. As of the 3.2 kernel, ext4 will be capable of doing exactly that. The "bigalloc" patch set adds the concept of "block clusters" to the filesystem; rather than allocate single blocks, a filesystem using clusters will allocate them in larger groups. Mapping between these larger blocks and the 4KB blocks seen by the core kernel is handled entirely within the filesystem.

The cluster size to use is set by the system administrator at filesystem creation time (using a development version of e2fsprogs), but it must be a power of two. A 64KB cluster size may make sense in a lot of situations; for a filesystem that holds only very large files, a 1MB cluster size might be the right choice. Needless to say, selecting a large cluster size for a filesystem dominated by small files may lead to a substantial amount of wasted space.
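To make the trade-off concrete, here is a rough, illustrative calculation (plain C, not filesystem code) of how much space a single 7,000-byte file would consume when its allocation is rounded up to a 4KB block, a 64KB cluster, or a 1MB cluster; the file size is an arbitrary example.

    /* Illustrative only: allocation is rounded up to the allocation unit. */
    #include <stdio.h>

    static unsigned long long round_up(unsigned long long size,
                                       unsigned long long unit)
    {
        return ((size + unit - 1) / unit) * unit;
    }

    int main(void)
    {
        unsigned long long file_size = 7000;   /* an arbitrary small file, in bytes */
        unsigned long long units[] = { 4096, 65536, 1048576 };

        for (int i = 0; i < 3; i++) {
            unsigned long long alloc = round_up(file_size, units[i]);
            printf("allocation unit %7llu: allocated %8llu, wasted %8llu bytes\n",
                   units[i], alloc, alloc - file_size);
        }
        return 0;
    }

With a 1MB cluster, over 99% of the allocation is waste for a file this small, which is why the cluster size needs to match the expected distribution of file sizes.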

Clustering reduces the space overhead of the block bitmaps and other management data structures. But, as Ted Ts'o documented back in July, it can also increase performance in situations where large files are in use. Block allocation times drop significantly, but file I/O performance also improves in general as the result of reduced on-disk fragmentation. Expect this feature to attract a lot of interest once the 3.2 kernel (and e2fsprogs 1.42) make their way to users.

Inline data

An inode is a data structure describing a single file within a filesystem. For most filesystems, there are actually two types of inode: the filesystem-independent in-kernel variety (represented by struct inode), and the filesystem-specific on-disk version. As a general rule, the kernel cannot manipulate a file in any way until it has a copy of the inode, so inodes, naturally, are the focal point for a lot of block I/O.

In the ext4 filesystem, the size of on-disk inodes can be set when a filesystem is created. The default size is 256 bytes, but the on-disk structure (struct ext4_inode) only requires about half of that space. The remaining space after the ext4_inode structure is normally used to hold extended attributes. Thus, for example, SELinux labels can be found there. On systems where extended attributes are not heavily used, the space between on-disk inode structures may simply go to waste.

Meanwhile, space for file data is allocated in units of blocks, separately from the inode. If a file is very small (and, even on current systems, there are a lot of small files), much of the block used to hold that file will be wasted. If the filesystem is using clustering, the amount of lost space will grow even further, to the point that users may start to complain.

Tao Ma's ext4 inline data patches may change that situation. The idea is quite simple: very small files can be stored directly in the space between inodes without the need to allocate a separate data block at all. On filesystems with 256-byte on-disk inodes, the entire remaining space will be given over to the storage of small files. If the filesystem is built with larger on-disk inodes, only half of the leftover space will be used in this way, leaving space for late-arriving extended attributes that would otherwise be forced out of the inode.
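As a rough sketch of the numbers involved, the fragment below applies the rule described above (all of the leftover space for 256-byte inodes, half of it for larger ones), assuming the fixed portion of the on-disk inode occupies 128 bytes; the real patches may divide the space slightly differently.

    /* A back-of-the-envelope sketch (not ext4 code) of how much inline-data
     * space the scheme described above would leave. */
    #include <stdio.h>

    #define EXT4_INODE_FIXED 128   /* assumed size of the fixed struct ext4_inode */

    static unsigned inline_space(unsigned inode_size)
    {
        unsigned leftover = inode_size - EXT4_INODE_FIXED;

        /* Per the patch description: 256-byte inodes give the whole leftover
         * to inline data; larger inodes give only half, keeping room for
         * extended attributes. */
        return inode_size == 256 ? leftover : leftover / 2;
    }

    int main(void)
    {
        unsigned sizes[] = { 256, 512, 1024 };
        for (int i = 0; i < 3; i++)
            printf("inode size %4u -> ~%3u bytes of inline data\n",
                   sizes[i], inline_space(sizes[i]));
        return 0;
    }

Any file small enough to fit in that space needs no separate data block at all.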

Tao says that, with this patch set applied, the space required to store a kernel tree drops by about 1%, and /usr gets about 3% smaller. The savings on filesystems where clustering is enabled should be somewhat larger, but those have not yet been quantified. There are a number of details to be worked out yet - including e2fsck support and the potential cost of forcing extended attributes to be stored outside of the inode - so this feature is unlikely to be ready for inclusion before 3.4 at the earliest.

Metadata checksumming

Storage devices are not always as reliable as we would like them to be; stories of data corrupted by the hardware are not uncommon. For this reason, people who care about their data make use of technologies like RAID and/or filesystems like Btrfs which can maintain checksums of data and metadata and ensure that nothing has been mangled by the drive. The ext4 filesystem, though, lacks this capability.

Darrick Wong's checksumming patch set does not address the entire problem. Indeed, it risks reinforcing the old jest that filesystem developers don't really care about the data they store as long as the filesystem metadata is correct. This patch set seeks to achieve that latter goal by attaching checksums to the various data structures found on an ext4 filesystem - superblocks, bitmaps, inodes, directory indexes, extent trees, etc. - and verifying that the checksums match the data read from the filesystem later on. A checksum failure can cause the filesystem to fail to mount or, if it happens on a mounted filesystem, remount it read-only and issue pleas for help to the system log.
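The verify-on-read idea itself is simple enough to sketch: compute a checksum when a metadata block is written, store it, and compare when the block is read back. The fragment below is purely illustrative; it uses a plain CRC32c over a stand-in buffer and does not reflect ext4's actual on-disk checksum fields or seeding.

    /* A minimal sketch of checksum-and-verify for a metadata block.
     * Illustrative only; ext4's real on-disk layout and seeding differ. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    static uint32_t crc32c(uint32_t crc, const void *buf, size_t len)
    {
        const uint8_t *p = buf;

        crc = ~crc;
        while (len--) {
            crc ^= *p++;
            for (int k = 0; k < 8; k++)
                crc = (crc >> 1) ^ (0x82F63B78 & -(crc & 1));
        }
        return ~crc;
    }

    int main(void)
    {
        uint8_t block[4096];
        memset(block, 0xab, sizeof(block));          /* stand-in metadata block */

        uint32_t stored = crc32c(0, block, sizeof(block));   /* checksum at write time */

        block[100] ^= 0x01;                          /* simulate a flipped bit */
        uint32_t computed = crc32c(0, block, sizeof(block)); /* checksum at read time */

        if (computed != stored)
            printf("checksum mismatch: metadata corruption detected\n");
        return 0;
    }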

Darrick makes no mention of any plans to add checksums for data as well. In a number of ways, that would be a bigger set of changes; checksums are relatively easy to add to existing metadata structures, but an entirely new data structure would have to be added to the filesystem to hold data block checksums. The performance impact of full-data checksumming would also be higher. So, while somebody might attack that problem in the future, it does not appear to be on anybody's list at the moment.

The changes to the filesystem are significant, even for metadata-only checksums, but the bulk of the work actually went into e2fsprogs. In particular, e2fsck gains the ability to check all of those checksums and, in some cases, fix things when the checksum indicates that there is a problem. Checksumming can be enabled with mke2fs and toggled with tune2fs. All told, it is a lot of work, but it should help to improve confidence in the filesystem's structure. According to Darrick, the overhead of the checksum calculation and verification is not measurable in most situations. This feature has not drawn a lot of comments this time around, and may be close to ready for inclusion, but nobody has yet said when that might happen.


Improving ext4: bigalloc, inline data, and metadata checksums

Posted Nov 29, 2011 23:44 UTC (Tue) by pr1268 (subscriber, #24648) [Link] (103 responses)

> It is solid and reliable

I'm not so sure about that; I've suffered data corruption in a stand-alone ext4 filesystem with a bunch of OGG Vorbis files—occasionally ogginfo(1) reports corrupt OGG files. Fortunately I have backups.

I'm going back to ext3 at the soonest opportunity. FWIW I'm using a multi-disk LVM setup—I wonder if that's the culprit?

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Nov 29, 2011 23:49 UTC (Tue) by yoe (guest, #25743) [Link] (10 responses)

What you're trying to do with moving back to ext3 is what the Jargon File calls shotgun debugging: trying out some radical move in hopes that this will fix your problem.

Try to nail down whether your problem is LVM, one of your disks dying, or ext4, before changing things like that. Otherwise you'll be debugging for a long time to come...

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Nov 30, 2011 4:49 UTC (Wed) by ringerc (subscriber, #3071) [Link] (8 responses)

... or bad RAM, bad CPU cache, CPU heat, an unrelated kernel bug in (eg) a disk controller, a disk controller firmware bug, a disk firmware bug, or all sorts of other exciting possibilities.

I recently had a batch of disks in a backup server start eating data because of an HDD firmware bug. It does happen.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Nov 30, 2011 8:29 UTC (Wed) by hmh (subscriber, #3838) [Link]

Please disclose the disc model and firmware level; this kind of information is important, as it often helps someone else avoid data loss...

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Nov 30, 2011 12:02 UTC (Wed) by tialaramex (subscriber, #21167) [Link] (6 responses)

"occasionally ogginfo(1) reports corrupt OGG files"

Screams RAM or cache fault to me. It's that word "occasionally" which does it. Bugs tend to be systematic. Their symptoms may be bizarre, but there's usually something consistent about them, because after all someone has specifically (albeit accidentally) programmed the computer to do exactly whatever it was that happened. Even the most subtle Heisenbug will have some sort of pattern to it.

Yoe should be especially suspicious of their "blame ext4" idea if this "corruption" is one or two corrupted bits rather than big holes in the file. Disks don't tend to lose individual bits. Disk controllers don't tend to lose individual bits. Filesystems don't tend to lose individual bits. These things all deal in blocks, when they lose something they will tend to lose really big pieces.

But dying RAM, heat-damaged CPU cache, or a serial link with too little margin of error, those lose bits. Those are the places to look when something mysteriously becomes slightly corrupted.

Low-level network protocols often lose bits. But because there are checksums in so many layers you won't usually see this in a production system even when someone has goofed (e.g. not implemented Ethernet checksums at all) because the other layers act as a safety net.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Nov 30, 2011 12:44 UTC (Wed) by tialaramex (subscriber, #21167) [Link]

Bah, should specify pr1268 not Yoe. Sorry.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Nov 30, 2011 15:42 UTC (Wed) by pr1268 (subscriber, #24648) [Link] (4 responses)

The corruption I was getting was not merely "one or two bits" but rather a hole in the OGG file big enough to cause an audible "skip" in the playback—large enough to believe it was a whole block disappearing from the filesystem. Also, the discussion of write barriers came up; I have noatime,data=ordered,barrier=1 as mount options for this filesystem in my /etc/fstab file—I'm pretty sure those are the "safe" defaults (but I could be wrong).

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Nov 30, 2011 17:31 UTC (Wed) by rillian (subscriber, #11344) [Link] (3 responses)

Ogg files have block-level checksums too.

That means that a few bit errors will cause the decoder to drop ~100 ms of audio at a time, and tools will report this as 'hole in data'. To see if it's disk or filesystem corruption, look for pages of zeros in a hexdump around where the glitch is.
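For anyone wanting to follow that suggestion, a trivial scanner along these lines (a hypothetical helper, not an existing tool) will flag page-sized runs of zeros in a file:

    /* Quick-and-dirty scan for zero-filled 4KB pages; page-sized holes in
     * the middle of an Ogg file would point at the block layer rather than
     * a few flipped bits. Trailing partial pages are ignored. */
    #include <stdio.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <file>\n", argv[0]);
            return 1;
        }

        FILE *f = fopen(argv[1], "rb");
        if (!f) {
            perror("fopen");
            return 1;
        }

        unsigned char buf[4096], zeros[4096] = { 0 };
        long page = 0;
        size_t n;

        while ((n = fread(buf, 1, sizeof(buf), f)) == sizeof(buf)) {
            if (memcmp(buf, zeros, sizeof(buf)) == 0)
                printf("all-zero page at offset %ld\n", page * 4096L);
            page++;
        }
        fclose(f);
        return 0;
    }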

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 1, 2011 3:17 UTC (Thu) by quotemstr (subscriber, #45331) [Link] (2 responses)

Maybe ogg's block-level checksums aren't such a good idea after all. Most likely, a few wrong bits won't affect the sound output much, and a 100ms skip sounds much worse than just playing a single wrong sample. Checksums make sense for things that must stay intact, but I don't think most multimedia benefits from this kind of robustness.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 1, 2011 10:07 UTC (Thu) by mpr22 (subscriber, #60784) [Link]

A few wrong bits in a Vorbis stream seem likely to give you more than just "one wrong sample".

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 1, 2011 18:25 UTC (Thu) by rillian (subscriber, #11344) [Link]

Indeed. A few corrupt bits in a compressed format can result in a whole block of nasty noise in the output.

The idea with the Ogg checksums was to protect the listener's ears (and possibly speakers) from corrupt output. It's also nice to have a built-in check for data corruption in your archives, which is working as designed here.

What you said is valid for video, because we're more tolerant of high frequency visual noise, and because the extra data dimensions and longer prediction intervals mean you can get more useful information from a corrupt frame than you do with audio. Making the checksum optional for the packet data is one of the things we'd do if we ever revised the Ogg format.

shotgun debugging

Posted Dec 2, 2011 22:09 UTC (Fri) by giraffedata (guest, #1954) [Link]

What you're trying to do with moving back to ext3 is what the Jargon File calls shotgun debugging: trying out some radical move in hopes that this will fix your problem.

That's not shotgun debugging (and not what the Jargon File calls it). The salient property of a shotgun isn't that it makes radical changes, but that it makes widespread changes. So you hit what you want to hit without aiming at it.

Shotgun debugging is trying lots of little things, none that you particularly believe will fix the bug.

In this case, the fallback to ext3 is fairly well targeted: the problem came contemporaneously with this one major and known change to the system, so it's not unreasonable to try undoing that change.

The other comments give good reason to believe this is not the best way forward, but it isn't because it's shotgun debugging.

There must be a term for the debugging mistake in which you give too much weight to the one recent change you know about in the area; I don't know what it is. (I've lost count of how many people accused me of breaking their Windows system because after I used it, there was a Putty icon on the desktop and something broke soon after that).

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Nov 30, 2011 0:00 UTC (Wed) by bpepple (subscriber, #50705) [Link] (17 responses)

Depending on how you encoded those ogg files, the corruption you're seeing might not be due to ext4 but to this bug(1).

1. https://bugzilla.redhat.com/show_bug.cgi?id=722667

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Nov 30, 2011 0:32 UTC (Wed) by pr1268 (subscriber, #24648) [Link] (16 responses)

Thanks for the pointer, and thanks also to yoe's reply above. But, my music collection (currently over 10,000 files) has existed for almost four years, ever since I converted the entire collection from MP3 to OGG (via a homemade script which took about a week to run). [1] (I've never converted from FLAC to OGG, although I do have a couple of FLAC files.) I never noticed any corruption in the OGG files until a few months ago, shortly after I did a clean OS re-install (Slackware 13.37) on bare disks (including copying the music files) [2]. I'm all too eager to blame the corruption on ext4 and/or LVM, since those were the only two things that changed immediately prior to the corruption, but you both bring up a good point that maybe I should dig a little deeper into finding the root cause before I jump to conclusions.

[1] I've had this collection of (legitimately acquired) songs for years prior, even having it on NTFS back in my Win2000/XP days. I abandoned Windows (including NTFS) in August 2004, and my music collection was entirely MP3 format (at 320 kbit) since I got my first 200GB hard disk. After seeing the benefits of the OGG Vorbis format, I decided to switch.

[2] I have four physical disks set up as LVM physical volumes, with the volume spanning across all of them for fast I/O performance. I'm not totally impressed with the performance—it is somewhat faster—but that's a whole other discussion.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Nov 30, 2011 0:57 UTC (Wed) by yokem_55 (guest, #10498) [Link] (9 responses)

I would also run smartctl -l error to see if your hard drives are bugging out, and maybe even run memtest86+ overnight to see if you are having memory errors. Weird, widespread data corruption (with metadata intact) in my experience tends to be more hardware-related than anything else.

ext4 experience

Posted Nov 30, 2011 2:11 UTC (Wed) by dskoll (subscriber, #1630) [Link] (5 responses)

I also had a very nasty experience with ext4. A server I built using ext4 suffered a power failure and the file system was completely toast after it powered back up. fsck threw hundreds of errors and I ended up rebuilding from scratch.

I have no idea if ext4 was the cause of the problem, but I've never seen that on an ext3 system. I am very nervous... possibly irrationally so, but I think I'll stick to ext3 for now.

ext4 experience

Posted Nov 30, 2011 4:52 UTC (Wed) by ringerc (subscriber, #3071) [Link] (3 responses)

The usual culprit in those sorts of severe corruption or loss cases is aggressive write-back caching without battery backup. Some cheap RAID controllers will let you enable write-back caching without a BBU, and some HDDs support it too.

Write-back caching on volatile storage without careful use of write barriers and forced flushes *will* cause severe data corruption if the storage is cleared due to (eg) unexpected power loss.

ext4 experience

Posted Nov 30, 2011 9:00 UTC (Wed) by Cato (guest, #7643) [Link] (2 responses)

You are right about battery backup. Every modern hard disk uses writeback caching, and some of them make it hard to ensure that the cache is flushed when the kernel wants a write barrier honored. The size of hard disk caches (typically 32 MB) and the use of journalling filesystems (which concentrate key metadata writes in journal blocks) can mean that a power loss or hard crash loses a large amount of filesystem metadata.

ext4 experience

Posted Nov 30, 2011 12:40 UTC (Wed) by dskoll (subscriber, #1630) [Link] (1 responses)

My system was using Linux Software RAID, so there wasn't a cheap RAID controller in the mix. You could be correct about the hard drives doing caching, but it seems odd that I've never seen this with ext3 but did with ext4. I am still hoping it was simply bad luck, bad timing, and writeback caching... but I'm also still pretty nervous.

ext4 experience

Posted Nov 30, 2011 12:50 UTC (Wed) by dskoll (subscriber, #1630) [Link]

Ah... reading http://serverfault.com/questions/279571/lvm-dangers-and-caveats makes me think I was a victim of LVM and no write barriers. I've followed the suggestions in that article. So maybe I'll give ext4 another try.

ext4 experience

Posted Nov 30, 2011 20:20 UTC (Wed) by walex (subscriber, #69836) [Link]

You have been wishing for O_PONIES!

It is a very well known issue, usually involving unaware sysadmins and cheating developers.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Nov 30, 2011 2:13 UTC (Wed) by nix (subscriber, #2304) [Link] (2 responses)

Quite so. I've been using ext4 atop LVM (atop raid1 md, raid5 md, and Areca hardware RAID) for many years, and have never encountered a single instance of fs corruption which fsck could not repair -- and only one severe enough to prevent mounting which was not attributable to abrupt powerdowns, and *that* was caused by a panic at the end of a suspend, and e2fsck fixed it.

I'm quite willing to believe that bad RAM and the like can cause data corruption, but even when I was running ext4 on a machine with RAM so bad that you couldn't md5sum a 10Mb file three times and get the same answer thrice, I had no serious corruption (though it is true that I didn't engage in major file writing while the RAM was that bad, and I did get the occasional instances of bitflips in the page cache, and oopses every day or so).

bitflips

Posted Nov 30, 2011 12:49 UTC (Wed) by tialaramex (subscriber, #21167) [Link] (1 responses)

"occasional instances of bitflips in the page cache"

To someone who isn't looking for RAM/cache issues as the root cause, those often look just like filesystem corruption of whatever kind. They try to open a file, get an error saying it's corrupted. Or they run a program and it mysteriously crashes.

If you _already know_ you have bad RAM, then you say "Ha, bitflip in page cache" and maybe you flush a cache and try again. But if you've already begun to harbour doubts about Seagate disks, or Dell RAID controllers, or XFS then of course that's what you will tend to blame for the problem.

bitflips

Posted Dec 1, 2011 19:23 UTC (Thu) by nix (subscriber, #2304) [Link]

This does depend on how bad the RAM was. The RAM on this machine was so bad that the fs was not the only thing misbehaving by any means.

Rare bitflips are normally going to be harmless or fixed up by e2fsck, one would hope. There may be places where a single bitflip, written back, toasts the fs, but I'd hope not. (The various fs fuzzing tools would probably have helped comb those out.)

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Nov 30, 2011 10:19 UTC (Wed) by Trou.fr (subscriber, #26289) [Link] (5 responses)

Not related to the current discussion: I hope you are aware that transcoding your MP3 collection to Vorbis only decreased its audio quality: http://wiki.hydrogenaudio.org/index.php?title=Transcoding

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Nov 30, 2011 15:35 UTC (Wed) by pr1268 (subscriber, #24648) [Link] (4 responses)

From that article: "Mp3 to Ogg: Ogg -q6 was required to achieve transparency against the (high-quality) mp3 with difficult samples."

I used -q8 (or higher) when transcoding with oggenc(1); I've done extensive testing by transcoding back-and-forth to different formats (including RIFF WAV) and have never noticed any decrease in audio quality or frequency response, even when measured with a spectrum analyzer. I do value your point, though.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 1, 2011 22:54 UTC (Thu) by job (guest, #670) [Link] (3 responses)

Just to clarify for everyone (who perhaps stumbles in via a web search): converting from mp3 to ogg, or indeed any time you apply lossy compression to something already lossy compressed, can only make the quality worse. The best case here is "at least not audibly worse".

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 10, 2011 1:04 UTC (Sat) by ibukanov (guest, #3942) [Link] (2 responses)

When one approximates another approximation, it is possible that the result will be closer to the original than the initial approximation was. So in theory one can get a better result with an MP3->OGG conversion. For this reason, if tests show that people cannot detect a difference with a *properly* done conversion, then I do not see how one can claim that the conversion can only make the quality worse.

Lossy format conversion

Posted Dec 10, 2011 15:20 UTC (Sat) by corbet (editor, #1) [Link] (1 responses)

Pretty far off-topic, but: it is a rare situation indeed where the removal of information will improve the fidelity of a signal. One might not be able to hear the difference, but I have a hard time imagining how conversion between lossy formats could do anything but degrade the quality. You can't put back something that the first lossy encoding took out, but you can certainly remove parts of the signal that the first encoding preserved.

Lossy format conversion

Posted Dec 12, 2011 2:54 UTC (Mon) by jimparis (guest, #38647) [Link]

You can't replace missing information, but you could still make something that sounds better -- in a subjective sense. For example, maybe the mp3 has harsh artifacts at higher frequencies that the ogg encoder would remove.

It could apply to lossy image transformations too. Consider this sample set of images. An initial image is pixelated (lossy), and that result is then blurred (also lossy). Some might argue that the final result looks better than the intermediate one, even though all it did was throw away more information.

But I do agree that this is off-topic, and that such improvement is probably rare in practice.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Nov 30, 2011 8:50 UTC (Wed) by ebirdie (guest, #512) [Link]

I used to administer a backup server with multidisk LVM volumes and had a persistent problem in one volume with XFS. I even reported the problem to the XFS development mailing list. I never found clear evidence of where the problem came from. My final conclusion was that mkfs.xfs had some bug at the time the volume was created, since newer XFS volumes never developed a similar problem, and after a couple of upgrades to the xfs utilities a newer xfs_repair finally pushed the problem over the edge into nonexistence.

Lesson learned: it pays to keep data on smaller volumes, although it is very, very tempting to stuff data onto ever-bigger volumes and postpone the headache of splitting and managing smaller volumes.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Nov 30, 2011 8:57 UTC (Wed) by Cato (guest, #7643) [Link]

It can be difficult to ensure that writes are getting to disk through caches in the hard disk and kernel (write barriers etc), particularly when using LVM. Turning off hard disk write caching may be necessary in some cases.

This may help: http://serverfault.com/questions/279571/lvm-dangers-and-c...

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Nov 30, 2011 21:01 UTC (Wed) by walex (subscriber, #69836) [Link] (58 responses)

As to corruption, you might want to read some nice papers by CERN presented at HEPiX on silent corruption. It happens everywhere, and it can have subtle effects. But as a legendary quote goes, "as far as we know we never had an undetected error" (mainframe IT manager interviewed by Datamation many years ago) is a common position. Thanks to 'ogginfo' you have discovered just how important end-to-end arguments are.

But the main issue is not that; by all accounts 'ext4' is quite reliable (when on a properly set-up storage system and properly used by applications).

The big problem with 'ext4' is that its only reason for being is to allow Red Hat customers to upgrade existing systems in place, and what Red Hat wants, Red Hat gets (also because they usually pay for that and the community is very grateful).

Other than that, for new "typical" systems almost only JFS and XFS make sense (and perhaps, in the distant future, BTRFS).

In particular JFS should have been the "default" Linux filesystem instead of ext[23] for a long time. Not making JFS the default was probably the single worst strategic decision for Linux (but it can be argued that letting GKH near the kernel was even worse). JFS is still probably (by a significant margin) the best ''all-rounder'' filesystem (XFS beats it in performance only on very parallel large workloads, and it is way more complex, and JFS has two uncommon but amazingly useful special features).

Sure it was very convenient to let people (in particular Red Hat customers) upgrade in place from 'ext' to 'ext2' to 'ext3' to 'ext4' (each in-place upgrade keeping existing files unchanged and usually with terrible performance), but given that when JFS was introduced the Linux base was growing rapidly, new installations could be expected to outnumber old ones very soon, making that point largely moot.

PS: There are other little known good filesystems, like OCFS2 (which is pretty good in non-clustered mode) and NILFS2 (probably going to be very useful on SSDs), but JFS is amazingly still very good. Reiser4 was also very promising (it seems little known that the main developer of BTRFS was also the main developer of Reiser4). As a pet peeve of mine UDF could have been very promising too, as it was quite well suited to RW media like hard disks too (and the Linux implementation almost worked in RW mode on an ordinary partition), and also to SSDs.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Nov 30, 2011 22:07 UTC (Wed) by yokem_55 (guest, #10498) [Link]

I agree that both jfs and disk-based RW udf are way underrated. I use jfs on our laptop as it supposedly tends to have lower cpu usage, and thus is better for reducing power consumption. UDF, if properly supported by the kernel, would make a fantastic fs for accessing data in dual-boot situations, as Windows has pretty good support and it doesn't have the limitations of vfat, nor does it require a nasty, awful, performance-sucking hack like ntfs-3g.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Nov 30, 2011 23:12 UTC (Wed) by Lennie (subscriber, #49641) [Link]

The main developer of ext234fs is currently a Google employee.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 1, 2011 0:53 UTC (Thu) by SLi (subscriber, #53131) [Link] (37 responses)

I very much disagree about JFS or XFS being the preferable filesystem for normal Linux use. Believe me, I've tried them both, benchmarked them both, and on almost all counts ext4 outperforms the two by a really wide margin (note that strictly speaking I'm not comparing the filesystems but their Linux implementations). In addition, any failures have tended to be much worse on JFS and XFS than on ext4.

The only filesystem, years back, that could have been said to outperform ext4 on most counts was ReiserFS 4. Unfortunately, on each of the three times I stress-tested it I hit different bugs that caused data loss.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 1, 2011 2:03 UTC (Thu) by dlang (guest, #313) [Link]

for a lot of people, ext4 is a pretty new filesystem, just now getting to the point where it has enough of a track record to trust data to.

I haven't benchmarked against ext4, but I have done benchmarks with the filesystems prior to it, and I've run into many cases where JFS and XFS are clear winners.

even against ext4, if you have a fileserver situation where you have lots of drives involved, XFS is still likely to be a win; ext4 just doesn't have enough developers/testers with large numbers of disks to work with (this isn't my opinion, it's a statement from Ted Ts'o in response to someone pointing out where ext4 doesn't do as well as XFS with a high-performance disk array)

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 2, 2011 18:52 UTC (Fri) by walex (subscriber, #69836) [Link] (35 responses)

> JFS or XFS being the preferable filesystem on normal Linux use. Believe me, I've tried them both, benchmarked them both, and on almost all counts ext4 outperforms the two by a really wide margin (note that strictly speaking I'm not comparing the filesystems but their Linux implementations). In addition any failures have tended to be much worse on JFS and XFS than on ext4.

Most well-done benchmarks I have seen show mostly equivalent performance, with XFS leading the group in scalability, JFS pretty good across the field, and 'ext4', just like the previous 'ext's, being good only on freshly loaded filesystems (as it packs newly created files pretty densely) and when there is ample caching (no use of 'O_DIRECT'); both fresh loading and caching mask its fundamental, BSD-FFS-derived downsides. It is very, very easy to do meaningless filesystem benchmarks (the vast majority that I see on LWN and elsewhere are worthless).

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 2, 2011 23:15 UTC (Fri) by tytso (subscriber, #9993) [Link] (34 responses)

One caution about JFS. JFS does not issue cache flush (i.e., barrier) requests, which (a) gives it a speed advantage over file systems that do issue cache flush commands as necessary, and (b) makes JFS unsafe against power failures. Which is most of the point of having a journal...

So benchmarking JFS against file systems that are engineered to be safe against power failures, such as ext4 and XFS, isn't particularly fair. You can disable cache flushes for both ext4 and XFS, but would you really want to run in an unsafe configuration for production servers? And JFS doesn't even have an option for enabling barrier support, so you can't make it run safely without fixing the file system code.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 3, 2011 0:56 UTC (Sat) by walex (subscriber, #69836) [Link] (31 responses)

As to JFS and performance and barriers with XFS and ext4:

  • I mentioned JFS as a "general purpose" filesystem, for example desktops, and random servers, in that it should have been the default instead of ext3 (which acquired barriers a bit late).
  • Anyhow on production servers I personally regard battery backup as essential, as barriers and/or disabling write caching both can have a huge impact, depending on workload.
  • The speed tests I have done and seen and that I trust are with barriers disabled and either batteries or write caching off, and with O_DIRECT (it is very difficult for me to like any file system test without O_DIRECT; see the sketch after this list). I think these are fair conditions.
  • Part of the reason why barriers were added to ext3 (and at least initially they had horrible performance) and not JFS is that ext3 was chosen as the default filesystem and thus became community supported and JFS did not.
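A minimal sketch of what an O_DIRECT-based test has to do on Linux (the file name and sizes here are placeholders): the flag bypasses the page cache, and transfers must use suitably aligned buffers, sizes, and offsets.

    /* O_DIRECT sketch: aligned buffer, aligned size, aligned offset. */
    #define _GNU_SOURCE            /* O_DIRECT needs this on glibc */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("testfile", O_CREAT | O_WRONLY | O_DIRECT, 0644);
        if (fd < 0) {
            perror("open(O_DIRECT)");
            return 1;
        }

        void *buf;
        if (posix_memalign(&buf, 4096, 4096)) {   /* aligned buffer required */
            perror("posix_memalign");
            return 1;
        }
        memset(buf, 0x5a, 4096);

        if (pwrite(fd, buf, 4096, 0) != 4096)     /* write bypasses the page cache */
            perror("pwrite");

        free(buf);
        close(fd);
        return 0;
    }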

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 3, 2011 1:56 UTC (Sat) by dlang (guest, #313) [Link] (27 responses)

battery backup does not make disabling barriers safe. Without barriers, stuff leaves RAM to be sent to the disk at unpredictable times, and so if you lose the contents of RAM (power off, reboot, hang, etc.) you can end up with garbage on your disk as a result, even if you have a battery-backed disk controller.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 3, 2011 3:06 UTC (Sat) by raven667 (subscriber, #5198) [Link] (26 responses)

I'm pretty sure, in this context, the OP was talking about battery backed write cache ram on the disk controller

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 3, 2011 6:29 UTC (Sat) by dlang (guest, #313) [Link] (25 responses)

that's what I think as well, and my point is that having battery backed ram on the controller does not make it safe to disable barriers.

it should make barriers very fast so there isn't a big performance hit from leaving them on, but if you disable barriers and think the battery will save you, you are sadly mistaken

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 3, 2011 11:05 UTC (Sat) by nix (subscriber, #2304) [Link] (24 responses)

Really? In that case there's an awful lot of documentation out there that needs rewriting. I was fairly sure that the raison d'etre of battery backup was 1) to make RAID-[56] work in the presence of power failures without data loss, and 2) to eliminate the need to force-flush to disk to ensure data integrity, ever, except if you think your power will be off for so very long that the battery will run out.

If the power is out for months, civilization has probably fallen, and I'll have bigger things to care about than a bit of data loss. Similarly I don't care that battery backup doesn't defend me against people disconnecting the controller or pulling the battery while data is in transit. What other situation does battery backup not defend you against?

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 3, 2011 15:39 UTC (Sat) by dlang (guest, #313) [Link] (15 responses)

there are two stages to writing things to a raid array

1. writing from the OS to the raid card

2. writing from the raid card to the drives

battery backup on the raid card makes step 2 reliable. this means that if the data is written to the raid card it should be considered as safe as if it was on the actual drives (it's not quite that safe, but close enough)

However, without barriers, the data isn't sent from the OS to the raid card in any predictable pattern. It's sent at the whim of the OS cache flushing algorithm. This can result in some data making it to the raid controller and other data not making it there if you have an unclean shutdown. If the data is never sent to the raid controller, then the battery there can't do you any good.

With Barriers, the system can enforce that data gets to raid controller in a particular order, and so the only data that would be lost is the data since the last barrier operation was completed.

note that if you are using software raid, things are much uglier as the OS may have written the stripe to one drive and not to another (barriers only work on a single drive, not across drives). this is one of the places where hardware raid is significantly more robust than software raid.
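As a crude user-space analogue of the ordering requirement being described (illustrative only; real journaling happens inside the filesystem), a "journal" payload must reach stable storage before its commit record, and fsync() serves as the explicit ordering point that barriers or cache flushes provide further down the stack. The file name is just a placeholder.

    /* Write payload, force it out, then write the commit record. Without
     * the flush in the middle, the kernel and the device are free to
     * reorder the two, and a crash can leave a commit record pointing at
     * data that never made it to stable storage. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("journal.img", O_CREAT | O_WRONLY | O_TRUNC, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        const char data[]   = "transaction payload";
        const char commit[] = "COMMIT";

        if (pwrite(fd, data, sizeof(data), 0) < 0)      /* journal blocks */
            perror("pwrite data");
        fsync(fd);                       /* ordering point: payload is durable
                                            before the commit record goes out */
        if (pwrite(fd, commit, sizeof(commit), 4096) < 0)
            perror("pwrite commit");
        fsync(fd);                       /* make the commit record durable too */

        close(fd);
        return 0;
    }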

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 3, 2011 18:04 UTC (Sat) by raven667 (subscriber, #5198) [Link] (14 responses)

Maybe I'm wrong, but I don't think it works that way. Barriers are there to control the write cache after data has been posted to the storage device, to ensure that the device doesn't report completion until the data is actually permanently committed. So I think it already works the way you want; filesystems already manage their writes and caching, AFAIK.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 3, 2011 19:31 UTC (Sat) by dlang (guest, #313) [Link] (11 responses)

I'm not quite sure which part of my statement you are disagreeing with

barriers preserve the ordering of writes throughout the entire disk subsystem, so once the filesystem decides that a barrier needs to be at a particular place, going through a layer of LVM (before it supported barriers) would run the risk of the writes getting out of order

with barriers on software raid, the raid layer won't let the writes on a particular disk get out of order, but it doesn't enforce that all writes before the barrier on disk 1 get written before the writes after the barrier on disk 2

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 4, 2011 6:17 UTC (Sun) by raven667 (subscriber, #5198) [Link] (10 responses)

I guess I was under the impression, incorrect it may be, that the concepts of write barriers were already baked into most responsible filesystems but that the support for working through LVM was recent (in the last 5yrs) and the support for actually issuing the right commands to the storage and having the storage respect them was also more recent. Maybe I'm wrong and barriers as a concept are newer.

In any event there is a bright line between how the kernel handles internal data structures and what the hardware does and for storage with battery backed write cache once an IO is posted to the storage it is as good as done so there is no need to ask the storage to commit its blocks in any particular fashion. The only issue is that the kernel issue the IO requests in a responsible manner.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 4, 2011 6:41 UTC (Sun) by dlang (guest, #313) [Link] (8 responses)

barriers as a concept are not new, but your assumption that filesystems support them is the issue.

per the messages earlier in this thread, JFS does not, for a long time (even after it was the default in Fedora), LVM did not.

so barriers actually working correctly is relatively new (and very recently they have found more efficient ways to enforce ordering than the older version of barriers).

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 4, 2011 11:24 UTC (Sun) by tytso (subscriber, #9993) [Link]

JFS still to this day does not issue barriers / cache flushes.

It shouldn't be that hard to add support, but no one is doing any development work on it.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 4, 2011 16:26 UTC (Sun) by rahulsundaram (subscriber, #21946) [Link] (6 responses)

JFS has never been default in Fedora.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 4, 2011 16:50 UTC (Sun) by dlang (guest, #313) [Link] (5 responses)

I didn't think that I ever implied that it was.

Fedora has actually been rather limited in its support of various filesystems. The kernel supports the different filesystems, but the installer hasn't given you the option of using XFS or JFS for your main filesystem, for example.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 4, 2011 17:41 UTC (Sun) by rahulsundaram (subscriber, #21946) [Link] (4 responses)

It appears you did

"JFS does not, for a long time (even after it was the default in Fedora)"

Your claim about the installer is inaccurate as well. XFS has been a standard option in Fedora for several releases, ever since Red Hat hired Eric Sandeen from SGI to maintain it (and help develop ext4). JFS is a non-standard option.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 4, 2011 19:22 UTC (Sun) by dlang (guest, #313) [Link] (3 responses)

re: JFS, oops, I don't know what I was thinking when I typed that.

re: XFS, I've been using linux since '94, so XFS support in the installer is very recent :-)

I haven't been using Fedora for quite a while; my experience with Red Hat distros is mostly RHEL (and CentOS), which lag behind. I believe that RHEL 5 still didn't support XFS in the installer.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 4, 2011 19:53 UTC (Sun) by rahulsundaram (subscriber, #21946) [Link]

"Very recent" is relative and not quite so accurate either. All versions of Fedora installer have supported XFS. You just had to pass "xfs" as a installer option. Same with jfs or reiserfs. Atleast Fedora 10 beta onwards supports XFS as a standard option without having to do anything

http://fedoraproject.org/wiki/Releases/10/Beta/ReleaseNot...

That is late 2008. RHEL 6 has XFS support as an add-on subscription, and it is supported within the installer as well, IIRC.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 5, 2011 16:15 UTC (Mon) by wookey (guest, #5501) [Link] (1 responses)

I think dlang meant this:
"..., for a long time (even after it was the default in Fedora), LVM did not"

(I parsed it the way rahulsundaram did too - it's not clear).

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 5, 2011 16:59 UTC (Mon) by dlang (guest, #313) [Link]

yes, now that you say that, it reminds me that I meant that for a long time after LVM became the default on Fedora, it didn't support barriers.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Jan 30, 2012 8:50 UTC (Mon) by sbergman27 (guest, #10767) [Link]

Old thread, I know. But why people are still talking about barriers I'm not sure. Abandoning the use of barriers was agreed upon at the 2010 Linux Filesystem Summit. And they completed their departure in 2.6.37, IIRC. Barriers are no more. They don't matter. They've been replaced by FUA, etc.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 8, 2011 17:54 UTC (Thu) by nye (subscriber, #51576) [Link] (1 responses)

>Barriers are there to control the write cache after data has been posted to the storage device, to ensure that the device doesn't report completion until the data is actually permanently committed

Surely what you're describing is a cache flush, not a barrier?

A barrier is intended to control the *order* in which two pieces of data are written, not when or even *if* they're written. A barrier *could* be implemented by issuing a cache flush in between writes (maybe this is what's commonly done in practice?) but in that case you're getting slightly more than you asked for (ie. you're getting durability of the first write), with a corresponding performance impact.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 8, 2011 23:24 UTC (Thu) by raven667 (subscriber, #5198) [Link]

I think you are right; I may have misspoken.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 12, 2011 12:01 UTC (Mon) by jlokier (guest, #52227) [Link] (7 responses)

I believe dlang is right. You need to enable barriers even with battery-backed disk write cache. If the storage device has a good implementation, the cache flush requests (used to implement barriers) will be low overhead.

Some battery-backed disk write caches can commit the RAM to flash storage or something else, on battery power, in the event that the power supply is removed for a long time. These systems don't need a large battery and provide stronger long-term guarantees.

Even ignoring ext3's no barrier default, and LVM missing them for ages, there is the kernel I/O queue (elevator) which can reorder requests. If the filesystem issues barrier requests, the elevator will send writes to the storage device in the correct order. If you turn off barriers in the filesystem when mounting, the kernel elevator is free to send writes out of order; then after a system crash, the system recovery will find inconsistent data from the storage unit. This can happen even after a normal crash such as a kernel panic or hard-reboot, no power loss required.

Whether that can happen when you tell the filesystem not to bother with barriers depends on the filesystem's implementation. To be honest, I don't know how ext3/4, xfs, btrfs etc. behave in that case. I always use barriers :-)

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 12, 2011 15:40 UTC (Mon) by andresfreund (subscriber, #69562) [Link] (6 responses)

I think these days any sensible fs actually waits for the writes to reach storage independent of barrier usage. The only difference with barriers on/off is whether a FUA/barrier/whatever is sent to the device to force the device to write out the data.
I am rather sure at least ext4 and xfs do it that way.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 12, 2011 18:14 UTC (Mon) by dlang (guest, #313) [Link] (5 responses)

no, jlokier is right, barriers are still needed to enforce ordering

there is no modern filesystem that waits for the data to be written before proceeding. Every single filesystem out there will allow its writes to be cached and actually written out later (in some cases, this can be _much_ later)

when the OS finally gets around to writing the data out, it has no idea what the application (or filesystem) cares about, unless there are barriers issued to tell the OS that 'these writes must happen before these other writes'

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 12, 2011 18:15 UTC (Mon) by andresfreund (subscriber, #69562) [Link] (4 responses)

They do wait for journaled data upon journal commit, which is the place where barriers are issued anyway.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 12, 2011 18:39 UTC (Mon) by dlang (guest, #313) [Link] (3 responses)

issuing barriers is _how_ the filesystem 'waits'

it actually doesn't stop processing requests and wait for the confirmation from the disk; it issues a barrier to tell the rest of the storage stack not to reorder around that point, and goes on to process the next request and get it in flight.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 12, 2011 18:53 UTC (Mon) by andresfreund (subscriber, #69562) [Link] (2 responses)

Err. Read the code. xfs uses io completion callbacks and only relies on the contents of the journal after the completion returned. (xlog_sync()->xlog_bdstrat()->xfs_buf_iorequest()->_xfs_buf_ioend()).
jbd does something similar, but I don't want to look it up unless you're really interested.

It worked a little bit more like you describe before 2.6.37, but back then it waited if barriers were disabled.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 13, 2011 13:35 UTC (Tue) by nix (subscriber, #2304) [Link] (1 responses)

Well, this is clear as mud :) guess I'd better do some code reading and figure out wtf the properties of the system actually are...

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 13, 2011 13:38 UTC (Tue) by andresfreund (subscriber, #69562) [Link]

If you want I can give you the approx calltrace for jbd2 as well, I know it took me some time when I looked it up...

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 3, 2011 11:00 UTC (Sat) by nix (subscriber, #2304) [Link] (1 responses)

You got that backwards. Filesystems do not become community-supported because they are chosen as a default (though if they are common, community members *are* more likely to show an interest in them). It is more that they are very unlikely ever to be chosen as a default by anyone except their originator unless they are already community-supported.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 3, 2011 18:06 UTC (Sat) by raven667 (subscriber, #5198) [Link]

Reiserfs3 is an example of that: widely shipped, but unsupported and unsupportable by the community, which led to more stringent support guidelines for future code acceptance.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 3, 2011 20:33 UTC (Sat) by tytso (subscriber, #9993) [Link]

Again, you have this backwards. Ext3 was chosen in part because it was a community-supported file system. From the very beginning, ext2 and ext3 had support from a broad set of developers, at a large number of ***different*** companies. Of the original three major developers of ext2/ext3 (Remy Card, Stephen Tweedie, and myself), only Stephen worked at Red Hat. Remy was a professor at a university in France, and I was working at MIT as the technical lead for Kerberos. And there were many other people submitting contributions to ext3 and choosing to use ext3 in embedded products (including Andrew Morton, when he worked at Digeo between 2001 and 2003).

ext3 was first supported by RHEL as of RHEL 2, which was released May 2003 --- and as you can see from the dates above, we had developers working at a wide range of companies, thus making it a community-supported file system, long before Red Hat supported ext3 in their RHEL product. In contrast, most of the reiserfs developers worked at Namesys (with one or two exceptions, most notably Chris Mason when he was at SuSE), and most of the XFS developers worked at SGI.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 5, 2011 16:29 UTC (Mon) by wookey (guest, #5501) [Link] (1 responses)

I'm very surprised by the assertion that XFS is intended to be safe against power failures, as it directly contradicts my experience. I found it to be a nice filesystem with some cool features (live resizing was really cool back circa 2005/6), but I also found (twice, on different machines) that it was emphatically not designed for systems without a UPS. In both cases a power failure caused significant filesystem corruption (those machines had lvm as well as XFS).

When I managed to repair them I found that many files had big blocks of zeros in them - essentially anything that was in the journal and had not been written. Up to that point I had naively thought that the point of the journal was to keep actual data, not just filesystem metadata. Files that have been 'repaired' by being silently filled with big chunks of zeros did not impress me.

So I now believe that XFS is/was good, but only on properly UPSed servers. Am I wrong about that?

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 5, 2011 17:03 UTC (Mon) by dlang (guest, #313) [Link]

for a very long time, LVM did not support barriers, which means that _any_ filesystem running on top of LVM could not be safe.

XFS caches more stuff than ext does, so a crash loses more stuff.

so XFS or ext* with barriers disabled is not good to use. For a long time, running these things on top of LVM had the side effect of disabling barriers; it's only recently that LVM gained the ability to support them

JFS is not good to use (as it doesn't have barriers at all)

note that when XFS is designed to be safe, that doesn't mean that it won't lose data, just that the metadata will not be corrupt.

the only way to not lose data in a crash/power failure is to do no buffering at all, and that will absolutely kill your performance (and we are talking hundreds of times slower, not just a few percentage points)

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 1, 2011 2:58 UTC (Thu) by tytso (subscriber, #9993) [Link] (2 responses)

The main reason JFS wasn't accepted in the community was because all of the developers worked at IBM. Very few people in the other distributions understood it, which meant that there weren't people who could support it at the distros. One of the things that I've always been very happy about is the fact that developers for ext2/3/4 come from many, many different companies.

JFS was a very good file system, and at the time when it was released, it certainly was better than ext3. But there's a lot more to having a successful open source project beyond having the best technology. The fact that ext2 was well understood, and had a mature set of file system utilities, including tools like "debugfs", are one of the things that do make a huge difference towards people accepting the technology.

At this point, though, ext4 has a number of features which JFS lacks, including delayed allocation, fallocate, punch, and TRIM/discard support. These are all features which I'm sure JFS would have developed if it still had a development community, but when IBM decided to defund the project, there were few or no developers who were not IBM'ers, and so the project stalled out.

---

People who upgrade in place from ext3 to ext4 will see roughly half the performance increase compared to doing a backup, reformat to ext4, and restore operation. But they *do* see a performance increase if they do an upgrade-in-place operation. In fact, even if they don't upgrade the file system image, and use ext4 to mount an ext2 file system image, they will see some performance improvement. So this gives them flexibility, which from a system administrator's point of view, is very, very important!

---

Finally, I find it interesting that you consider OCFS2 "pretty good" in non-clustered mode. OCFS2 is a fork of the ext3 code base[1] (it even uses fs/jbd and now fs/jbd2) with support added for clustered operation, and with support for extents (which ext4 has as well, of course). It doesn't have delayed allocation. But ext4 will be better than ocfs2 in non-clustered mode, simply because it's been optimized for it. The fact that you seem to think OCFS2 is "pretty good", while you don't seem to think much of ext4, makes me wonder if you have some pretty strong biases against the ext[234] file system family.

[1] Ocfs2progs is also a fork of e2fsprogs. Which they did with my blessing, BTW. I'm glad to see that the code that has come out of the ext[234] project have been useful in so many places. Heck, parts of the e2fsprogs (the UUID library, which I relicensed to BSD for Apple's benefit) can be found in Mac OS X! :-)

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 1, 2011 20:25 UTC (Thu) by sniper (guest, #13219) [Link] (1 responses)

Small correction.

ocfs2 is not a fork of ext3 and neither is ocfs2-tools a fork of e2fsprogs. But both have benefited a _lot_ from ext3. In some instances, we copied code (non-indexed dir layout). In some instances, we used a different approach because of collective experience (indexed dir). grep ext3 fs/ocfs2/* for more.

The toolset has a lot more similarities to e2fsprogs. It was modeled after it because it is well designed and to also allow admins to quickly learn it. The tools even use the same parameter names where possible. grep -r e2fsprogs * for more.

BTW, ocfs2 has had bigalloc (aka clusters) since day 1, inline-data since 2.6.24 and metadata checksums since 2.6.29. Yes, it does not have delayed allocations.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Apr 13, 2012 19:30 UTC (Fri) by fragmede (guest, #50925) [Link]

OCFS2 does have snapshots though, which is why I use it. :)

LVM snapshots are a joke if you have *lots* of snapshots, though I haven't looked at btrfs snapshots since it became production ready.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 1, 2011 3:22 UTC (Thu) by tytso (subscriber, #9993) [Link]

One other thought. At least at the beginning, ext4's raison d'etre (its reason for being) was as a stopgap file system until btrfs could be ready. We started with ext3 code which was proven, solid code, and support for delayed allocation, multiblock allocation, and extents had also been in use for quite a while in Clusterfs's Lustre product. So that code wasn't exactly new, either. What I did was integrate Clusterfs's contributions, and then work on stabilizing them so that we would have something that was better than ext3 ready in the short term.

At the time when I started working on ext4, XFS developers were mostly still working for SGI, so there was a similar problem with the distributions not having anyone who could support or debug XFS problems. This has changed more recently, as more and more XFS developers have left SGI (voluntarily or involuntarily) and joined companies such as Red Hat. XFS has also improved its small file performance, which was something it didn't do particularly well simply because SGI didn't optimize for that; its sweet spot was and still is really large files on huge RAID arrays.

One of the reasons why I felt it was necessary to work on ext4 was that everyone I talked to who had created a file system before in the industry, whether it was GPFS (IBM's cluster file system), or Digital Unix's advfs, or Sun's ZFS, gave estimates of somewhere between 50 and 200 person-years of effort before the file system was "ready". Even if we assume that open source development practices would make development go twice as fast, and if we ignore the high end of the range because cluster file systems are hard, I was skeptical it would get done in two years (which was the original estimate) given the number of developers it was likely to attract. Given that btrfs started at the beginning of 2007, and here we are almost at 2012, I'd say my fears were justified.

At this point, I'm actually finding that ext4 has found a second life as a server file system in large cloud data centers. It turns out that if you don't need the fancy-schmancy features that copy-on-write file systems give you, they aren't free. In particular, ZFS has a truly prodigious appetite for memory, and one of the things about cloud servers is that, in order for them to make economic sense, you try to pack as many jobs or VMs onto them as you can, so they are constantly under memory pressure. We've done some further optimizations so that ext4 performs much better when under memory pressure, and I suspect at this point that in a cloud setting, using a CoW file system may simply not make sense.

Once btrfs is ready for some serious benchmarking, it would be interesting to benchmark it under serious memory pressure, and see how well it performs. Previous CoW file systems, such as BSD's lfs two decades ago, and ZFS more recently, have needed a lot of memory to cache metadata blocks, and it will be interesting to see if btrfs has similar issues.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 1, 2011 19:36 UTC (Thu) by nix (subscriber, #2304) [Link] (13 responses)

I shouldn't respond to this troll-bait, but nonetheless...
The big problem with 'ext4' is that its only reason to be is to allow Red Hat customers to upgrade in place existing systems, and what Red Hat wants, Red Hat gets (also because they usually pay for that and the community is very grateful).
Interesting. tytso wasn't working for RH when ext4 started up, and still isn't working for them now. So their influence must be more subtle.

I also see that I was making some sort of horrible mistake by installing ext4 on all my newer systems, but you never make clear what that mistake might have been.

In particular JFS should have been the "default" Linux filesystem instead of ext[23] for a long time. Not making JFS the default was probably the single worst strategic decision for Linux (but it can be argued that letting GKH near the kernel was even worse).
Ah, yeah. Because stable kernels, USB support, mentoring newbies, the driver core, -staging... all these things were bad.

I've been wracking my brains and I can't think of one thing Greg has done that has come to public knowledge and could be considered bad. So this looks like groundless personal animosity to me.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 1, 2011 19:41 UTC (Thu) by andresfreund (subscriber, #69562) [Link]

> I've been wracking my brains and I can't think of one thing Greg has done that has come to public knowledge and could be considered bad. So this looks like groundless personal animosity to me.
Also, uhm. Didn't he work for Suse?

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 2, 2011 11:35 UTC (Fri) by alankila (guest, #47141) [Link] (5 responses)

I dimly recall that the animosity originated from the work with udev, and the removal of devfs. Since I personally don't care one bit about this issue, I have a hard time now reconstructing the relevant arguments, but my guess is that some people really hate the idea that a system needs more than just a kernel to be useful.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 2, 2011 18:40 UTC (Fri) by nix (subscriber, #2304) [Link]

udev is prone to inducing frothing-at-the-mouth even in otherwise reasonable people, due to the udev authors' patent lack of concern for backward compatibility. Twice now they've broken existing systems without so much as a by-your-leave: first with the massive migration of all system-provided rules out of /etc/udev/rules.d into /lib/udev/rules.d (what, you customized them? sucks to be you, now you have to customize them before *building* udev), and more recently with the abrupt movement of /sbin/udevd into /lib/udev without even leaving behind a symlink! Oh, you were starting that at bootup and relying on it to be there? Sorry, we just broke your bootup, your own fault for not reading the release notes! Hope you don't need to downgrade!

(Yes, I read the release notes, so didn't fall into these traps, but FFS, at least the latter problem was trivial to work around -- one line in the makefile to drop a symlink in /sbin -- and they just didn't bother.)

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 2, 2011 23:40 UTC (Fri) by walex (subscriber, #69836) [Link] (3 responses)

As to udev, some people dislike smarmy shysters who replace well-designed, working subsystems seemingly for the sole reason of making a political land grab, because the replacement has more kernel complexity, more userland complexity, and less stability.

The key features of devfs were that it would automatically populate /dev from the kernel with basic device files (major, minor) and then use a very simple userland daemon to add extra aliases as required.

It turns out that, after several attempts to get it to work, the kernel still exports to /sys exactly the same information for udev's benefit, so there has been no migration of functionality from kernel to userspace:

$ ls -ld /dev/tty9
crw--w---- 1 root tty 4, 9 2011-11-28 14:03 /dev/tty9
$ cat /sys/class/tty/tty9/dev
4:9

And the userland part is also far more complex and unstable than devfsd ever was (for example devfs did not require cold start).

And udev is just the most shining example of a series of similar poor decisions (which however seem to have been improving a bit with time).

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 3, 2011 3:16 UTC (Sat) by raven667 (subscriber, #5198) [Link] (1 responses)

I'm not sure that is an accurate portrayal of what happened, on this planet at least. My recollection from the time is that there were fundamental technical problems with the devfs implementation, which is why it was redone into udev. I think those problems were some inherent race conditions on device add/removal, plus concerns about how much policy about /dev file names, permissions, etc. was hard-coded into the kernel and unmodifiable by an end user or sysadmin. That is just my recollection.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 3, 2011 11:07 UTC (Sat) by nix (subscriber, #2304) [Link]

The latter is doubly ironic now that udev forbids you from changing the names given to devices by the kernel. (You can introduce new names, but you can't change the kernel's anymore.)

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 3, 2011 4:04 UTC (Sat) by alankila (guest, #47141) [Link]

To your specific example: obviously the kernel is going to have some kind of (generated) name for a device, and to know the major/minor number pair, which is the very thing that facilitates the communication between userspace and kernel... But udev is still controlling things like permissions and aliases for those devices where necessary.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 3, 2011 0:12 UTC (Sat) by walex (subscriber, #69836) [Link] (5 responses)

«tytso wasn't working for RH when ext4 started up, and still isn't working for them now. So their influence must be more subtle. »

Quite irrelevant: a lot of file systems were somebody's hobby file systems, but they did not achieve prominence and instant integration into mainline while still rather alpha, and Red Hat did not spend enormous amounts of resources on quality assurance to make them production-ready either; quality assurance is a pretty vital detail for file systems, as the Namesys people discovered.

Pointing to tytso is just misleading. Also because ext4 really was seeded by Lustre people before tytso became active on it in his role as ext3 curator (and in 2005, which is 5 years later than when JFS became available).

Similarly for btrfs: it was initiated by Oracle (who have an ext3 installed base), but its main appeal is still as the next in-place upgrade for the Red Hat installed base (thus the interest in trialing it in Fedora, where EL candidate stuff is mass-tested), even if for once it is not just an extension of the ext line but has some interesting new angles.

But considering ext4 on its own is a partial view; one must consider the pre-existing stability, robustness, and performance of JFS and XFS. From a technical point of view ext4 is not that interesting (euphemism), and its sole appeal is in-place upgrades, for which the widest installed base is Red Hat's; to a large extent that could have been said of ext3 too.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 3, 2011 0:52 UTC (Sat) by nix (subscriber, #2304) [Link] (1 responses)

So you're blaming the Lustre people now? You do realise Lustre is not owned by Red Hat, and never was?

And if you're claiming that btrfs is effectively RH-controlled merely because RH customers will benefit, then *everything* that happens to Linux must by your bizarre definition be RH-controlled. That's a hell of a conspiracy: so vague that the coconspirators don't even realise they're conspiring!

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Apr 13, 2012 19:34 UTC (Fri) by fragmede (guest, #50925) [Link]

I thought *Oracle* was a/the big contributor to btrfs...

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 3, 2011 19:45 UTC (Sat) by tytso (subscriber, #9993) [Link] (2 responses)

Sure, and I've always been careful to give the Lustre folk credit for the work that they did between 2003 and 2006 extending ext3 to add support for delayed allocation (which JFS didn't have), multi-block allocation (which JFS didn't have) and extents (OK, JFS had extents).

But you can't have it both ways. If that code had been in use by paying Lustre companies, then it's hardly alpha code, wouldn't you agree?

And why did the Lustre developers at Clusterfs choose ext3? Because the engineers they hired knew ext3, since it was a community-supported file system, whereas JFS was controlled by a core team that was all IBM'ers, and hardly anyone outside of IBM was available who knew JFS really well.

But as others have already pointed out, there was no grand conspiracy to pick ext2/3/4 over its competition. It won partially due to its installed base, and partially because of the availability of developers who understood it (and books written about it, etc., etc., etc.). The way you've been writing, you seem to think there was some secret cabal (at Red Hat?) that made these decisions, and that there was a "mistake" because they didn't choose your favorite file systems.

The reality is that file systems all have trade-offs, and what's good for some people is not so great for others. Take a look at some of the benchmarks at btrfs.boxacle.net; they're a bit old, but they are well done, and they show that across many different workloads at that time (2-3 years ago) there was no one single file system that was the best across all of the different workloads. So anyone who only uses a single workload, or a single hardware configuration, and tries to use that to prove that their favorite file system is the "best" is trying to sell you something, or is a slashdot kiddie with a fan-favorite file system. The reality is a lot more complicated than that, and it's not just about performance. (Truth be told, for many/most use cases, the file system is not the bottleneck.) Issues like availability of engineers to support the file system in a commercial product, the maturity of the userspace support tools, ease of maintainability, etc. are at least as important if not more so.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 3, 2011 20:43 UTC (Sat) by dlang (guest, #313) [Link] (1 responses)

at the time ext3 became the standard, JFS and XFS had little support (single vendor) and were both 'glued on' to linux with heavy compatibility layers.

Add to this the fact that you did not need to reformat your system to use ext3 when upgrading, and the fact that ext3 became the standard (taking over from ext2, which was the prior standard) is a no-brainer, and no conspiracy.

In those days XFS would outperform ext3, but only in benchmarks on massive disk arrays (which were even more out of people's price range at that point than they are today).

XFS was scalable to high-end systems, but its low-end performance was mediocre.

Looking at things nowadays, XFS has had a lot of continuous improvement and integration, both improving its high-end performance and reliability and improving its low-end performance without losing its scalability. There are also more people, working for more companies, supporting it, making it far less of a risk today, with far more in the way of upsides.

JFS has received very little attention after the initial code dump from IBM, and there is now nobody actively maintaining/improving it, so it really isn't a good choice going forward.

reiserfs had some interesting features and performance, but it suffered from some seriously questionable benchmarking (the one that turned me off to it entirely was a spectacular benchmarking test that reiserfs completed in 20 seconds but that took several minutes on ext*; then we discovered that reiserfs defaulted to a 30-second delay before writing everything to disk, so the entire benchmark was complete before any data started getting written to disk, and after that I didn't trust anything they claimed), and a few major problems (the fsck scrambling is a huge one). It was then abandoned by the developer in favor of the future reiser4, with improvements that were submitted being rejected because they were going to be part of the new, incompatible filesystem.

ext4 is in large part a new filesystem whose name just happens to be similar to what people are running, but it has now been out for several years, with developers who are responsive to issues, are a diverse set (no vendor lock-in or dependencies) and are willing to say where the filesystem is not the best choice.

btrfs is still under development (the fact that they don't yet have a fsck tool is telling), is making claims that seem too good to be true, and has already run into several cases of pathological behavior that required significant changes. I wouldn't trust it for anything other than non-critical personal use for another several years.

as a result, I am currently using XFS for the most part, but once I get a chance to do another round of testing, ext4 will probably join it. I have a number of systems that have significant numbers of disks, so XFS will probably remain in use.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 4, 2011 1:12 UTC (Sun) by nix (subscriber, #2304) [Link]

ext4 is in large part a new filesystem whose name just happens to be similar to what people are running
ext4 is ext3 with a bunch of new extensions (some incompatible): indeed, initially the ext4 work was going to be done to ext3, until Linus asked for it to be done in a newly-named clone of the code instead. It says a lot for the ext2 code and disk formats that they've been evolvable to this degree.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 1, 2011 3:46 UTC (Thu) by eli (guest, #11265) [Link] (1 responses)

You mention that you have backups. Have you done a bit-wise comparison of the corrupted files versus the backups to see how much was actually corrupted and if there might be some pattern to the corruption in the file? Given the points made about bit-flipping vs blocks dropped (or replaced by something else), a "cmp -l" might help you find the root cause.
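
(Purely as an illustration of that kind of comparison, here is a minimal Python sketch that reports the byte offsets at which a file and its backup differ, roughly what "cmp -l" shows, so you can see whether the corruption looks like scattered bit flips or whole misplaced blocks. The file names are hypothetical.)

import sys

def diff_offsets(path_a, path_b, chunk=4096):
    # Return the byte offsets at which the two files differ.
    offsets = []
    pos = 0
    with open(path_a, 'rb') as fa, open(path_b, 'rb') as fb:
        while True:
            a = fa.read(chunk)
            b = fb.read(chunk)
            if not a and not b:
                break
            for i in range(max(len(a), len(b))):
                if a[i:i+1] != b[i:i+1]:
                    offsets.append(pos + i)
            pos += chunk
    return offsets

if __name__ == '__main__':
    bad = diff_offsets(sys.argv[1], sys.argv[2])   # e.g. song.ogg backup/song.ogg
    print('%d differing bytes' % len(bad))
    for off in bad[:20]:                           # first few, with 4K block number
        print('offset %d (4K block %d)' % (off, off // 4096))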

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 6, 2011 1:06 UTC (Tue) by pr1268 (subscriber, #24648) [Link]

Thanks for the suggestion; I'll give it a try sometime if/when I find a corrupt OGG file.

Just to bring some closure to this discussion, I wish to make a few points:

  1. In 13 years of using Linux, I've not once ever experienced data corruption on any filesystem running in Linux (prior to the random OGG errors I encountered a few months ago on ext4 and LVM). This includes ext2, ext3, Reiser(3)fs, and Linux's fat and vfat implementations. (Hardware failures are not included, but I've been fortunate not to have much experience with those.)
  2. My existing hard drives are error-free (according to smartctl -l as suggested above). I have two IDE disks and two SATA disks, all Seagate brand (not that this matters, but I do admire their 5-year warranty).
  3. My OGG files are all on their own filesystem/partition (ext4), for which only root has write privileges, and thus my non-privileged account can't accidentally be writing to this filesystem (or so I'd hope).
  4. I've since created an oggcheck.sh script which scans the whole filesystem containing the OGG files and reports any errors found to a log.
  5. Running this script lately (within past 1 month or so) has yielded no errors, so I'm wondering if (hoping that) this is some weird anomaly which has passed.
  6. Any errors experienced on the OGG filesystem would imply the possibility of similar errors on my / filesystem (which includes /home), but I haven't noticed any such errors as of yet (and I hope that none exist!).
  7. I'm all too eager to blame this on ext4 or LVM (or the combination of both), but I'm equally eager to blame this on some strange operator error. Trust me, I'm good at creating these!

Many thanks to everyone's discussion above; I always learn a lot from the comments here on LWN.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 8, 2011 15:34 UTC (Thu) by lopgok (guest, #43164) [Link] (8 responses)

You should generate a checksum for each file in your filesystem.
I wrote a trivial python script to generate a checksum file for each directory's files. If you run it and it finds an existing checksum file, it checks that the files in the directory match that checksum file and reports any that don't.

I wrote it when I had a serverworks chipset on my motherboard that corrupted IDE hard drives when DMA was enabled. However, the utility lets me know there is no bit rot in my files.

It can be found at http://jdeifik.com/ , look for 'md5sum a directory tree'. It is GPL3 code. It works independently from the files being checksummed and independently of the file system. I have found flaky disks that passed every other test with this utility.
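
(The script itself is at the site above; purely as an illustration of the idea, here is a minimal Python sketch along the same lines, assuming a hypothetical per-directory checksum file called MD5SUMS. It is not the author's actual code.)

#!/usr/bin/env python3
# Sketch only: keep a checksum list per directory, verify it on later runs.
import hashlib, os, sys

SUMFILE = 'MD5SUMS'   # hypothetical name for the per-directory checksum file

def file_md5(path):
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            h.update(chunk)
    return h.hexdigest()

def check_dir(d):
    names = sorted(n for n in os.listdir(d)
                   if n != SUMFILE and os.path.isfile(os.path.join(d, n)))
    sums = dict((n, file_md5(os.path.join(d, n))) for n in names)
    sumpath = os.path.join(d, SUMFILE)
    if not os.path.exists(sumpath):
        with open(sumpath, 'w') as f:        # first run: record checksums
            for n in names:
                f.write('%s  %s\n' % (sums[n], n))
        return
    with open(sumpath) as f:                 # later runs: compare
        for line in f:
            old, name = line.rstrip('\n').split('  ', 1)
            if name in sums and sums[name] != old:
                print('MISMATCH: %s' % os.path.join(d, name))

if __name__ == '__main__':
    for root, dirs, files in os.walk(sys.argv[1]):
        check_dir(root)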

The other thing that can corrupt files is memory errors. Many new computers do not support ECC memory. If you care about data integrity, you should use ECC memory. Intel has this feature for their server chips (Xeons), and AMD has this feature for all of their processors (though not all motherboard makers support it).
It is very cheap insurance.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 8, 2011 16:24 UTC (Thu) by nix (subscriber, #2304) [Link] (7 responses)

It is very cheap insurance.
Look at the price differential between the motherboards and CPUs that support ECCRAM and those that do not. Now add in the extra cost of the RAM.

ECCRAM is worthwhile, but it is not at all cheap once you factor all that in.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 8, 2011 17:47 UTC (Thu) by tytso (subscriber, #9993) [Link] (6 responses)

Whether or not it is cheap depends on how much you value your data.

It's like people who balk at spending an extra $200 to mirror their data, or to provide a hot spare for their RAID array. How much would you be willing to spend to get back your data after you discover it's been vaporized? What kind of chances are you willing to take against that eventuality happening?

It will vary from person to person, but traditionally people are terrible at figuring out cost/benefit tradeoffs.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 8, 2011 19:10 UTC (Thu) by nix (subscriber, #2304) [Link] (5 responses)

Yep. That's why I said it was worthwhile. But 'very cheap'? Not unless 'cheap' means 'costs much more money than other alternatives'. Yes, it has benefits, but immediate financial return is not one of them.

(Also, last time I tried you couldn't buy a desktop with ECCRAM for love nor money. Servers, sure, but not desktops. So of course all my work stays on the server with battery-backed hardware RAID and ECCRAM, and I just have to hope the desktop doesn't corrupt it in transit.)

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 9, 2011 0:57 UTC (Fri) by tytso (subscriber, #9993) [Link] (2 responses)

What I have under my desk at work (and I'm quite happy with it) is the Dell T3500 Precision Workstation, which supports up to 24GB of ECC or non-ECC memory. It's not a mini-ATX desktop, but it's definitely not a server, either.

I really like how quickly I can build kernels on this machine. :-)

I'll grant it's not "cheap" in absolute terms, but I've always believed that skimping on a craftsman's tools is false economy.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 9, 2011 7:41 UTC (Fri) by quotemstr (subscriber, #45331) [Link]

> Dell T3500 Precision Workstation, which supports up to 24GB of ECC or non-ECC memory.

I have the same machine. Oddly enough, it only supports 12GB of non-ECC memory, at least according to Dell's manual. How does that happen?

(Also, Intel's processor datasheet claims that several hundred gigabytes of either ECC or non-ECC memory should be supported using the integrated memory controller. I wonder why Dell's system supports less.)

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 9, 2011 12:40 UTC (Fri) by nix (subscriber, #2304) [Link]

Oh, agreed. I've seen multiple rounds of friends deciding to save money on a cheap PC, trying to do real work on it, and finding the result a crashy erratic data-corrupting horror that is almost impossible to debug unless you have a second identical machine to swap parts out of... and losing years of working time to these unreliable nightmares. I pay a bit more (well, OK, quite a lot more) and those problems simply don't happen. I don't think this is ECCRAM, though: I think it's simply a matter of tested components with a decent safety margin rather than bargain-basement junk.

EDAC support for my Nehalem systems landed in mainline a couple of years ago but I'll admit to never having looked into how to get it to tell me what errors may have been corrected, so I have no idea how frequent they might be.

(And if it didn't mean dealing with Dell I might consider one of those machines myself...)

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 9, 2011 13:53 UTC (Fri) by james (subscriber, #1325) [Link] (1 responses)

AMD processors since the Athlon 64 all support ECC, and most Asus AMD boards (even cheap ones) wire the lines up.

Even ECC memory isn't that much more expensive: Crucial do a 2x2GB ECC kit for £27 + VAT ($42 in the US) against £19 ($30).

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 9, 2011 15:19 UTC (Fri) by lopgok (guest, #43164) [Link]

I agree. The last 3 motherboards I have bought were for AMD processors. I bought a 3-core Phenom II, an Asus motherboard, and 4GB of ECC RAM for around $200. I have no idea why Intel only supports ECC on their server motherboards. For me, this is a critical feature. In my experience, many Gigabyte motherboards do not support ECC, so check the motherboard manual, or the list of supported memory, before buying. In fact, AMD supports IBM's Chipkill technology, which will detect 4-bit errors and correct 3-bit errors. In addition, my Asus motherboards support memory scrubbing, which can help detect memory errors in a timely fashion.

If you buy assembled computers and can't get ECC support without spending big bucks, it is time to switch vendors.

It is true that ECC memory is more expensive and less available than non-ECC memory, but the price difference is around 20% or so, and Newegg and others sell a wide variety of ECC memory. Mainstream memory manufacturers, including Kingston sell ECC memory.

Of course, virtually all server computers come with ECC memory.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Jan 15, 2012 3:45 UTC (Sun) by sbergman27 (guest, #10767) [Link]

Mount with "nodelalloc". I have servers which host quite a few Cobol C/ISAM files. I was uncomfortable with the very idea of delayed allocation. But the EXT4 delayed allocation cheer-leading section, headed by Ted T'So, convinced me that after 2.6.30, it would be OK.

The very first time we had a power failure, with a UPS with a bad battery, we experienced corruption in several of those files. Never *ever* *ever* had we experienced such a thing with EXT3. I immediately added nodelalloc as a mount option, and the EXT4 filesystem now seems as resilient as EXT3 ever was. Note that at around the same time as 2.6.30, EXT3 was made less reliable by adding the same 2.6.30 patches to it and making data=writeback the default journalling mode. So if you do move back to EXT3, make sure to mount with data=journal.

BTW, I've not noted any performance differences mounting EXT4 with nodelalloc. Maybe in a side by side benchmark comparison I'd detect something.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Feb 19, 2013 10:23 UTC (Tue) by Cato (guest, #7643) [Link]

For LVM and write caching setup generally, see http://serverfault.com/questions/279571/lvm-dangers-and-c...

You might also like to try ZFS or btrfs - both have enough built-in checksumming that they should detect issues sooner, though in this case Ogg's checksumming is doing that for audio files. With a checksumming FS you could detect whether the corruption is in RAM (seen when writing to file) or on disk (seen when reading from file). ZFS also does periodic scrubbing to validate checksums.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Nov 30, 2011 1:07 UTC (Wed) by cmccabe (guest, #60281) [Link] (39 responses)

bigalloc sounds like it could be really good for SSDs.

Especially if mkfs could somehow align the block clusters to the flash page size.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Nov 30, 2011 8:35 UTC (Wed) by ebirdie (guest, #512) [Link] (4 responses)

...and bigalloc sounds good for volumes holding VM image files or other sparse files.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Nov 30, 2011 21:05 UTC (Wed) by walex (subscriber, #69836) [Link] (3 responses)

Sparse VM virtual disks are usually a very big performance mistake. There is always the option to have VM virtual disks as logical volumes under LVM2, which usually is enormously better than as large files under any filesystem.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 1, 2011 23:04 UTC (Thu) by job (guest, #670) [Link] (2 responses)

Did you benchmark that? I tried it once under VMware and I was surprised to find out that the opposite was true. It may of course have been a fluke or a result of some VMware-specific behaviour and I did not pursue it further. I'm just interested in your findings.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 3, 2011 0:34 UTC (Sat) by walex (subscriber, #69836) [Link] (1 responses)

I had to completely re-base a set of virtual machines that were installed by some less thoughtful predecessor into growable VMware virtual disks. Several virtual disks had several hundred thousand extents (measured with filefrag) and a couple had over a million, all mixed up randomly on the real disk. Performance was horrifying (it did not help that there were two other absurd choices in the setup).

I ended up with just the mostly read-only filesystem in the VM disk, and all the writable subtrees mounted via NFS from the host machine, which was much faster. In particular during backups: I could run the backup program (BackupPC, based on rsync) on the real machine, and remote backup is a very high-IO-load operation; running it inside the virtual machines on a virtual disk was much, much slower.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 6, 2011 21:26 UTC (Tue) by job (guest, #670) [Link]

Sorry, I didn't type that clearly. I meant using LVM volumes as virtual disks, not using sparse virtual disks.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Nov 30, 2011 20:22 UTC (Wed) by walex (subscriber, #69836) [Link] (33 responses)

No journaled filesystem is good for SSDs.

And this is relevant to ext4... exactly how?

Posted Nov 30, 2011 21:33 UTC (Wed) by khim (subscriber, #9252) [Link] (10 responses)

Take a look here. Note the linux version number...

And this is relevant to ext4... exactly how?

Posted Nov 30, 2011 23:16 UTC (Wed) by Lennie (subscriber, #49641) [Link] (9 responses)

Google stores its data on ext4 without a journal:

http://www.youtube.com/watch?v=Wp5Ehw7ByuU

And this is relevant to ext4... exactly how?

Posted Dec 1, 2011 1:01 UTC (Thu) by SLi (subscriber, #53131) [Link] (8 responses)

Then again Google normally has three copies of every piece of important data on different computers, so they're not too concerned about failures due to not journaling.

And this is relevant to ext4... exactly how?

Posted Dec 1, 2011 1:59 UTC (Thu) by dlang (guest, #313) [Link] (7 responses)

journaling (as used by default on every distro I know) almost never prevents data loss, at least not directly. All that journaling does is make it so that the filesystem metadata makes sense; the metadata may be pointing at garbage data, but you aren't as likely to get the metadata corrupted in such a way that continued use of the filesystem after a failure will corrupt existing data.

And this is relevant to ext4... exactly how?

Posted Dec 1, 2011 3:29 UTC (Thu) by tytso (subscriber, #9993) [Link] (6 responses)

fsync() in combination with a journal will protect against data loss.
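
(As an illustration of the pattern being relied on here, a minimal Python sketch of the usual application-side sequence: write a temporary file, fsync() it, rename it over the old file, and fsync() the containing directory. The file names are hypothetical; this is a sketch, not anyone's production code.)

import os

def atomic_replace(path, data):
    # Write the new contents to a temporary file, force them to stable
    # storage with fsync(), then rename over the old file and fsync() the
    # containing directory.  With a journaling filesystem, a crash leaves
    # either the old or the new version, not a half-written one.
    tmp = path + '.tmp'
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(fd, data)
        os.fsync(fd)              # data blocks reach the disk
    finally:
        os.close(fd)
    os.rename(tmp, path)          # atomic within one filesystem
    dirfd = os.open(os.path.dirname(path) or '.', os.O_RDONLY)
    try:
        os.fsync(dirfd)           # make the rename itself durable
    finally:
        os.close(dirfd)

atomic_replace('settings.conf', b'key = value\n')   # hypothetical file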

But yes, a journal by itself has as its primary feature avoiding long fsck times. One nice thing with ext4 is that fsck times are reduced (typically) by a factor of 7-12 times. So a TB file system that previously took 20-25 minutes might now only take 2-3 minutes.

If you are replicating your data anyway because you're using a cluster file system such as Hadoopfs, and you're confident that your data center has appropriate contingencies that mitigate against a simultaneous data-center-wide power loss event (i.e., you have batteries and diesel generators, etc., and you test all of this equipment regularly), then it may be that going without a journal makes sense. You really need to know what you are doing, though, and it requires careful design at the hardware level, at the data center level, and in the storage stack above the local disk file system.

And this is relevant to ext4... exactly how?

Posted Dec 2, 2011 18:55 UTC (Fri) by walex (subscriber, #69836) [Link] (5 responses)

One nice thing with ext4 is that fsck times are reduced (typically) by a factor of 7-12 times. So a TB file system that previously took 20-25 minutes might now only take 2-3 minutes.

That is the case only for fully undamaged filesystems, that is, the common case of a periodic filesystem check. I have never seen any reports that the new 'e2fsck' is faster on damaged filesystems too. And since a damaged 1.5TB 'ext3' filesystem was reported to take 2 months to 'fsck', even a factor of 10 is not going to help a lot.

And this is relevant to ext4... exactly how?

Posted Dec 2, 2011 19:10 UTC (Fri) by dlang (guest, #313) [Link] (1 responses)

I've had to do fsck on multi-TB filesystems after unclean shutdowns, they can take a long time, but time measured in hours (to a couple days for the larger ones). I suspect that if you are taking months, you have some other bottleneck in place as well.

And this is relevant to ext4... exactly how?

Posted Dec 3, 2011 0:40 UTC (Sat) by walex (subscriber, #69836) [Link]

An unclean shutdown usually doesn't cause that much damage; severe damage can however happen with a particularly bad unclean shutdown (lots of stuff in flight, for example on a wide RAID) or with RAM/disk errors. The report I saw was not for an "enterprise" system with battery, ECC and a redundant storage layer.

And this is relevant to ext4... exactly how?

Posted Dec 2, 2011 21:41 UTC (Fri) by nix (subscriber, #2304) [Link] (2 responses)

This has been wrong for years. As long as your filesystem was built with the uninit_bg option (which it is by default), block groups which have never been used will not need to be fscked either, hugely speeding up passes 2 and 5 (at the very least).

Fill up the fs, even once, and this benefit goes away -- but a *lot* of filesystems sit for years mostly empty. fscking those filesystems is very, very fast these days (I've seen subsecond times for mostly-empty multi-Tb filesystems).

And this is relevant to ext4... exactly how?

Posted Dec 2, 2011 22:45 UTC (Fri) by tytso (subscriber, #9993) [Link] (1 responses)

We could fix things so that as you delete files from a full file system, we reduce the high watermark field for each block group's inode table, which would restore the speedups that are lost once the entire inode table has to be scanned. I haven't bothered to do this, but I'll add it to my todo list. (Or someone can send me a patch; it would be trivial to do this in e2fsck, but we could do it in the kernel, too.)

Not all of the improvements in fsck time come from being able to skip reading portions of the inode table. Extent tree blocks are also far more efficient than indirect blocks, and so that contributes to much of the speed improvements of fsck'ing an ext4 filesystem compared to an ext2 or ext3 file system.

And this is relevant to ext4... exactly how?

Posted Dec 2, 2011 23:35 UTC (Fri) by nix (subscriber, #2304) [Link]

We could fix things so that as you delete files from a full file system, we reduce the high watermark field for each block group's inode table
That seems hard to me. It's easy to tell if you need to increase the high watermark when adding a new file: but when you delete one, how can you tell what to reduce the high watermark to without doing a fairly expensive scan?

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 4, 2011 4:25 UTC (Sun) by alankila (guest, #47141) [Link] (21 responses)

Just to get the argument out in the open, what is the basis for making this claim?

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 4, 2011 4:38 UTC (Sun) by dlang (guest, #313) [Link] (19 responses)

for one thing, SSDs are write limited and have effectively zero seek time.

journaling writes data twice, the idea being that the first write goes to a sequential location that is going to be fast, and the following write goes to the final, random location

with no seek time, you should be able to write the data to its final location directly and avoid the second write. All you need to do is enforce the ordering of the writes and you should be just as safe as with a journal, without the extra overhead.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 4, 2011 4:49 UTC (Sun) by mjg59 (subscriber, #23239) [Link] (9 responses)

Doesn't that assume that you can perform a series of atomic operations that will result in a consistent filesystem? If that's not true then you still need to be able to indicate the beginning of a transaction, the contents of that transaction and the end of it. If all of that hits the journal first then you can play the entire transaction, but if you were doing it directly to the filesystem then a poorly timed crash might hit an inconsistent point in the middle.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 4, 2011 5:05 UTC (Sun) by dlang (guest, #313) [Link] (8 responses)

that's true, but the trade-off is that you avoid writing the data to the journal, and then writing to the journal again to indicate that the transaction is finished.

if what you are writing is metadata, it seems like it shouldn't be that hard, since there isn't that much metadata to be written.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 4, 2011 11:32 UTC (Sun) by tytso (subscriber, #9993) [Link] (6 responses)

The problem is that many file system operations require you to update more than one metadata block. For example, when you move a file from one directory to another, you need to add a directory entry into one directory, and remove a directory entry from another.

Or when you allocate a disk block, you need to modify the block allocation bitmap (or whatever data structure you use to indicate that the block is in use) and then update the data structures which map a particular inode's logical to physical block map.

Without a journal, you can't do this atomically, which means the state of the file system is undefined after a unclean/unexpected shutdown of the OS.
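
(To make the idea concrete, here is a heavily simplified Python sketch of write-ahead journaling for a multi-block metadata update: the block images and a commit marker are made durable in the journal before any in-place write, and recovery replays only committed transactions. The journal format, file names and JSON encoding are all invented for the illustration; real implementations such as jbd2 look nothing like this in detail.)

import json, os

BLOCKSIZE = 4096

def journal_write(device_path, journal_path, updates):
    # 'updates' maps block number -> new 4K block image; all of them belong
    # to one logical operation (e.g. a rename touching two directory blocks).
    record = {'blocks': {str(n): data.hex() for n, data in updates.items()}}
    with open(journal_path, 'a') as j:
        j.write(json.dumps(record) + '\n')   # block images first...
        j.flush(); os.fsync(j.fileno())
        j.write('COMMIT\n')                  # ...then the commit marker
        j.flush(); os.fsync(j.fileno())
    # Only once the commit marker is durable do we touch the real locations.
    with open(device_path, 'r+b') as dev:
        for n, data in updates.items():
            dev.seek(n * BLOCKSIZE)
            dev.write(data)
        dev.flush(); os.fsync(dev.fileno())

def journal_replay(device_path, journal_path):
    # After a crash: re-apply every transaction that has a commit marker and
    # ignore an uncommitted tail, so the multi-block update is all-or-nothing.
    with open(journal_path) as j:
        lines = j.read().splitlines()
    with open(device_path, 'r+b') as dev:
        i = 0
        while i + 1 < len(lines) and lines[i + 1] == 'COMMIT':
            for n, hexdata in json.loads(lines[i])['blocks'].items():
                dev.seek(int(n) * BLOCKSIZE)
                dev.write(bytes.fromhex(hexdata))
            i += 2
        dev.flush(); os.fsync(dev.fileno())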

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 4, 2011 17:02 UTC (Sun) by kleptog (subscriber, #1183) [Link] (5 responses)

Indeed. If there were an efficient way to guarantee consistency without a journal there'd be a significant market for it, namely in databases. Journals are a well understood and effective way of managing integrity of complicated disk structures. There are other ways, but journaling beats the others on a number of fronts.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 6, 2011 0:40 UTC (Tue) by cmccabe (guest, #60281) [Link] (4 responses)

There is an efficient way to guarantee consistency without a journal. Soft updates. See http://en.wikipedia.org/wiki/Soft_updates. The main disadvantage of soft updates is that the code seems to be more complex.

Soft updates would not work for databases, because database operations often need to be logged "logically" rather than "physically." For example, when you encounter an update statement that modifies every row of the table, you just want to add the update statement itself to the journal, not the contents of every row.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 6, 2011 1:24 UTC (Tue) by tytso (subscriber, #9993) [Link] (3 responses)

The problems with Soft Updates are quite adequately summed up here, by Val Aurora (formerly Henson): http://lwn.net/Articles/339337/

My favorite line from that article is "...and then I turn to page 8 and my head explodes."

The *BSDs didn't get advanced features such as extended attributes until some 2 or 3 years after Linux. My theory why is that it required someone as smart as Kirk McKusick to be able to modify UFS with Soft Updates to add support for Extended Attributes and ACLs.

Also, note that because of how Soft Updates works, it requires forcing metadata blocks out to disk more frequently than without Soft Updates; it is not free. What's worse, it depends on the disk not reordering write requests, which modern disks do to avoid seeks (in some cases a write can fail to make it onto the platter, in the absence of a Cache Flush request, for 5-10 seconds or more). If you disable the HDD's write caching, you lose a lot of performance on HDDs; if you leave it enabled (which is the default) your data is not safe.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 11, 2011 10:18 UTC (Sun) by vsrinivas (subscriber, #56913) [Link]

FFS w/ soft updates assumes that drives honor write requests in the order they were dispatched. This is not necessarily the case, weakening the guarantees it means to provide. Also, FFS doesn't ever issue what Linux calls 'barriers' (on BSD known as device cache flushes or BUF_CMD_FLUSH).

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 21, 2011 23:09 UTC (Wed) by GalacticDomin8r (guest, #81935) [Link] (1 responses)

> Also, note that because of how Soft Update works, it requires forcing metadata blocks out to disk more frequently than without Soft Updates

Duh. Can you name a file system with integrity features that doesn't introduce a performance penalty? I thought not. The point is that the Soft Updates method is (far) less overhead than most.

> What's worse, it depends on the disk not reordering write requests

Bald-faced lie. The only requirement of SUs is that writes reported as done by the disk driver have indeed safely landed in nonvolatile storage.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 22, 2011 11:32 UTC (Thu) by nix (subscriber, #2304) [Link]

A little civility would be appreciated. Unless you're a minor filesystem deity in pseudonymous disguise, it is reasonable to assume that Ted knows a hell of a lot more about filesystems than you (because he knows a hell of a lot more about filesystems than almost anyone). It's also extremely impolite to accuse someone of lying unless you have proof that what they are saying is not only wrong but maliciously meant. That is very unlikely here.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 4, 2011 17:13 UTC (Sun) by mjg59 (subscriber, #23239) [Link]

The trade-off is that you go from a situation where you can guarantee metadata consistency to one where you can't. SSDs may make the window of inconsistency smaller, but it's still there.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 4, 2011 10:31 UTC (Sun) by alankila (guest, #47141) [Link] (8 responses)

As far as I can tell, the same argument could be made about rotational media. If only there were a way to write things out in atomic chunks that move the FS metadata from one valid state to another, a journal wouldn't be necessary... The journal doesn't even improve performance (or shouldn't), because its contents must be merged with the on-disk data structures at some point anyway.

Anyway, isn't btrfs going to give us journal-less but atomic filesystem modification behavior?

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 4, 2011 11:43 UTC (Sun) by tytso (subscriber, #9993) [Link] (4 responses)

Actually, in some cases btrfs (or any copy on write, or CoW file system) may require more metadata blocks to be written than a traditional journalling file system design. That's because even though a CoW file system doesn't have a journal, when you update a metadata block, you have to update all of the metadata blocks that point to it.

So if you modify a node at the bottom of the b-tree, you write a new copy of the leaf block, but then you need to write a copy of its parent node with a pointer to the new leaf block, and then you need to write a copy of its grandparent, with a pointer to the new parent node, all the way up to the root of the tree. This also implies that all of these nodes had better be in memory, or you will need to read them into memory before you can write them back out. Which is why CoW file systems tend to be very memory hungry; if you are under a lot of memory pressure because you're running a cloud server, and are trying to keep lots of VM's packed into a server (or are on an EC2 VM where extra memory costs $$$$), good luck to you.

At least in theory, CoW file systems will try to batch multiple file system operations into a single big transaction (just as ext3 will try to batch many file system operations into a single transaction, to try to minimize writes to the journal). But if you have a really fsync()-happy workload, there definitely could be situations where a CoW file system like btrfs or ZFS could end up needing to update more blocks on an SSD than a traditional update-in-place file system with journaling, such as ext3 or XFS.
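
(A toy Python sketch of the path copying just described: updating one leaf produces new copies of every node on the path to the root, while the rest of the tree is shared between the old and new versions. This illustrates the general CoW technique, not btrfs's actual data structures.)

class Node(object):
    # Toy immutable tree node: leaves carry data, interior nodes carry a
    # dict of child pointers.
    def __init__(self, data=None, children=None):
        self.data = data
        self.children = children or {}

def cow_update(node, path, new_data):
    # Return a *new* root: every node along 'path' is copied, everything
    # else is shared with the old tree.
    if not path:
        return Node(data=new_data)                      # new leaf block
    new_children = dict(node.children)                  # copy this level...
    new_children[path[0]] = cow_update(node.children[path[0]],
                                       path[1:], new_data)
    return Node(children=new_children)                  # ...all the way up

# The old root remains a valid snapshot; the new root shares untouched nodes.
root_v1 = Node(children={'a': Node(children={'x': Node(data='old')})})
root_v2 = cow_update(root_v1, ['a', 'x'], 'new')
assert root_v1.children['a'].children['x'].data == 'old'
assert root_v2.children['a'].children['x'].data == 'new'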

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 12, 2011 12:13 UTC (Mon) by jlokier (guest, #52227) [Link] (3 responses)

I don't know if btrfs works as you describe, but it is certainly possible to implement a CoW filesystem without "writing all the way up the tree". Think about how journals work without requiring updates to the superblocks that point to them. If btrfs doesn't use that, it's an optimisation waiting to happen.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 24, 2011 20:56 UTC (Sat) by rich0 (guest, #55509) [Link] (2 responses)

You can't implement a COW tree without writing all the way up the tree. You write a new node to the tree, so you have to have the tree point to it. You either copy an existing parent node and fix it, or you overwrite it in place. If you do the latter, then you aren't doing COW. If you copy the parent node, then its parent is pointing to the wrong place, all the way up to the root.

I believe Btrfs actually uses a journal, and then updates the tree every 30 seconds. This is a compromise between pure journal-less COW behavior and the memory-hungry behavior described above. So the tree itself is always in a clean state (if the change propagates to the root then it points to an up-to-date clean tree, and if it doesn't propagate to the root then it points to a stale clean tree), and then the journal can be replayed to catch the last 30 seconds' worth of writes.

I believe that the Btrfs journal does effectively protect both data and metadata (equivalent to data=ordered). Since data is not overwritten in place, you end up with what appear to be atomic writes, I think (within a single file only).

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 24, 2011 22:17 UTC (Sat) by jlokier (guest, #52227) [Link] (1 responses)

You can't implement a COW tree without writing all the way up the tree. You write a new node to the tree, so you have to have the tree point to it. You either copy an existing parent node and fix it, or you overwrite it in place. If you do the latter, then you aren't doing COW. If you copy the parent node, then its parent is pointing to the wrong place, all the way up to the root.

In fact you can. The simplest illustration: for every logical tree node, allocate 2 nodes on storage, and replace every pointer in the current interior-node format with 2 pointers, pointing to the 2 allocated storage nodes. Those 2 storage nodes each contain a 2-bit version number. The one with the larger version number (using wraparound comparison) is the "current node", and the other is the "potential node".

To update a tree node in COW fashion, without writing all the way up the tree on every update, simply locate the tree node's "potential node" partner and overwrite that in place with a version number 1 higher than the existing tree node's. The tree is thus updated. The update is made atomic using the same methods needed for a robust journal: relying on the medium writing a single sector atomically, or using a node checksum, or writing the version number at both the start and the end if the medium is sure to write sequentially.
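
(A minimal Python sketch of that two-slot scheme, with an in-memory dict standing in for the storage slots; the node IDs, payloads and the 2-bit wraparound comparison are just for illustration.)

# Each logical node owns two storage slots, each tagged with a 2-bit version;
# the slot with the newer version (wraparound compare) is current, and an
# update overwrites the *other* slot in place.
slots = {}          # (node_id, slot) -> (version, payload); stands in for disk

def newer(a, b):
    # With versions only ever one step apart, "a is newer than b" means
    # a == b + 1 modulo 4.
    return (a - b) % 4 == 1

def current_slot(node_id):
    va, _ = slots[(node_id, 0)]
    vb, _ = slots[(node_id, 1)]
    return 0 if newer(va, vb) else 1

def read_node(node_id):
    return slots[(node_id, current_slot(node_id))][1]

def update_node(node_id, payload):
    cur = current_slot(node_id)
    version = slots[(node_id, cur)][0]
    # Overwrite the non-current slot; no parent pointers need rewriting.
    slots[(node_id, 1 - cur)] = ((version + 1) % 4, payload)

slots[(7, 0)] = (1, 'old contents')
slots[(7, 1)] = (0, 'stale')
update_node(7, 'new contents')
assert read_node(7) == 'new contents'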

Note I didn't say it made reading any faster :-) (Though with non-seeking media, speed might not be a problem.)

That method is clearly space inefficient and reads slowly (unless you can cache a lot of the node selections). It can be made more efficient in a variety of ways, such as sharing "potential node" space among multiple potential nodes, or having a few pre-allocated pools of "potential node" space which migrate into the explicit tree with a delay - very much like multiple classical journals. One extreme of that strategy is a classical journal, which can be viewed as every tree node having an implicit reference to the same range of locations, any of which might be regarded as containing that node's latest version overriding the explicit tree structure.

You can imagine there a variety of structures with space and behaviour in between a single, flat journal and an explicitly replicated tree of micro-journalled nodes.

The "replay" employed by classical journals also has an analogue: preloading of node selections either on mount, or lazily as parts of the tree are first read in after mounting, potentially updating tree nodes at preload time to reduce the number of pointer traversals on future reads.

The modern trick of "mounted dirty" bits for large block ranges in some filesystems to reduce fsck time, also has a natural analogue: Dirty subtree bits, indicating whether the "potential" pointers (implicit or explicit) need to be followed or can be ignored. Those bits must be set with a barrier in advance of using the pointers, but they don't have to be set again for new updates after that, and can be cleaned in a variety of ways; one of which is the preload mentioned above.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted May 29, 2012 8:49 UTC (Tue) by marcH (subscriber, #57642) [Link]

I'm not 100% sure but I think you just meant:

"You can implement a COW tree without writing all the way up the tree if your tree implements versioning".

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 4, 2011 13:11 UTC (Sun) by dlang (guest, #313) [Link] (2 responses)

SSDs are inherently COW: they can't modify a block in place, but have to copy it (and all the other disk blocks that make up the erase block) to a new erase block.

this is ok with large streaming writes, but horrible with many small writes to the same area of disk.

the journal is many small writes to the same area of disk, exactly the worst case for an SSD

also with rotational media, writing all the blocks in place requires many seeks before the data can be considered safe, and if you need to write the blocks in a particular order, you may end up seeking back and forth across the disk. With an SSD the order the blocks are written in doesn't affect how long it takes to write them.

by the way, i'm not the OP who said that all journaling filesystems are bad on SSDs, I'm just pointing out some reasons why this could be the case.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 4, 2011 17:39 UTC (Sun) by tytso (subscriber, #9993) [Link] (1 responses)

Flash erase blocks are around a megabyte these days. So all modern SSDs use a Flash Translation Layer (FTL) that allows writes smaller than an erase block to get grouped together into a single erase block. So it's simply not true that if you do a small random write of 16k, the SSD will need to copy all of the other disk blocks that make up the erase block.

This might be the case for cheap MMC or SD cards that are designed for use in digital cameras, but an SSD which is meant for use in a computer will have a much more sophisticated FTL than that.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 4, 2011 19:38 UTC (Sun) by dlang (guest, #313) [Link]

if you modify 4K of data that is part of an eraseblock, that eraseblock no longer contains entirely valid info, so since the drive can't overwrite the 4K in place it will need to write it to a new eraseblock.

yes, in theory it could mark that 4k of data as being obsolete and only write new data to a new eraseblock, but that would lead to fragmentation where the disk could have 256 1M chunks, each with 4K of obsolete data in them, and to regain any space it would then need to re-write 255M of data.

given the performance impact of stalling for this long on a write (not to mention the problems you would run into if you didn't have that many blank eraseblocks available), I would assume that if you re-write a 4k chunk, when the drive writes that data it will re-write the rest of the eraseblock as well, so that it can free up the old eraseblock

the flash translation layer lets it mix the logical blocks in the eraseblocks, and the drives probably do something in between the two extremes I listed above (so they probably track a few holes, but not too many)
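
(A toy Python sketch of the kind of mapping being discussed: logical pages map to (erase block, offset), small writes go to a fresh page while the old copy becomes stale, and reclaiming an erase block means copying out only its still-valid pages. This illustrates the general idea, not any real SSD firmware.)

PAGES_PER_ERASEBLOCK = 256        # e.g. 256 x 4K pages = 1MB erase block

class ToyFTL(object):
    def __init__(self, eraseblocks):
        self.free_blocks = list(range(eraseblocks))
        self.mapping = {}         # logical page -> (eraseblock, offset)
        self.live = {}            # eraseblock -> offsets still holding valid data
        self.open_block = self.free_blocks.pop()
        self.live[self.open_block] = set()
        self.next_offset = 0

    def write(self, logical_page):
        # Any write, however small, goes to the next free page of the open
        # erase block; the old copy (if any) merely becomes stale.
        if logical_page in self.mapping:
            old_eb, old_off = self.mapping[logical_page]
            self.live[old_eb].discard(old_off)
        if self.next_offset == PAGES_PER_ERASEBLOCK:
            self.open_block = self.free_blocks.pop()   # no out-of-space handling
            self.live[self.open_block] = set()
            self.next_offset = 0
        self.mapping[logical_page] = (self.open_block, self.next_offset)
        self.live[self.open_block].add(self.next_offset)
        self.next_offset += 1

    def garbage_collect(self, eraseblock):
        # Reclaiming a block means copying out only its still-valid pages: a
        # block full of stale pages is erased for free, one full of live
        # pages costs a whole block's worth of extra writes.  (The sketch
        # assumes we never collect the currently open block.)
        victims = [lp for lp, (eb, _) in self.mapping.items() if eb == eraseblock]
        for lp in victims:
            self.write(lp)
        del self.live[eraseblock]
        self.free_blocks.append(eraseblock)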

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 5, 2011 1:34 UTC (Mon) by cmccabe (guest, #60281) [Link]

walex said:
> > No journaled filesystem is good for SSDs

alankila said:
> Just to get the argument out in the open, what is the basis
> for making this claim?

Well, SSDs have a limited number of write cycles. With metadata journaling, you're effectively writing all the metadata changes twice instead of once. That will wear out the flash faster. I think a filesystem based on soft updates might do well on SSDs.

Of course the optimal thing would be if the hardware would just expose an actual MTD interface and let us use NilFS or UBIFS. But so far, that shows no signs of happening. The main reason seems to be that Windows is not able to use raw MTD devices, and most SSDs are sold into the traditional Windows desktop market.

Valerie Aurora also wrote an excellent article about the similarities between SSD block remapping layers and log structured filesystems here: http://lwn.net/Articles/353411/

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Nov 30, 2011 2:05 UTC (Wed) by nix (subscriber, #2304) [Link] (7 responses)

rather than allocate single blocks, a filesystem using clusters will allocate them in larger groups
Like FAT, only less forced-by-misdesign. Everything old is new again...

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 1, 2011 10:46 UTC (Thu) by trasz (guest, #45786) [Link] (6 responses)

Or rather, like UFS did for over a decade. In UFS, "blocks", which are kind of what's called clusters here, are 32kB by default, and consist of "fragments" - 4kB by default.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 1, 2011 16:36 UTC (Thu) by tytso (subscriber, #9993) [Link] (5 responses)

Which UFS are you talking about? UFS as found in BSD 4.4 and FreeBSD uses a default cluster size of 8k with 1k fragments.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 1, 2011 20:03 UTC (Thu) by trasz (guest, #45786) [Link] (4 responses)

UFS as found in FreeBSD 10 uses 32kB/4kB. Older versions used 16/2kB sizes since, IIRC, FreeBSD 4. See newfs(8) manual page (http://www.freebsd.org/cgi/man.cgi?newfs).

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 2, 2011 23:43 UTC (Fri) by walex (subscriber, #69836) [Link] (3 responses)

«UFS as found in FreeBSD 10 uses 32kB/4kB»

That is terrible, because it means that, except for the tail, the system enforces a fixed 32KiB read-ahead and write-behind, rather than an adaptive (or at least tunable) one.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 3, 2011 1:01 UTC (Sat) by walex (subscriber, #69836) [Link] (1 responses)

BTW, many years ago I persuaded the original developer of ext not to implement the demented BSD FFS idea of large block/small fragment, arguing that adaptive read-ahead and write-behind would give better dynamic performance, and adaptive allocate-ahead (reservations) better contiguity, without the downsides.

Not everything got implemented as I suggested, but at least all the absurd complications of large block/small fragment (for example, the page-mapping issues) were avoided in Linux, as well as the implied fixed read-ahead/write-behind/allocate-ahead.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 3, 2011 11:06 UTC (Sat) by nix (subscriber, #2304) [Link]

But of course we have had page-mapping-related bugs, in the *other* direction, from people building filesystems with sub-page-size blocks. (Support for this case is unavoidable unless you want filesystems not to be portable from machines with a large page size to machines with a small one, but it's still tricky stuff.)

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Jan 3, 2012 17:38 UTC (Tue) by jsdyson (guest, #71944) [Link]

Actually, as the author of earlier forms of the FreeBSD readahead/writebehind, I do know that FreeBSD can be very aggressive with larger reads/writes than just the block size. One really big advantage of the FreeBSD buffering is that the length of the queues/pending writes is generally planned to be smaller, thereby avoiding that nasty sluggish feeling (or apparent stopping) that occurs with horribly large pending writes.

bigalloc

Posted Nov 30, 2011 12:36 UTC (Wed) by tialaramex (subscriber, #21167) [Link] (9 responses)

How will this interact with huge, almost randomly written files?

I have an application where what we really want is lots of RAM. But RAM is expensive. We can afford to buy 2TB of RAM, but not 20TB. However, we can afford to go quite a lot slower than RAM sometimes, so long as our averages are good enough, so our solution is to use SSDs plus RAM, via mmap().

When we're lucky, the page we want is in RAM, we update it, and the kernel lazily writes it back to an SSD whenever. When we're unlucky, the SSD has to retrieve the page we need, which takes longer and of course forces one of the other pages out of cache, in the worst case forcing it to wait for that page to be written first. We can arrange to be somewhat luckier than pure chance would dictate, on average, but we certainly can't make this into a nice linear operation.

Right now, with 4096 byte pages, the performance is... well, we're working on it but it's already surprisingly good. But if bigalloc clusters mean the unit of caching is larger, it seems like bad news for us.
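
For readers curious how that setup looks mechanically, here is a minimal sketch of the mmap() approach described above, assuming a single large data file on the SSD; the path, size, and record layout are hypothetical, and error handling is pared down.

/* Minimal sketch of the "RAM plus SSD via mmap()" scheme described
 * above. The file is assumed to already exist and to be at least
 * 'len' bytes long; path, size and record layout are hypothetical. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const char  *path = "/ssd/dataset.bin";   /* hypothetical data file */
    const size_t len  = (size_t)1 << 34;      /* say, 16 GiB of records */

    int fd = open(path, O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    /* MAP_SHARED: dirtied pages are written back to the file on the
     * SSD lazily, whenever the kernel sees fit. */
    char *base = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    /* Update one 4 KiB record: if it is resident this is a pure RAM
     * operation; if not, the kernel fetches the page from the SSD,
     * possibly evicting (and writing back) some other page first. */
    char *record = base + (size_t)12345 * 4096;
    memset(record, 0, 4096);

    munmap(base, len);
    close(fd);
    return 0;
}

The real application will layer its own data structures and eviction tricks on top; the sketch only shows where the kernel's page cache enters the picture.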

bigalloc

Posted Nov 30, 2011 14:34 UTC (Wed) by Seegras (guest, #20463) [Link] (3 responses)

> But if bigalloc clusters mean the unit of caching is larger, it seems like
> bad news for us.

You're not supposed to make filesystems with bigalloc clusters if you don't want them or if it hampers your performance.

bigalloc

Posted Nov 30, 2011 18:22 UTC (Wed) by tialaramex (subscriber, #21167) [Link] (2 responses)

Ah, OK, I somehow got the idea this was the unavoidable future, rather than another option. Nothing to worry about then. Thanks for pointing that out.

bigalloc

Posted Nov 30, 2011 19:19 UTC (Wed) by jimparis (guest, #38647) [Link] (1 responses)

It also seems that you might not want to use a filesystem at all for that type of application, but instead just mmap a block device directly.

bigalloc

Posted Jun 14, 2012 14:19 UTC (Thu) by Klavs (guest, #10563) [Link]

Like Varnish does :)

bigalloc

Posted Nov 30, 2011 20:02 UTC (Wed) by iabervon (subscriber, #722) [Link] (1 responses)

My impression is that bigalloc doesn't affect the unit of caching. Rather, it affects the unit of disk block allocation, meaning that pages 0-1023 are adjacent on your SSD and the filesystem metadata specifies that that 4M is in use and the inode has a single disk location to find it, but the pages are still accessed independently.

bigalloc

Posted Nov 30, 2011 21:16 UTC (Wed) by walex (subscriber, #69836) [Link]

Note that 'ext4' supports extents, so files can get allocated with very large contiguous extents already, for example for a 70MB file:

#  du -sm /usr/share/icons/oxygen/icon-theme.cache
69      /usr/share/icons/oxygen/icon-theme.cache
#  filefrag /usr/share/icons/oxygen/icon-theme.cache
/usr/share/icons/oxygen/icon-theme.cache: 1 extent found
#  df -T /usr/share/icons/oxygen/icon-theme.cache
Filesystem    Type   1M-blocks      Used Available Use% Mounted on
/dev/sda3     ext4       25383     12558     11545  53% /

But so far the free space has been tracked in block-sized units, and the new thing seems to change the amount of free space accounted for by each bit in the free space bitmap.

Which means that, as surmised, the granularity of allocation has changed (for example, the minimum extent size).
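
To put a rough, purely illustrative number on what that coarser bitmap granularity buys, compare the aggregate bitmap size for a 1 TiB filesystem tracked per 4 KiB block versus per 64 KiB cluster:

/* Rough illustration of why a coarser allocation unit shrinks the
 * allocation bitmaps: one bit per 4 KiB block versus one bit per
 * 64 KiB cluster, aggregated over a 1 TiB filesystem. This ignores
 * ext4's actual block-group layout; the numbers are illustrative. */
#include <stdio.h>

int main(void)
{
    const unsigned long long fs_bytes = 1ULL << 40;   /* 1 TiB           */
    const unsigned long long block    = 4ULL  << 10;  /* 4 KiB blocks    */
    const unsigned long long cluster  = 64ULL << 10;  /* 64 KiB clusters */

    unsigned long long block_bitmap   = fs_bytes / block   / 8;  /* bytes */
    unsigned long long cluster_bitmap = fs_bytes / cluster / 8;  /* bytes */

    printf("bitmap, 4 KiB blocks:    %llu MiB\n", block_bitmap   >> 20); /* 32 */
    printf("bitmap, 64 KiB clusters: %llu MiB\n", cluster_bitmap >> 20); /*  2 */
    return 0;
}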

bigalloc

Posted Nov 30, 2011 21:58 UTC (Wed) by cmccabe (guest, #60281) [Link] (2 responses)

> Right now, with 4096 byte pages, the performance is... well, we're working
> on it but it's already surprisingly good. But if bigalloc clusters mean
> the unit of caching is larger, it seems like bad news for us.

mmap is such an elegant facility, but it lacks a few things. The first is a way to handle I/O errors reasonably. The second is a way to do nonblocking I/O. You can sort of fudge the second point by using mincore(), but it doesn't work that well.

As far as performance goes... SSDs are great at random reads, but small random writes are often not so good. Don't assume that you can write small chunks anywhere on the flash "for free." The firmware has to do a lot of write coalescing to even make that possible, let alone fast.

bigalloc might very well be slower for you if you have poor locality: for example, if most data structures are smaller than 4k and you never access two sequential data structures. If you have bigger data structures, bigalloc could very well end up being faster.

If you have poor locality, you should try reducing readahead in /sys/block/sda/queue/read_ahead_kb or wherever. There's no point reading bytes that you're not going to access.

bigalloc

Posted Dec 1, 2011 16:56 UTC (Thu) by tialaramex (subscriber, #21167) [Link] (1 responses)

On I/O errors: Sure. We even catch the occasional implausible signal like SIGBUS if, for example, we write to a previously unallocated block on a full filesystem. But in practice, other than giving developers like myself a terrible shock when it first happened (a SIGBUS? what the hell did I touch?), such behaviour isn't too troubling for us. If an SSD actually dies, we're out of action for some time no matter what, just as we would be if the RAM failed. We anticipate this happening once in a while; it isn't a reason to give up and go home.

Yes, our locality is fairly poor such that readahead is actively bad news. The data structures which dominate are exactly page-sized. We may end up changing anything from a few bytes to a whole page (and even when we write a whole page we need the old contents to determine the new contents), but the chance we then move on to the linearly next (or previous) page is negligible.

My impression was that readahead would be disabled by suitable incantations of madvise(). Is that wrong? It didn't benchmark as wrong on toy systems, but I would have to check whether we actually re-tested on the big machines.

bigalloc

Posted Dec 1, 2011 20:36 UTC (Thu) by cmccabe (guest, #60281) [Link]

> We even catch some implausible signal like SIGBUS if we write to
> a previously unallocated block on a full filesystem for example.

If I were you, I'd use posix_fallocate to de-sparsify (manifest?) all of the blocks. Then you don't have unpleasant surprises waiting for you later.
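
A minimal sketch of that suggestion, assuming a single preallocated data file (the path and size are hypothetical):

/* Sketch of the posix_fallocate() suggestion above: reserve all of
 * the file's blocks up front so that a later write through the
 * mapping never has to allocate on a full filesystem (and so never
 * turns into a surprise SIGBUS). Path and size are hypothetical. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/ssd/dataset.bin";   /* hypothetical data file */
    const off_t len  = (off_t)1 << 34;       /* say, 16 GiB */

    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0) { perror("open"); return 1; }

    /* Unlike ftruncate(), this actually reserves the blocks, so it
     * fails up front with ENOSPC rather than letting a page fault
     * fail later. Note that it returns an error number directly. */
    int err = posix_fallocate(fd, 0, len);
    if (err != 0)
        fprintf(stderr, "posix_fallocate: %s\n", strerror(err));

    close(fd);
    return err ? 1 : 0;
}

On filesystems with native fallocate() support, such as ext4, this reserves the space as unwritten extents rather than writing it out, so it is reasonably cheap.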

> My impression was that readahead would be disabled by suitable
> incantations of madvise(). Is that wrong? It didn't benchmark as wrong on
> toy systems, but I would have to check whether we actually re-tested on
> the big machines.

I looked at mm/filemap.c and found this:

> static void do_sync_mmap_readahead(...) {
>         ...
>         if (VM_RandomReadHint(vma))
>                 return;
>         ...
> }

So I'm guessing you're safe with MADV_RANDOM. But it might be wise to check the source of the kernel you're using in case something is different in that version.
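
For reference, a small standalone illustration of that hint, using a hypothetical data file; on kernels with the check quoted above, faults in a MADV_RANDOM mapping skip readahead.

/* Map a file read-only, mark the mapping MADV_RANDOM, and fault in a
 * single page. The default path is hypothetical; MADV_RANDOM is only
 * a hint, so a failure of madvise() is reported but not fatal. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "/ssd/dataset.bin";

    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) {
        fprintf(stderr, "fstat failed or empty file\n");
        return 1;
    }

    char *base = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    if (madvise(base, st.st_size, MADV_RANDOM) != 0)
        perror("madvise");   /* advisory only; not fatal */

    /* Touch one page; with MADV_RANDOM the kernel should not read
     * ahead around the faulting page. */
    volatile char c = base[0];
    (void)c;

    munmap(base, st.st_size);
    close(fd);
    return 0;
}

The advice applies per mapping, so it can coexist with normal readahead behaviour elsewhere on the system.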

e2fsprogs

Posted Nov 30, 2011 16:00 UTC (Wed) by corbet (editor, #1) [Link] (2 responses)

My usual luck holds... the upcoming e2fsprogs release mentioned in the article became official moments after my last look at the ext4 mailing list before posting.

e2fsprogs

Posted Nov 30, 2011 19:02 UTC (Wed) by zuki (subscriber, #41808) [Link]

... and the release announcement is very impressive! The number of new features is staggering.

e2fsprogs

Posted Dec 1, 2011 15:17 UTC (Thu) by obi (guest, #5784) [Link]

Well, you know we wouldn't have releases of anything if you didn't write articles about them first! All progress would halt! ;-)

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Nov 30, 2011 21:22 UTC (Wed) by mleu (guest, #73224) [Link] (10 responses)

As a SLES customer reading these (great) LWN articles just gives me the feeling I'm once again on the wrong side of the filesystem situation.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Nov 30, 2011 22:11 UTC (Wed) by jospoortvliet (guest, #33164) [Link]

Don't worry, btrfs won't be the default, let alone the only, option in SLES anytime soon. It will surely be supported and well integrated with tools like Snapper, but it's still a choice.

As a matter of fact, there is work going on to allow the use of Snapper with ext4, so SUSE ain't jumping ship there.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 3, 2011 0:25 UTC (Sat) by walex (subscriber, #69836) [Link] (8 responses)

Don't worry about SLES. Reiser3, after some initial issues, was actually quite robust, and was designed for robustness. If there were issues after the initial shaking-down period, it was because of the O_PONIES problem that causes so much distrust of ext4 itself, and previously of XFS; but not of JFS, because JFS has always had a rather twitchy flushing logic, sort of equivalent to the short flushing interval ext3 has always had.

Indeed, ext3 got a good reputation mostly because, even when it did not support barriers, it had a very short flushing interval and the like, which made it seemingly resilient to sudden power-off in many cases, even for applications that did not issue fsync(2).

To some extent it is sad that SLES switched to the ext line, but I guess a large part of it was marketing ("it is an industry standard") and the sad story with Namesys.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 3, 2011 11:30 UTC (Sat) by mpr22 (subscriber, #60784) [Link] (7 responses)

Did anyone ever fix the ReiserFS tools to the point that you could safely fsck a ReiserFS volume that contained an uncompressed ReiserFS image?

deep recovery and embedded filesystem images

Posted Dec 3, 2011 17:57 UTC (Sat) by walex (subscriber, #69836) [Link] (3 responses)

That's an interesting case. ReiserFS was designed to be very robust in the face of partial data loss, allowing for a reconstruction of the file system metadata from recognizable copies embedded in the files themselves.

Thus the contents of an embedded ReiserFS image will look like lost files from the containing filesystem, if the option to reconstruct metadata is enabled.

Running man reiserfsck is advised before doing recovery on a damaged ReiserFS image. Paying particular attention to the various mentions of --rebuild-tree may be wise.

In other words there is nothing to fix, except a lack of awareness of one of the better features of ReiserFS, or perhaps a lack of a specific warning.

deep recovery and embedded filesystem images

Posted Dec 4, 2011 1:05 UTC (Sun) by nix (subscriber, #2304) [Link] (2 responses)

You're the only person I have ever heard call reiserfsck's propensity to fuse disk images into an unholy union with their containing filesystem a *feature*.

I don't think it was ever any part of reiserfsck's design to "reconstruct[] file system metadata from recognizable copies embedded in the files themselves" because nobody ever does that (how many copies of your inode tables do you have written to files in your filesystem for safety? None, that's right). It's more that reiserfsck --rebuild-tree simply scanned the whole partition for things that looked like btree nodes, and if it found them, it assumed they came from the same filesystem, and not from a completely different filesystem that happened to be merged into it -- there was no per-filesystem identifier in each node or anything like that, so they all got merged together.

This is plainly a design error, but equally plainly not one that would have been as obvious when reiserfs was designed as it is now, when disks were smaller than they are today and virtual machine images much rarer.

If you want some real fun, try a reiserfs filesystem with an ext3 filesystem inside it and another reiserfs filesystem embedded in that. To describe what reiserfsck --rebuild-tree on the outermost filesystem does to the innermost two would require using words insufficiently family-friendly for this site (though it is extremely amusing if you have nothing important on the fs).

Rebuild tree a useful feature with side-effects.

Posted Dec 8, 2011 4:26 UTC (Thu) by gmatht (subscriber, #58961) [Link] (1 responses)

I don't think that merging filesystems was meant to be a feature, but rather that --rebuild-tree is a useful feature that other fscks don't have.

If someone has stored all their precious photos and media files on a disk and the metadata is trashed, then rebuilding the tree should get them their files back where a regular fsck wouldn't. I wouldn't trust --rebuild-tree not to add random files at the best of times; for example, I understand that it restores deleted files [1], which you probably don't want in a routine fsck. If, on the other hand, you've just found out that all your backups are on write-only media, rebuilding a tree from its leaves could save you from losing years of work. It would be even better if it didn't merge partitions, but it is still better than nothing if used as a last resort.

I think it would also be better if it encouraged you to rebuild the tree onto an entirely new partition.

[1] http://www.linuxquestions.org/linux/answers/Hardware/Reis...

Rebuild tree a useful feature with side-effects.

Posted Dec 8, 2011 5:35 UTC (Thu) by tytso (subscriber, #9993) [Link]

The fsck for ext2/3/4 doesn't have this feature because it doesn't need it. One of the tradeoffs of using a dynamic inode table (since in reiserfs it is stored as part of the btree) is that if you lose the root node of the file system, you have no choice but to search the entire disk looking for nodes that appear to belong to the file system b-tree.

With ext2/3/4, we have a static inode table. This does have some disadvantages, but the advantage is that it's much more robust against file system damage, since the location of the metadata is much more predictable.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 12, 2011 16:15 UTC (Mon) by nye (subscriber, #51576) [Link] (2 responses)

>Did anyone ever fix the ReiserFS tools to the point that you could safely fsck a ReiserFS volume that contained an uncompressed ReiserFS image?

The existing replies have basically answered this, but just to make it clear:

You could always do that.

ReiserFS *additionally* came with an *option* designed to make a last-ditch attempt at recovering a totally hosed filesystem by looking for any data on the disk that looked like ReiserFS data structures and making its best guess at rebuilding it based on that.

Somehow the FUD brigade latched on to the drawbacks of that feature and conveniently 'forgot' that it was neither the only, nor the default fsck method.

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 12, 2011 16:46 UTC (Mon) by jimparis (guest, #38647) [Link] (1 responses)

In my long-ago experience, reiserfsck --fix-fixable did absolutely nothing to improve a broken filesystem, and --rebuild-tree was the only way to get anything out. Maybe you got very lucky?

Improving ext4: bigalloc, inline data, and metadata checksums

Posted Dec 14, 2011 12:15 UTC (Wed) by nye (subscriber, #51576) [Link]

>Maybe you got very lucky?

Maybe I did. Or maybe you got unlucky. Most of the people commenting on it, though, *never tried*; they just heard something bad via hearsay and parroted it, and that just gets to me.

I wonder if this can be made to work well with SSD eraseblocks

Posted Dec 4, 2011 13:14 UTC (Sun) by dlang (guest, #313) [Link]

If bigalloc can be made to match the eraseblock size of an SSD, align properly, have the I/O for the entire block written at once, etc. (and prevent the SSD smarts from confusing things), this could be a huge win from both a durability and a performance point of view on an SSD.


Copyright © 2011, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds