Improving ext4: bigalloc, inline data, and metadata checksums
Bigalloc
In the early days of Linux, disk drives were still measured in megabytes and filesystems worked with blocks of 1KB to 4KB in size. As this article is being written, terabyte disk drives are not quite as cheap as they recently were, but the fact remains: disk drives have gotten a lot larger, as have the files stored on them. But the ext4 filesystem still deals in 4KB blocks of data. As a result, there are a lot of blocks to keep track of, the associated allocation bitmaps have grown, and the overhead of managing all those blocks is significant.
Raising the filesystem block size in the kernel is a dauntingly difficult task involving major changes to memory management, the page cache, and more. It is not something anybody expects to see happen anytime soon. But there is nothing preventing filesystem implementations from using larger blocks on disk. As of the 3.2 kernel, ext4 will be capable of doing exactly that. The "bigalloc" patch set adds the concept of "block clusters" to the filesystem; rather than allocate single blocks, a filesystem using clusters will allocate them in larger groups. Mapping between these larger blocks and the 4KB blocks seen by the core kernel is handled entirely within the filesystem.
The cluster size to use is set by the system administrator at filesystem creation time (using a development version of e2fsprogs); it must be a power-of-two multiple of the block size. A 64KB cluster size may make sense in a lot of situations; for a filesystem that holds only very large files, a 1MB cluster size might be the right choice. Needless to say, selecting a large cluster size for a filesystem dominated by small files may lead to a substantial amount of wasted space.
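As a rough illustration (assuming e2fsprogs 1.42 or later with bigalloc support; the device names are placeholders and the option spelling in the development versions mentioned above may differ), creating and inspecting a clustered filesystem looks something like this:

    # Create an ext4 filesystem with 4KB blocks grouped into 64KB clusters.
    mke2fs -t ext4 -b 4096 -O bigalloc -C 65536 /dev/sdb1

    # Confirm the cluster size the filesystem was actually built with.
    tune2fs -l /dev/sdb1 | grep -i cluster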
Clustering reduces the space overhead of the block bitmaps and other management data structures. But, as Ted Ts'o documented back in July, it can also increase performance in situations where large files are in use. Block allocation times drop significantly, but file I/O performance also improves in general as the result of reduced on-disk fragmentation. Expect this feature to attract a lot of interest once the 3.2 kernel (and e2fsprogs 1.42) make their way to users.
Inline data
An inode is a data structure describing a single file within a filesystem. For most filesystems, there are actually two types of inode: the filesystem-independent in-kernel variety (represented by struct inode), and the filesystem-specific on-disk version. As a general rule, the kernel cannot manipulate a file in any way until it has a copy of the inode, so inodes, naturally, are the focal point for a lot of block I/O.
In the ext4 filesystem, the size of on-disk inodes can be set when a filesystem is created. The default size is 256 bytes, but the on-disk structure (struct ext4_inode) only requires about half of that space. The remaining space after the ext4_inode structure is normally used to hold extended attributes. Thus, for example, SELinux labels can be found there. On systems where extended attributes are not heavily used, the space between on-disk inode structures may simply go to waste.
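For illustration, the on-disk inode size can be queried on an existing filesystem and chosen when a new one is created (the device names below are placeholders):

    # Show the inode size of an existing ext4 filesystem (256 bytes by default).
    tune2fs -l /dev/sda1 | grep "Inode size"

    # Build a filesystem with 512-byte on-disk inodes, leaving more room after
    # the ext4_inode structure for extended attributes.
    mke2fs -t ext4 -I 512 /dev/sdb1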
Meanwhile, space for file data is allocated in units of blocks, separately from the inode. If a file is very small (and, even on current systems, there are a lot of small files), much of the block used to hold that file will be wasted. If the filesystem is using clustering, the amount of lost space will grow even further, to the point that users may start to complain.
Tao Ma's ext4 inline data patches may change that situation. The idea is quite simple: very small files can be stored directly in the space between inodes without the need to allocate a separate data block at all. On filesystems with 256-byte on-disk inodes, the entire remaining space will be given over to the storage of small files. If the filesystem is built with larger on-disk inodes, only half of the leftover space will be used in this way, leaving space for late-arriving extended attributes that would otherwise be forced out of the inode.
Tao says that, with this patch set applied, the space required to store a kernel tree drops by about 1%, and /usr gets about 3% smaller. The savings on filesystems where clustering is enabled should be somewhat larger, but those have not yet been quantified. There are a number of details to be worked out yet - including e2fsck support and the potential cost of forcing extended attributes to be stored outside of the inode - so this feature is unlikely to be ready for inclusion before 3.4 at the earliest.
Metadata checksumming
Storage devices are not always as reliable as we would like them to be; stories of data corrupted by the hardware are not uncommon. For this reason, people who care about their data make use of technologies like RAID and/or filesystems like Btrfs which can maintain checksums of data and metadata and ensure that nothing has been mangled by the drive. The ext4 filesystem, though, lacks this capability.
Darrick Wong's checksumming patch set does not address the entire problem. Indeed, it risks reinforcing the old jest that filesystem developers don't really care about the data they store as long as the filesystem metadata is correct. This patch set seeks to achieve that latter goal by attaching checksums to the various data structures found on an ext4 filesystem - superblocks, bitmaps, inodes, directory indexes, extent trees, etc. - and verifying that the checksums match the data read from the filesystem later on. A checksum failure can cause the filesystem to fail to mount or, if it happens on a mounted filesystem, remount it read-only and issue pleas for help to the system log.
Darrick makes no mention of any plans to add checksums for data as well. In a number of ways, that would be a bigger set of changes; checksums are relatively easy to add to existing metadata structures, but an entirely new data structure would have to be added to the filesystem to hold data block checksums. The performance impact of full-data checksumming would also be higher. So, while somebody might attack that problem in the future, it does not appear to be on anybody's list at the moment.
The changes to the filesystem are significant, even for metadata-only checksums, but the bulk of the work actually went into e2fsprogs. In particular, e2fsck gains the ability to check all of those checksums and, in some cases, fix things when the checksum indicates that there is a problem. Checksumming can be enabled with mke2fs and toggled with tune2fs. All told, it is a lot of work, but it should help to improve confidence in the filesystem's structure. According to Darrick, the overhead of the checksum calculation and verification is not measurable in most situations. This feature has not drawn a lot of comments this time around, and may be close to ready for inclusion, but nobody has yet said when that might happen.
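For reference, the feature eventually shipped under the name "metadata_csum"; with a sufficiently new e2fsprogs the commands look roughly like the following (device names are placeholders, and the interface was still in flux when this patch set was posted):

    # Enable metadata checksums at filesystem creation time.
    mke2fs -t ext4 -O metadata_csum /dev/sdc1

    # Or toggle the feature on an existing, unmounted filesystem, then let
    # e2fsck compute and verify the checksums.
    tune2fs -O metadata_csum /dev/sdc1
    e2fsck -f /dev/sdc1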
Posted Nov 29, 2011 23:44 UTC (Tue)
by pr1268 (subscriber, #24648)
[Link] (103 responses)
> It is solid and reliable

I'm not so sure about that; I've suffered data corruption in a stand-alone ext4 filesystem with a bunch of OGG Vorbis files—occasionally ogginfo(1) reports corrupt OGG files. Fortunately I have backups. I'm going back to ext3 at the soonest opportunity. FWIW I'm using a multi-disk LVM setup—I wonder if that's the culprit?
Posted Nov 29, 2011 23:49 UTC (Tue)
by yoe (guest, #25743)
[Link] (10 responses)
Try to nail down whether your problem is LVM, one of your disks dying, or ext4, before changing things like that. Otherwise you'll be debugging for a long time to come...
Posted Nov 30, 2011 4:49 UTC (Wed)
by ringerc (subscriber, #3071)
[Link] (8 responses)
I recently had a batch of disks in a backup server start eating data because of a HDD firmware bug. It does happen.
Posted Nov 30, 2011 8:29 UTC (Wed)
by hmh (subscriber, #3838)
[Link]
Posted Nov 30, 2011 12:02 UTC (Wed)
by tialaramex (subscriber, #21167)
[Link] (6 responses)
Screams RAM or cache fault to me. It's that word "occasionally" which does it. Bugs tend to be systematic. Their symptoms may be bizarre, but there's usually something consistent about them, because after all someone has specifically (albeit accidentally) programmed the computer to do exactly whatever it was that happened. Even the most subtle Heisenbug will have some sort of pattern to it.
Yoe should be especially suspicious of their "blame ext4" idea if this "corruption" is one or two corrupted bits rather than big holes in the file. Disks don't tend to lose individual bits. Disk controllers don't tend to lose individual bits. Filesystems don't tend to lose individual bits. These things all deal in blocks, when they lose something they will tend to lose really big pieces.
But dying RAM, heat-damaged CPU cache, or a serial link with too little margin of error, those lose bits. Those are the places to look when something mysteriously becomes slightly corrupted.
Low-level network protocols often lose bits. But because there are checksums in so many layers you won't usually see this in a production system even when someone has goofed (e.g. not implemented Ethernet checksums at all) because the other layers act as a safety net.
Posted Nov 30, 2011 12:44 UTC (Wed)
by tialaramex (subscriber, #21167)
[Link]
Posted Nov 30, 2011 15:42 UTC (Wed)
by pr1268 (subscriber, #24648)
[Link] (4 responses)
The corruption I was getting was not merely "one or two bits" but rather a hole in the OGG file big enough to cause an audible "skip" in the playback—large enough to believe it was a whole block disappearing from the filesystem. Also, the discussion of write barriers came up; I have noatime,data=ordered,barrier=1 as mount options for this filesystem in my /etc/fstab file—I'm pretty sure those are the "safe" defaults (but I could be wrong).
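For what it's worth, a quick way to see which options a filesystem is actually mounted with, and what defaults were recorded in its superblock (the mount point and device are placeholders):

    # Options currently in effect for the mounted filesystem.
    grep ' /music ' /proc/mounts

    # Default mount options stored in the superblock at mkfs/tune2fs time.
    tune2fs -l /dev/mapper/vg0-music | grep "Default mount options"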
Posted Nov 30, 2011 17:31 UTC (Wed)
by rillian (subscriber, #11344)
[Link] (3 responses)
Ogg streams carry a checksum on every page, so a few bit errors will cause the decoder to drop ~100 ms of audio at a time, and tools will report this as a 'hole in data'. To see whether it's disk or filesystem corruption, look for pages of zeros in a hexdump around where the glitch is.
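One rough way to do that check from the command line (the file name is a placeholder):

    # Dump the file as fixed-width hex lines and count runs of identical lines;
    # a huge count for the all-zero line points at zero-filled blocks.
    hexdump -v -e '16/1 "%02x" "\n"' corrupt.ogg | uniq -c | sort -rn | head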
Posted Dec 1, 2011 3:17 UTC (Thu)
by quotemstr (subscriber, #45331)
[Link] (2 responses)
Posted Dec 1, 2011 10:07 UTC (Thu)
by mpr22 (subscriber, #60784)
[Link]
Posted Dec 1, 2011 18:25 UTC (Thu)
by rillian (subscriber, #11344)
[Link]
The idea with the Ogg checksums was to protect the listener's ears (and possibly speakers) from corrupt output. It's also nice to have a built-in check for data corruption in your archives, which is working as designed here.
What you said is valid for video, because we're more tolerant of high frequency visual noise, and because the extra data dimensions and longer prediction intervals mean you can get more useful information from a corrupt frame than you do with audio. Making the checksum optional for the packet data is one of the things we'd do if we ever revised the Ogg format.
Posted Dec 2, 2011 22:09 UTC (Fri)
by giraffedata (guest, #1954)
[Link]
That's not shotgun debugging (and not what the Jargon File calls it). The salient property of a shotgun isn't that it makes radical changes, but that it makes widespread changes. So you hit what you want to hit without aiming at it.
Shotgun debugging is trying lots of little things, none that you particularly believe will fix the bug.
In this case, the fallback to ext3 is fairly well targeted: the problem came contemporaneously with this one major and known change to the system, so it's not unreasonable to try undoing that change.
The other comments give good reason to believe this is not the best way forward, but it isn't because it's shotgun debugging.
There must be a term for the debugging mistake in which you give too much weight to the one recent change you know about in the area; I don't know what it is. (I've lost count of how many people accused me of breaking their Windows system because after I used it, there was a Putty icon on the desktop and something broke soon after that).
Posted Nov 30, 2011 0:00 UTC (Wed)
by bpepple (subscriber, #50705)
[Link] (17 responses)
Posted Nov 30, 2011 0:32 UTC (Wed)
by pr1268 (subscriber, #24648)
[Link] (16 responses)
Thanks for the pointer, and thanks also to yoe's reply above. But, my music collection (currently over 10,000 files) has existed for almost four years, ever since I converted the entire collection from MP3 to OGG (via a homemade script which took about a week to run).[1] (I've never converted from FLAC to OGG, although I do have a couple of FLAC files.) I never noticed any corruption in the OGG files until a few months ago, shortly after I did a clean OS re-install (Slackware 13.37) on bare disks (including copying the music files).[2] I'm all too eager to blame the corruption on ext4 and/or LVM, since those were the only two things that changed immediately prior to the corruption, but you both bring up a good point that maybe I should dig a little deeper into finding the root cause before I jump to conclusions.

[1] I've had this collection of (legitimately acquired) songs for years prior, even having it on NTFS back in my Win2000/XP days. I abandoned Windows (including NTFS) in August 2004, and my music collection was entirely MP3 format (at 320 kbit) since I got my first 200GB hard disk. After seeing the benefits of the OGG Vorbis format, I decided to switch.

[2] I have four physical disks (volumes) on which I've set up a PV set spanning all disks for fast I/O performance. I'm not totally impressed with the performance—it is somewhat faster—but that's a whole other discussion.
Posted Nov 30, 2011 0:57 UTC (Wed)
by yokem_55 (guest, #10498)
[Link] (9 responses)
Posted Nov 30, 2011 2:11 UTC (Wed)
by dskoll (subscriber, #1630)
[Link] (5 responses)
I also had a very nasty experience with ext4. A server I built using ext4 suffered a power failure and the file system was completely toast after it powered back up. fsck threw hundreds of errors and I ended up rebuilding from scratch.
I have no idea if ext4 was the cause of the problem, but I've never seen that on an ext3 system. I am very nervous... possibly irrationally so, but I think I'll stick to ext3 for now.
Posted Nov 30, 2011 4:52 UTC (Wed)
by ringerc (subscriber, #3071)
[Link] (3 responses)
Write-back caching on volatile storage without careful use of write barriers and forced flushes *will* cause severe data corruption if the storage is cleared due to (eg) unexpected power loss.
Posted Nov 30, 2011 9:00 UTC (Wed)
by Cato (guest, #7643)
[Link] (2 responses)
Posted Nov 30, 2011 12:40 UTC (Wed)
by dskoll (subscriber, #1630)
[Link] (1 responses)
My system was using Linux Software RAID, so there wasn't a cheap RAID controller in the mix. You could be correct about the hard drives doing caching, but it seems odd that I've never seen this with ext3 but did with ext4. I am still hoping it was simply bad luck, bad timing, and writeback caching... but I'm also still pretty nervous.
Posted Nov 30, 2011 12:50 UTC (Wed)
by dskoll (subscriber, #1630)
[Link]
Ah... reading http://serverfault.com/questions/279571/lvm-dangers-and-caveats makes me think I was a victim of LVM and no write barriers. I've followed the suggestions in that article. So maybe I'll give ext4 another try.
Posted Nov 30, 2011 20:20 UTC (Wed)
by walex (subscriber, #69836)
[Link]
It is a very well known issue, usually involving unaware sysadmins and cheating developers.
Posted Nov 30, 2011 2:13 UTC (Wed)
by nix (subscriber, #2304)
[Link] (2 responses)
I'm quite willing to believe that bad RAM and the like can cause data corruption, but even when I was running ext4 on a machine with RAM so bad that you couldn't md5sum a 10Mb file three times and get the same answer thrice, I had no serious corruption (though it is true that I didn't engage in major file writing while the RAM was that bad, and I did get the occasional instances of bitflips in the page cache, and oopses every day or so).
Posted Nov 30, 2011 12:49 UTC (Wed)
by tialaramex (subscriber, #21167)
[Link] (1 responses)
To someone who isn't looking for RAM/cache issues as the root cause, those often look just like filesystem corruption of whatever kind. They try to open a file and get an error saying it's corrupted. Or they run a program and it mysteriously crashes.
If you _already know_ you have bad RAM, then you say "Ha, bitflip in page cache" and maybe you flush a cache and try again. But if you've already begun to harbour doubts about Seagate disks, or Dell RAID controllers, or XFS then of course that's what you will tend to blame for the problem.
Posted Dec 1, 2011 19:23 UTC (Thu)
by nix (subscriber, #2304)
[Link]
Rare bitflips are normally going to be harmless or fixed up by e2fsck, one would hope. There may be places where a single bitflip, written back, toasts the fs, but I'd hope not. (The various fs fuzzing tools would probably have helped comb those out.)
Posted Nov 30, 2011 10:19 UTC (Wed)
by Trou.fr (subscriber, #26289)
[Link] (5 responses)
Posted Nov 30, 2011 15:35 UTC (Wed)
by pr1268 (subscriber, #24648)
[Link] (4 responses)
From that article:

> Mp3 to Ogg: Ogg -q6 was required to achieve transparency against the (high-quality) mp3 with difficult samples.

I used -q8 (or higher) when transcoding with oggenc(1); I've done extensive testing by transcoding back-and-forth to different formats (including RIFF WAV) and have never noticed any decrease in audio quality or frequency response, even when measured with a spectrum analyzer. I do value your point, though.
Posted Dec 1, 2011 22:54 UTC (Thu)
by job (guest, #670)
[Link] (3 responses)
Posted Dec 10, 2011 1:04 UTC (Sat)
by ibukanov (guest, #3942)
[Link] (2 responses)
Posted Dec 10, 2011 15:20 UTC (Sat)
by corbet (editor, #1)
[Link] (1 responses)
Posted Dec 12, 2011 2:54 UTC (Mon)
by jimparis (guest, #38647)
[Link]
You can't replace missing information, but you could still make something that sounds better -- in a subjective sense. For example, maybe the mp3 has harsh artifacts at higher frequencies that the ogg encoder would remove.
It could apply to lossy image transformations too. Consider this sample set of images.
An initial image is pixelated (lossy), and that result is then blurred (also lossy). Some might argue that the final result looks better than the intermediate one, even though all it did was throw away more information.
But I do agree that this is off-topic, and that such improvement is probably rare in practice.
Posted Nov 30, 2011 8:50 UTC (Wed)
by ebirdie (guest, #512)
[Link]
Lesson learned: it pays to keep data on smaller volumes although it is very very tempting to stuff data to ever bigger volumes and postpone the headache in splitting and managing smaller volumes.
Posted Nov 30, 2011 8:57 UTC (Wed)
by Cato (guest, #7643)
[Link]
This may help: http://serverfault.com/questions/279571/lvm-dangers-and-c...
Posted Nov 30, 2011 21:01 UTC (Wed)
by walex (subscriber, #69836)
[Link] (58 responses)
But the main issue is not that; by all accounts 'ext4' is quite reliable (when on a properly set up storage system and properly used by applications).
The big problem with 'ext4' is that its only reason to be is to allow Red Hat customers to upgrade in place existing systems, and what Red Hat wants, Red Hat gets (also because they usually pay for that and the community is very grateful).
Other than that, for new "typical" systems almost only JFS and XFS make sense (and perhaps in the distant future BTRFS).
In particular JFS should have been the "default" Linux filesystem instead of ext[23] for a long time. Not making JFS the default was probably the single worst strategic decision for Linux (but it can be argued that letting GKH near the kernel was even worse). JFS is still probably (by a significant margin) the best "all-rounder" filesystem (XFS beats it in performance only on very parallel large workloads, and it is way more complex, and JFS has two uncommon but amazingly useful special features).
Sure it was very convenient to let people (in particular Red Hat customers) upgrade in place from 'ext' to 'ext2' to 'ext3' to 'ext4' (each in-place upgrade keeping existing files unchanged and usually with terrible performance), but given that when JFS was introduced the Linux base was growing rapidly, new installations could be expected to outnumber old ones very soon, making that point largely moot.
PS: There are other little known good filesystems, like OCFS2 (which is pretty good in non-clustered mode) and NILFS2 (probably going to be very useful on SSDs), but JFS is amazingly still very good. Reiser4 was also very promising (it seems little known that the main developer of BTRFS was also the main developer of Reiser4). As a pet peeve of mine UDF could have been very promising too, as it was quite well suited to RW media like hard disks too (and the Linux implementation almost worked in RW mode on an ordinary partition), and also to SSDs.
Posted Nov 30, 2011 22:07 UTC (Wed)
by yokem_55 (guest, #10498)
[Link]
Posted Nov 30, 2011 23:12 UTC (Wed)
by Lennie (subscriber, #49641)
[Link]
Posted Dec 1, 2011 0:53 UTC (Thu)
by SLi (subscriber, #53131)
[Link] (37 responses)
The only filesystem, years back, that could have been said to outperform ext4 on most counts was ReiserFS 4. Unfortunately, on each of the three times I stress-tested it I hit different bugs that caused data loss.
Posted Dec 1, 2011 2:03 UTC (Thu)
by dlang (guest, #313)
[Link]
I haven't benchmarked against ext4, but I have done benchmarks with the filesystems prior to it, and I've run into many cases where JFS and XFS are clear winners.
even against ext4, if you have a fileserver situation where you have lots of drives involved, XFS is still likely to be a win; ext4 just doesn't have enough developers/testers with large numbers of disks to work with (this isn't my opinion, it's a statement from Ted Ts'o in response to someone pointing out where ext4 doesn't do as well as XFS with a high-performance disk array)
Posted Dec 2, 2011 18:52 UTC (Fri)
by walex (subscriber, #69836)
[Link] (35 responses)
As to JFS or XFS being the preferable filesystem for normal Linux use: believe me, I've tried them both, benchmarked them both, and on almost all counts ext4 outperforms the two by a really wide margin (note that strictly speaking I'm not comparing the filesystems but their Linux implementations). In addition, any failures have tended to be much worse on JFS and XFS than on ext4.
Posted Dec 2, 2011 23:15 UTC (Fri)
by tytso (subscriber, #9993)
[Link] (34 responses)
So benchmarking JFS against file systems that are engineered to be safe against power failures, such as ext4 and XFS, isn't particularly fair. You can disable cache flushes for both ext4 and XFS, but would you really want to run in an unsafe configuration for production servers? And JFS doesn't even have an option for enabling barrier support, so you can't make it run safely without fixing the file system code.
Posted Dec 3, 2011 0:56 UTC (Sat)
by walex (subscriber, #69836)
[Link] (31 responses)
As to JFS and performance and barriers with XFS and ext4:
Posted Dec 3, 2011 1:56 UTC (Sat)
by dlang (guest, #313)
[Link] (27 responses)
Posted Dec 3, 2011 3:06 UTC (Sat)
by raven667 (subscriber, #5198)
[Link] (26 responses)
Posted Dec 3, 2011 6:29 UTC (Sat)
by dlang (guest, #313)
[Link] (25 responses)
it should make barriers very fast so there isn't a big performance hit from leaving them on, but if you disable barriers and think the battery will save you, you are sadly mistaken
Posted Dec 3, 2011 11:05 UTC (Sat)
by nix (subscriber, #2304)
[Link] (24 responses)
If the power is out for months, civilization has probably fallen, and I'll have bigger things to care about than a bit of data loss. Similarly I don't care that battery backup doesn't defend me against people disconnecting the controller or pulling the battery while data is in transit. What other situation does battery backup not defend you against?
Posted Dec 3, 2011 15:39 UTC (Sat)
by dlang (guest, #313)
[Link] (15 responses)
1. writing from the OS to the raid card
2. writing from the raid card to the drives
battery backup on the raid card makes step 2 reliable. this means that if the data is written to the raid card it should be considered as safe as if it was on the actual drives (it's not quite that safe, but close enough)
However, without barriers, the data isn't sent from the OS to the raid card in any predictable pattern. It's sent at the whim of the OS cache flushing algorithm. This can result in some data making it to the raid controller and other data not making it to the raid controller if you have an unclean shutdown. If the data is never sent to the raid controller, then the battery there can't do you any good.

With barriers, the system can enforce that data gets to the raid controller in a particular order, and so the only data that would be lost is the data written since the last barrier operation completed.
note that if you are using software raid, things are much uglier as the OS may have written the stripe to one drive and not to another (barriers only work on a single drive, not across drives). this is one of the places where hardware raid is significantly more robust than software raid.
Posted Dec 3, 2011 18:04 UTC (Sat)
by raven667 (subscriber, #5198)
[Link] (14 responses)
Posted Dec 3, 2011 19:31 UTC (Sat)
by dlang (guest, #313)
[Link] (11 responses)
barriers preserve the ordering of writes throughout the entire disk subsystem, so once the filesystem decides that a barrier needs to be at a particular place, going through a layer of LVM (before it supported barriers) would run the risk of the writes getting out of order
with barriers on software raid, the raid layer won't let the writes on a particular disk get out of order, but it doesn't enforce that all writes before the barrier on disk 1 get written before the writes after the barrier on disk 2
Posted Dec 4, 2011 6:17 UTC (Sun)
by raven667 (subscriber, #5198)
[Link] (10 responses)
In any event, there is a bright line between how the kernel handles internal data structures and what the hardware does; for storage with a battery-backed write cache, once an IO is posted to the storage it is as good as done, so there is no need to ask the storage to commit its blocks in any particular fashion. The only requirement is that the kernel issue the IO requests in a responsible manner.
Posted Dec 4, 2011 6:41 UTC (Sun)
by dlang (guest, #313)
[Link] (8 responses)
per the messages earlier in this thread, JFS does not, for a long time (even after it was the default in Fedora), LVM did not.
so barriers actually working correctly is relatively new (and very recently they have found more efficient ways to enforce ordering than the older version of barriers).
Posted Dec 4, 2011 11:24 UTC (Sun)
by tytso (subscriber, #9993)
[Link]
It shouldn't be that hard to add support, but no one is doing any development work on it.
Posted Dec 4, 2011 16:26 UTC (Sun)
by rahulsundaram (subscriber, #21946)
[Link] (6 responses)
Posted Dec 4, 2011 16:50 UTC (Sun)
by dlang (guest, #313)
[Link] (5 responses)
Fedora has actually been rather limited in its support of various filesystems. The kernel supports the different filesystems, but the installer hasn't given you the option of using XFS or JFS for your main filesystem, for example.
Posted Dec 4, 2011 17:41 UTC (Sun)
by rahulsundaram (subscriber, #21946)
[Link] (4 responses)
"JFS does not, for a long time (even after it was the default in Fedora)"
You are inaccurate in your claim about the installer as well. XFS has been a standard option in Fedora for several releases, ever since Red Hat hired Eric Sandeen from SGI to maintain it (and help develop ext4). JFS is a non-standard option.
Posted Dec 4, 2011 19:22 UTC (Sun)
by dlang (guest, #313)
[Link] (3 responses)
re: XFS, I've been using Linux since '94, so XFS support in the installer is very recent :-)

I haven't been using Fedora for quite a while; my experience with Red Hat distros is mostly RHEL (and CentOS), which lag behind. I believe that RHEL 5 still didn't support XFS in the installer.
Posted Dec 4, 2011 19:53 UTC (Sun)
by rahulsundaram (subscriber, #21946)
[Link]
http://fedoraproject.org/wiki/Releases/10/Beta/ReleaseNot...
That is early 2008. RHEL 6 has XFS support as an add-on subscription, and it is supported within the installer as well, IIRC.
Posted Dec 5, 2011 16:15 UTC (Mon)
by wookey (guest, #5501)
[Link] (1 responses)
(I parsed it the way rahulsundaram did too - it's not clear).
Posted Dec 5, 2011 16:59 UTC (Mon)
by dlang (guest, #313)
[Link]
Posted Jan 30, 2012 8:50 UTC (Mon)
by sbergman27 (guest, #10767)
[Link]
Posted Dec 8, 2011 17:54 UTC (Thu)
by nye (subscriber, #51576)
[Link] (1 responses)
Surely what you're describing is a cache flush, not a barrier?
A barrier is intended to control the *order* in which two pieces of data are written, not when or even *if* they're written. A barrier *could* be implemented by issuing a cache flush in between writes (maybe this is what's commonly done in practice?) but in that case you're getting slightly more than you asked for (ie. you're getting durability of the first write), with a corresponding performance impact.
Posted Dec 8, 2011 23:24 UTC (Thu)
by raven667 (subscriber, #5198)
[Link]
Posted Dec 12, 2011 12:01 UTC (Mon)
by jlokier (guest, #52227)
[Link] (7 responses)
Some battery-backed disk write caches can commit the RAM to flash storage or something else, on battery power, in the event that the power supply is removed for a long time. These systems don't need a large battery and provide stronger long-term guarantees.
Even ignoring ext3's no barrier default, and LVM missing them for ages, there is the kernel I/O queue (elevator) which can reorder requests. If the filesystem issues barrier requests, the elevator will send writes to the storage device in the correct order. If you turn off barriers in the filesystem when mounting, the kernel elevator is free to send writes out of order; then after a system crash, the system recovery will find inconsistent data from the storage unit. This can happen even after a normal crash such as a kernel panic or hard-reboot, no power loss required.
Whether that can happen when you tell the filesystem not to bother with barriers depends on the filesystem's implementation. To be honest, I don't know how ext3/4, xfs, btrfs etc. behave in that case. I always use barriers :-)
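For what it's worth, whether barriers are in effect can usually be checked and adjusted from userspace; a sketch (the mount point and device are placeholders, and exactly which options show up in /proc/mounts varies by kernel version):

    # See whether the filesystem was mounted with barriers disabled.
    grep ' /srv ' /proc/mounts

    # Re-enable barriers explicitly (the ext4 default).
    mount -o remount,barrier=1 /srv

    # The kernel logs a warning if the underlying device cannot honor
    # barrier/flush requests.
    dmesg | grep -i barrier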
Posted Dec 12, 2011 15:40 UTC (Mon)
by andresfreund (subscriber, #69562)
[Link] (6 responses)
Posted Dec 12, 2011 18:14 UTC (Mon)
by dlang (guest, #313)
[Link] (5 responses)
there is no modern filesystem that waits for the data to be written before proceeding. Every single filesystem out there will allow its writes to be cached and actually written out later (in some cases, this can be _much_ later)
when the OS finally gets around to writing the data out, it has no idea what the application (or filesystem) cares about, unless there are barriers issued to tell the OS that 'these writes must happen before these other writes'
Posted Dec 12, 2011 18:15 UTC (Mon)
by andresfreund (subscriber, #69562)
[Link] (4 responses)
Posted Dec 12, 2011 18:39 UTC (Mon)
by dlang (guest, #313)
[Link] (3 responses)
it actually doesn't stop processing requests and wait for the confirmation from the disk; it issues a barrier to tell the rest of the storage stack not to reorder around that point and goes on to process the next request and get it in flight.
Posted Dec 12, 2011 18:53 UTC (Mon)
by andresfreund (subscriber, #69562)
[Link] (2 responses)
It worked a little bit more like you describe before 2.6.37, but back then it waited if barriers were disabled.
Posted Dec 13, 2011 13:35 UTC (Tue)
by nix (subscriber, #2304)
[Link] (1 responses)
Posted Dec 13, 2011 13:38 UTC (Tue)
by andresfreund (subscriber, #69562)
[Link]
Posted Dec 3, 2011 11:00 UTC (Sat)
by nix (subscriber, #2304)
[Link] (1 responses)
Posted Dec 3, 2011 18:06 UTC (Sat)
by raven667 (subscriber, #5198)
[Link]
Posted Dec 3, 2011 20:33 UTC (Sat)
by tytso (subscriber, #9993)
[Link]
ext3 was first supported by RHEL as of RHEL 2, which was released May 2003 --- and as you can see from the dates above, we had developers working at a wide range of companies, thus making it a community-supported distribution, long before Red Hat supported ext3 in their RHEL product. In contrast, most of the reiserfs developers worked at Namesys (with one or two exceptions, most notably Chris Mason when he was at SuSE), and most of the XFS developers worked at SGI.
Posted Dec 5, 2011 16:29 UTC (Mon)
by wookey (guest, #5501)
[Link] (1 responses)
When I managed to repair them I found that many files had big blocks of zeros in them - essentially anything that was in the journal and had not been written. Up to that point I had naively thought that the point of the journal was to keep actual data, not just filesystem metadata. Files that have been 'repaired' by being silently filled with big chunks of zeros did not impress me.
So I now believe that XFS is/was good, but only on properly UPSed servers. Am I wrong about that?
Posted Dec 5, 2011 17:03 UTC (Mon)
by dlang (guest, #313)
[Link]
XFS caches more stuff than ext does, so a crash loses more stuff.

so XFS or ext* with barriers disabled is not good to use. For a long time, running these things on top of LVM had the side effect of disabling barriers; it's only recently that LVM gained the ability to support them

JFS is not good to use (as it doesn't have barriers at all)

note that while XFS is designed to be safe, that doesn't mean that it won't lose data, just that the metadata will not be corrupt.

the only way to not lose data in a crash/power failure is to do no buffering at all, and that will absolutely kill your performance (and we are talking hundreds of times slower, not just a few percentage points)
Posted Dec 1, 2011 2:58 UTC (Thu)
by tytso (subscriber, #9993)
[Link] (2 responses)
JFS was a very good file system, and at the time when it was released, it certainly was better than ext3. But there's a lot more to having a successful open source project beyond having the best technology. The fact that ext2 was well understood, and had a mature set of file system utilities, including tools like "debugfs", are one of the things that do make a huge difference towards people accepting the technology.
At this point, though, ext4 has a number of features which JFS lacks, including delayed allocation, fallocate, punch, and TRIM/discard support. These are all features which I'm sure JFS would have developed if it still had a development community, but when IBM decided to defund the project, there were few or no developers who were not IBM'ers, and so the project stalled out.
---
People who upgrade in place from ext3 to ext4 will see roughly half the performance increase compared to doing a backup, reformat to ext4, and restore operation. But they *do* see a performance increase if they do an upgrade-in-place operation. In fact, even if they don't upgrade the file system image, and use ext4 to mount an ext2 file system image, they will see some performance improvement. So this gives them flexibility, which from a system administrator's point of view, is very, very important!
---
Finally, I find it interesting that you consider OCFS2 "pretty good" in non-clustered mode. OCFS2 is a fork of the ext3 code base[1] (it even uses fs/jbd and now fs/jbd2) with support added for clustered operation, and with support for extents (which ext4 has as well, of course). It doesn't have delayed allocation. But ext4 will be better than ocfs2 in non-clustered mode, simply because it's been optimized for it. The fact that you seem to think OCFS2 is "pretty good", while you don't seem to think much of ext4, makes me wonder whether you have some pretty strong biases against the ext[234] file system family.
[1] Ocfs2progs is also a fork of e2fsprogs. Which they did with my blessing, BTW. I'm glad to see that the code that has come out of the ext[234] project have been useful in so many places. Heck, parts of the e2fsprogs (the UUID library, which I relicensed to BSD for Apple's benefit) can be found in Mac OS X! :-)
Posted Dec 1, 2011 20:25 UTC (Thu)
by sniper (guest, #13219)
[Link] (1 responses)
ocfs2 is not a fork of ext3 and neither is ocfs2-tools a fork of e2fsprogs. But both have benefited a _lot_ from ext3. In some instances, we copied code (non-indexed dir layout). In some instances, we used a different approach because of collective experience (indexed dir). grep ext3 fs/ocfs2/* for more.
The toolset has a lot more similarities to e2fsprogs. It was modeled after it because it is well designed and to also allow admins to quickly learn it. The tools even use the same parameter names where possible. grep -r e2fsprogs * for more.
BTW, ocfs2 has had bigalloc (aka clusters) since day 1, inline-data since 2.6.24 and metadata checksums since 2.6.29. Yes, it does not have delayed allocations.
Posted Apr 13, 2012 19:30 UTC (Fri)
by fragmede (guest, #50925)
[Link]
LVM snapshots are a joke if you have *lots* of snapshots, though I haven't looked at btrfs snapshots since it became production ready.
Posted Dec 1, 2011 3:22 UTC (Thu)
by tytso (subscriber, #9993)
[Link]
At the time when I started working on ext4, XFS developers were mostly still working for SGI, so there was a similar problem with the distributions not having anyone who could support or debug XFS problems. This has changed more recently, as more and more XFS developers have left SGI (voluntarily or involuntarily) and joined companies such as Red Hat. XFS has also improved its small-file performance, which was something it didn't do particularly well simply because SGI didn't optimize for that; its sweet spot was and still is really large files on huge RAID arrays.
One of the reasons why I felt it was necessary to work on ext4 was that everyone I talked to who had created a file system before in the industry, whether it was GPFS (IBM's cluster file system), or Digital Unix's advfs, or Sun's ZFS, gave estimates of somewhere between 50 and 200 person-years' worth of effort before the file system was "ready". Even if we assume that open source development practices would make development go twice as fast, and if we ignore the high end of the range because cluster file systems are hard, I was skeptical it would get done in two years (which was the original estimate) given the number of developers it was likely to attract. Given that btrfs started at the beginning of 2007, and here we are almost at 2012, I'd say my fears were justified.
At this point, I'm actually finding that ext4 has found a second life as a server file system in large cloud data centers. It turns out that if you don't need the fancy-schmancy features that copy-on-write file systems give you, they aren't free. In particular, ZFS has a truly prodigious appetite for memory, and one of the things about cloud servers is that in order for them to make economic sense, you try to pack as many jobs or VMs onto them as possible, so they are constantly under memory pressure. We've done some further optimizations so that ext4 performs much better when under memory pressure, and I suspect at this point that in a cloud setting, using a CoW file system may simply not make sense.
Once btrfs is ready for some serious benchmarking, it would be interesting to benchmark it under serious memory pressure, and see how well it performs. Previous CoW file systems, such as BSD's lfs two decades ago, and ZFS more recently, have needed a lot of memory to cache metadata blocks, and it will be interesting to see if btrfs has similar issues.
Posted Dec 1, 2011 19:36 UTC (Thu)
by nix (subscriber, #2304)
[Link] (13 responses)
I also see that I was making some sort of horrible mistake by installing ext4 on all my newer systems, but you never make clear what that mistake might have been.
I've been wracking my brains and I can't think of one thing Greg has done that has come to public knowledge and could be considered bad. So this looks like groundless personal animosity to me.
Posted Dec 1, 2011 19:41 UTC (Thu)
by andresfreund (subscriber, #69562)
[Link]
Posted Dec 2, 2011 11:35 UTC (Fri)
by alankila (guest, #47141)
[Link] (5 responses)
Posted Dec 2, 2011 18:40 UTC (Fri)
by nix (subscriber, #2304)
[Link]
(Yes, I read the release notes, so didn't fall into these traps, but FFS, at least the latter problem was trivial to work around -- one line in the makefile to drop a symlink in /sbin -- and they just didn't bother.)
Posted Dec 2, 2011 23:40 UTC (Fri)
by walex (subscriber, #69836)
[Link] (3 responses)
As to udev, some people dislike smarmy shysters who replace well-designed working subsystems seemingly for the sole reason of making a political landgrab, because the replacement has both more kernel complexity and more userland complexity and less stability. The key features of devfs were that it would automatically populate /dev from the kernel with basic device files (major, minor) and then use a very simple userland daemon to add extra aliases as required.

It turns out that, after several attempts to get it to work, udev adds to /sys from inside the kernel exactly the same information, so there has been no migration of functionality from kernel to userspace. And the userland part is also far more complex and unstable than devfsd ever was (for example, devfs did not require cold start). And udev is just the most shining example of a series of similar poor decisions (which however seem to have been improving a bit with time).
Posted Dec 3, 2011 3:16 UTC (Sat)
by raven667 (subscriber, #5198)
[Link] (1 responses)
Posted Dec 3, 2011 11:07 UTC (Sat)
by nix (subscriber, #2304)
[Link]
Posted Dec 3, 2011 4:04 UTC (Sat)
by alankila (guest, #47141)
[Link]
Posted Dec 3, 2011 0:12 UTC (Sat)
by walex (subscriber, #69836)
[Link] (5 responses)
«tytso wasn't working for RH when ext4 started up, and still isn't working for them now. So their influence must be more subtle.»

Quite irrelevant: a lot of filesystems were somebody's hobby filesystems, but they did not achieve prominence and instant integration into mainline even if rather alpha, and Red Hat did not spend enormous amounts of resources quality-assuring them to make them production ready either; quality assurance is a pretty vital detail for filesystems, as the Namesys people discovered. Pointing to tytso is just misleading. Also because ext4 really was seeded by Lustre people before tytso became active on it in his role as ext3 curator (and in 2005, which is 5 years later than when JFS became available).

Similarly for BTRFS: it was initiated by Oracle (who have an ext3 installed base), but its main appeal is still as the next in-place upgrade on the Red Hat installed base (thus the interest in trialing it in Fedora, where EL candidate stuff is mass tested), even if for once it is not just an extension of the ext line but has some interesting new angles.

But considering ext4 on its own is a partial view; one must consider the pre-existing JFS and XFS stability and robustness and performance. From a technical point of view ext4 is not that interesting (euphemism) and its sole appeal is in-place upgrades, and the widest installed base for that is Red Hat; to a large extent that could have been said of ext3 too.
Posted Dec 3, 2011 0:52 UTC (Sat)
by nix (subscriber, #2304)
[Link] (1 responses)
And if you're claiming that btrfs is effectively RH-controlled merely because RH customers will benefit, then *everything* that happens to Linux must by your bizarre definition be RH-controlled. That's a hell of a conspiracy: so vague that the coconspirators don't even realise they're conspiring!
Posted Apr 13, 2012 19:34 UTC (Fri)
by fragmede (guest, #50925)
[Link]
Posted Dec 3, 2011 19:45 UTC (Sat)
by tytso (subscriber, #9993)
[Link] (2 responses)
But you can't have it both ways. If that code had been in use by paying Lustre companies, then it's hardly alpha code, wouldn't you agree?
And why did the Lustre developers at ClusterFS choose ext3? Because the engineers they hired knew ext3, since it was a community-supported distribution, whereas JFS was controlled by a core team that was all IBM'ers, and hardly anyone outside of IBM was available who knew JFS really well.

But as others have already pointed out, there was no grand conspiracy to pick ext2/3/4 over its competition. It won partially due to its installed base, and partially because of the availability of developers who understood it (and books written about it, etc., etc., etc.). The way you've been writing, you seem to think there was some secret cabal (at Red Hat?) that made these decisions, and that there was a "mistake" because they didn't choose your favorite file systems.
The reality is that file systems all have trade-offs, and what's good for some people is not so great for others. Take a look at some of the benchmarks at btrfs.boxacle.net; they're a bit old, but they are well done, and they show that across many different workloads at that time (2-3 years ago) there was no one single file system that was the best across all of the different workloads. So anyone who only uses a single workload, or a single hardware configuration, and tries to use that to prove that their favorite file system is the "best" is trying to sell you something, or is a slashdot kiddie who has a fan-favorite file system. The reality is a lot more complicated than that, and it's not just about performance. (Truth be told, for many or most use cases, the file system is not the bottleneck.) Issues like the availability of engineers to support the file system in a commercial product, the maturity of the userspace support tools, ease of maintainability, etc. are at least as important if not more so.
Posted Dec 3, 2011 20:43 UTC (Sat)
by dlang (guest, #313)
[Link] (1 responses)
Add to this the fact that you did not need to reformat your system to use ext3 when upgrading, and the fact that ext3 became the standard (taking over from ext2, which was the prior standard) is a no-brainer, and no conspiracy.
In those days XFS would outperform ext3, but only in benchmarks on massive disk arrays (which were even more out of people's price ranges at that point than they are today)

XFS was scalable to high-end systems, but its low-end performance was mediocre

looking at things nowadays, XFS has had a lot of continuous improvement and integration, both improving its high-end performance and reliability and improving its low-end performance without losing its scalability. There are also more people, working for more companies, supporting it, making it far less of a risk today, with far more in the way of upsides.
JFS has received very little attention after the initial code dump from IBM, and there is now nobody actively maintaining/improving it, so it really isn't a good choice going forward.
reiserfs had some interesting features and performance, but it suffered from some seriously questionable benchmarking (the one that turned me off to it entirely was a spectacular benchmarking test that reiserfs completed in 20 seconds but that took several minutes on ext*; then we discovered that reiserfs defaulted to a 30-second delay before writing everything to disk, so the entire benchmark was complete before any data started getting written to disk, and after that I didn't trust anything that they claimed), and a few major problems (the fsck scrambling is a huge one). It was then abandoned by the developer in favor of the future Reiser4, with improvements that were submitted being rejected as they were going to be part of the new, incompatible filesystem.

ext4 is in large part a new filesystem whose name just happens to be similar to what people are running, but it has now been out for several years, with developers who are responsive to issues, are a diverse set (no vendor lock-in or dependencies), and are willing to say where the filesystem is not the best choice.

btrfs is still under development (the fact that they don't yet have a fsck tool is telling), is making claims that seem too good to be true, and has already run into several cases of pathological behavior where things had to be modified significantly. I wouldn't trust it for anything other than non-critical personal use for another several years.
as a result, I am currently using XFS for the most part, but once I get a chance to do another round of testing, ext4 will probably join it. I have a number of systems that have significant numbers of disks, so XFS will probably remain in use.
Posted Dec 4, 2011 1:12 UTC (Sun)
by nix (subscriber, #2304)
[Link]
Posted Dec 1, 2011 3:46 UTC (Thu)
by eli (guest, #11265)
[Link] (1 responses)
Posted Dec 6, 2011 1:06 UTC (Tue)
by pr1268 (subscriber, #24648)
[Link]
Thanks for the suggestion; I'll give it a try sometime if/when I find a corrupt OGG file. Just to bring some closure to this discussion, I wish to make a few points: Many thanks to everyone's discussion above; I always learn a lot from the comments here on LWN.
Posted Dec 8, 2011 15:34 UTC (Thu)
by lopgok (guest, #43164)
[Link] (8 responses)
I wrote it when I had a serverworks chipset on my motherboard that corrupted IDE hard drives when DMA was enabled. However, the utility lets me know there is no bit rot in my files.
It can be found at http://jdeifik.com/ , look for 'md5sum a directory tree'. It is GPL3 code. It works independently from the files being checksummed and independently of the file system. I have found flaky disks that passed every other test with this utility.
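A minimal sketch of the same idea using only standard tools (the paths are placeholders; this is not the utility linked above, just the equivalent with md5sum):

    # Record checksums for every file under /data.
    find /data -type f -print0 | xargs -0 md5sum > /root/md5sums.txt

    # Later, verify them; corrupted or missing files are reported.
    md5sum -c --quiet /root/md5sums.txt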
The other thing that can corrupt files is memory errors. Many new computers do not support ECC memory. If you care about data integrity, you should use ECC memory. Intel offers this feature for their server chips (Xeons), and AMD offers it for all of their processors (though not all motherboard makers support it).
Posted Dec 8, 2011 16:24 UTC (Thu)
by nix (subscriber, #2304)
[Link] (7 responses)
ECCRAM is worthwhile, but it is not at all cheap once you factor all that in.
Posted Dec 8, 2011 17:47 UTC (Thu)
by tytso (subscriber, #9993)
[Link] (6 responses)
It's like people who balk at spending an extra $200 to mirror their data, or to provide a hot spare for their RAID array. How much would you be willing to spend to get back your data after you discover it's been vaporized? What kind of chances are you willing to take against that eventuality happen?
It will vary from person to person, but traditionally people are terrible at figuring out cost/benefit tradeoffs.
Posted Dec 8, 2011 19:10 UTC (Thu)
by nix (subscriber, #2304)
[Link] (5 responses)
(Also, last time I tried you couldn't buy a desktop with ECCRAM for love nor money. Servers, sure, but not desktops. So of course all my work stays on the server with battery-backed hardware RAID and ECCRAM, and I just have to hope the desktop doesn't corrupt it in transit.)
Posted Dec 9, 2011 0:57 UTC (Fri)
by tytso (subscriber, #9993)
[Link] (2 responses)
I really like how quickly I can build kernels on this machine. :-)
I'll grant it's not "cheap" in absolute terms, but I've always believed that skimping on a craftsman's tools is false economy.
Posted Dec 9, 2011 7:41 UTC (Fri)
by quotemstr (subscriber, #45331)
[Link]
I have the same machine. Oddly enough, it only supports 12GB of non-ECC memory, at least according to Dell's manual. How does that happen?
(Also, Intel's processor datasheet claims that several hundred gigabytes of either ECC or non-ECC memory should be supported using the integrated memory controller. I wonder why Dell's system supports less.)
Posted Dec 9, 2011 12:40 UTC (Fri)
by nix (subscriber, #2304)
[Link]
EDAC support for my Nehalem systems landed in mainline a couple of years ago but I'll admit to never having looked into how to get it to tell me what errors may have been corrected, so I have no idea how frequent they might be.
(And if it didn't mean dealing with Dell I might consider one of those machines myself...)
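On the EDAC point: with an EDAC driver loaded, corrected and uncorrected error counts are exposed through sysfs, so something like the following shows whether the ECC has been doing any work (paths can vary by kernel version and driver):

    # Per-memory-controller corrected (ce) and uncorrected (ue) error counts.
    grep . /sys/devices/system/edac/mc/mc*/ce_count
    grep . /sys/devices/system/edac/mc/mc*/ue_count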
Posted Dec 9, 2011 13:53 UTC (Fri)
by james (subscriber, #1325)
[Link] (1 responses)
Even ECC memory isn't that much more expensive: Crucial do a 2x2GB ECC kit for £27 + VAT ($42 in the US) against £19 ($30).
Posted Dec 9, 2011 15:19 UTC (Fri)
by lopgok (guest, #43164)
[Link]
If you buy assembled computers and can't get ECC support without spending big bucks, it is time to switch vendors.
It is true that ECC memory is more expensive and less available than non-ECC memory, but the price difference is around 20% or so, and Newegg and others sell a wide variety of ECC memory. Mainstream memory manufacturers, including Kingston, sell ECC memory.
Of course, virtually all server computers come with ECC memory.
Posted Jan 15, 2012 3:45 UTC (Sun)
by sbergman27 (guest, #10767)
[Link]
The very first time we had a power failure, with a UPS with a bad battery, we experienced corruption in several of those files. Never *ever* *ever* had we experienced such a thing with EXT3. I immediately added nodelalloc as a mount option, and the EXT4 filesystem now seems as resilient as EXT3 ever was. Note that at around the same time as 2.6.30, EXT3 was made less reliable by adding the same 2.6.30 patches to it and making data=writeback the default journalling mode. So if you do move back to EXT3, make sure to mount with data=journal.
BTW, I've not noted any performance differences mounting EXT4 with nodelalloc. Maybe in a side by side benchmark comparison I'd detect something.
Posted Feb 19, 2013 10:23 UTC (Tue)
by Cato (guest, #7643)
[Link]
You might also like to try ZFS or btrfs - both have enough built-in checksumming that they should detect issues sooner, though in this case Ogg's checksumming is doing that for audio files. With a checksumming FS you could detect whether the corruption is in RAM (seen when writing to file) or on disk (seen when reading from file). ZFS also does periodic scrubbing to validate checksums.
Posted Nov 30, 2011 1:07 UTC (Wed)
by cmccabe (guest, #60281)
[Link] (39 responses)
Especially if mkfs could somehow align the block clusters to the flash page size.
Posted Nov 30, 2011 8:35 UTC (Wed)
by ebirdie (guest, #512)
[Link] (4 responses)
Posted Nov 30, 2011 21:05 UTC (Wed)
by walex (subscriber, #69836)
[Link] (3 responses)
Posted Dec 1, 2011 23:04 UTC (Thu)
by job (guest, #670)
[Link] (2 responses)
Posted Dec 3, 2011 0:34 UTC (Sat)
by walex (subscriber, #69836)
[Link] (1 responses)
I had to completely re-base a set of virtual machines that were installed by some less thoughtful predecessor into growable VMware virtual disks. Several virtual disks had several hundred thousand extents (measured with filefrag) and a couple had over a million, all mixed up randomly on the real disk. Performance was horrifying (it did not help that there were another two absurd choices in the setup). I ended up with just the mostly read-only filesystem in the VM disk, and all the writable subtrees mounted via NFS from the host machine, which was much faster. In particular during backups, because I could run the backup program (BackupPC, based on rsync) on the real machine, and remote backup is a very high IO load operation; running it inside the virtual machines on a virtual disk was much, much slower.
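For reference, the fragmentation measurement mentioned above is a one-liner (the image path is a placeholder):

    # Report how many extents a VM disk image occupies; -v lists each extent.
    filefrag /var/lib/vmware/guest/disk-flat.vmdk
    filefrag -v /var/lib/vmware/guest/disk-flat.vmdk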
Posted Dec 6, 2011 21:26 UTC (Tue)
by job (guest, #670)
[Link]
Posted Nov 30, 2011 20:22 UTC (Wed)
by walex (subscriber, #69836)
[Link] (33 responses)
Posted Nov 30, 2011 21:33 UTC (Wed)
by khim (subscriber, #9252)
[Link] (10 responses)
Posted Nov 30, 2011 23:16 UTC (Wed)
by Lennie (subscriber, #49641)
[Link] (9 responses)
Posted Dec 1, 2011 1:01 UTC (Thu)
by SLi (subscriber, #53131)
[Link] (8 responses)
Posted Dec 1, 2011 1:59 UTC (Thu)
by dlang (guest, #313)
[Link] (7 responses)
Posted Dec 1, 2011 3:29 UTC (Thu)
by tytso (subscriber, #9993)
[Link] (6 responses)
But yes, a journal by itself has as its primary feature avoiding long fsck times. One nice thing with ext4 is that fsck times are reduced (typically) by a factor of 7-12 times. So a TB file system that previously took 20-25 minutes might now only take 2-3 minutes.
If you are replicating your data anyway because you're using a cluster file system such as Hadoopfs, and you're confident that your data center has appropriate contingencies that mitigate against a simultaneous data-center-wide power loss event (i.e., you have batteries, diesel generators, etc., and you test all of this equipment regularly), then it may be that going without a journal makes sense. You really need to know what you are doing, though, and it requires careful design at the hardware level, the data center level, and in the storage stack above the local disk file system.
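Running ext4 without a journal, as described above, is done by leaving out the has_journal feature; a sketch with stock e2fsprogs (the device name is a placeholder):

    # Create an ext4 filesystem with no journal at all.
    mke2fs -t ext4 -O ^has_journal /dev/sdd1

    # Or strip the journal from an existing, unmounted filesystem and
    # recheck it.
    tune2fs -O ^has_journal /dev/sdd1
    e2fsck -f /dev/sdd1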
Posted Dec 2, 2011 18:55 UTC (Fri)
by walex (subscriber, #69836)
[Link] (5 responses)
> One nice thing with ext4 is that fsck times are reduced (typically) by a factor of 7-12 times. So a TB file system that previously took 20-25 minutes might now only take 2-3 minutes.
Posted Dec 2, 2011 19:10 UTC (Fri)
by dlang (guest, #313)
[Link] (1 responses)
Posted Dec 3, 2011 0:40 UTC (Sat)
by walex (subscriber, #69836)
[Link]
Posted Dec 2, 2011 21:41 UTC (Fri)
by nix (subscriber, #2304)
[Link] (2 responses)
Fill up the fs, even once, and this benefit goes away -- but a *lot* of filesystems sit for years mostly empty. fscking those filesystems is very, very fast these days (I've seen subsecond times for mostly-empty multi-Tb filesystems).
Posted Dec 2, 2011 22:45 UTC (Fri)
by tytso (subscriber, #9993)
[Link] (1 responses)
Not all of the improvements in fsck time come from being able to skip reading portions of the inode table. Extent tree blocks are also far more efficient than indirect blocks, and so that contributes to much of the speed improvements of fsck'ing an ext4 filesystem compared to an ext2 or ext3 file system.
Posted Dec 2, 2011 23:35 UTC (Fri)
by nix (subscriber, #2304)
[Link]
Posted Dec 4, 2011 4:25 UTC (Sun)
by alankila (guest, #47141)
[Link] (21 responses)
Posted Dec 4, 2011 4:38 UTC (Sun)
by dlang (guest, #313)
[Link] (19 responses)
Journaling writes data twice, the idea being that the first write goes to a sequential location that will be fast, and the following write goes to the random final location.
With no seek time, you should be able to write the data to its final location directly and avoid the second write. All you need to do is enforce the ordering of the writes and you should be just as safe as with a journal, without the extra overhead.
Posted Dec 4, 2011 4:49 UTC (Sun)
by mjg59 (subscriber, #23239)
[Link] (9 responses)
Posted Dec 4, 2011 5:05 UTC (Sun)
by dlang (guest, #313)
[Link] (8 responses)
if what you are writing is metadata, it seems like it shouldn't be that hard, since there isn't that much metadata to be written.
Posted Dec 4, 2011 11:32 UTC (Sun)
by tytso (subscriber, #9993)
[Link] (6 responses)
Or when you allocate a disk block, you need to modify the block allocation bitmap (or whatever data structure you use to indicate that the block is in use) and then update the data structures which map a particular inode's logical to physical block map.
Without a journal, you can't do this atomically, which means the state of the file system is undefined after an unclean/unexpected shutdown of the OS.
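To make the atomicity point concrete, here is a toy sketch in Python of the idea (purely illustrative; it is not how ext4's jbd2 journal works, and the Journal class, file name, and state layout are invented for the example). Both metadata updates are written to the journal and committed as a unit, so replay after a crash yields either both changes or neither:

import json, os

class Journal:
    def __init__(self, path):
        self.path = path

    def commit(self, updates):
        # Write the whole transaction, flush it to stable storage, then
        # "commit" it with an atomic rename.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(updates, f)
            f.flush()
            os.fsync(f.fileno())
        os.rename(tmp, self.path)

    def replay(self, state):
        # After a crash, reapply any committed transaction (idempotent).
        if os.path.exists(self.path):
            with open(self.path) as f:
                for key, value in json.load(f).items():
                    state[key] = value
        return state

state = {"block_bitmap[1042]": 0, "inode7_extents": []}
j = Journal("journal.json")
# Allocating block 1042 for inode 7 touches two structures; they are
# committed as a single transaction.
j.commit({"block_bitmap[1042]": 1, "inode7_extents": [[0, 1042]]})
# A crash before the rename leaves neither update; after it, both are replayed.
print(j.replay(state))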
Posted Dec 4, 2011 17:02 UTC (Sun)
by kleptog (subscriber, #1183)
[Link] (5 responses)
Posted Dec 6, 2011 0:40 UTC (Tue)
by cmccabe (guest, #60281)
[Link] (4 responses)
Soft updates would not work for databases, because database operations often need to be logged "logically" rather than "physically." For example, when you encounter an update statement that modifies every row of the table, you just want to add the update statement itself to the journal, not the contents of every row.
Posted Dec 6, 2011 1:24 UTC (Tue)
by tytso (subscriber, #9993)
[Link] (3 responses)
My favorite line from that article is "...and then I turn to page 8 and my head explodes."
The *BSDs didn't get advanced features such as Extended Attributes until some 2 or 3 years after Linux. My theory is that it required someone as smart as Kirk McKusick to be able to modify UFS with Soft Updates to add support for Extended Attributes and ACLs.
Also, note that because of how Soft Updates works, it requires forcing metadata blocks out to disk more frequently than without Soft Updates; it is not free. What's worse, it depends on the disk not reordering write requests, which modern disks do to avoid seeks (in some cases a write cannot make it onto the platter for 5-10 seconds or more in the absence of a Cache Flush request). If you disable the HDD's write caching, you lose a lot of performance; if you leave it enabled (which is the default), your data is not safe.
Posted Dec 11, 2011 10:18 UTC (Sun)
by vsrinivas (subscriber, #56913)
[Link]
Posted Dec 21, 2011 23:09 UTC (Wed)
by GalacticDomin8r (guest, #81935)
[Link] (1 responses)
Duh. Can you name a file system with integrity features that doesn't introduce a performance penalty? I thought not. The point is that the Soft Updates method has (far) less overhead than most.
> What's worse, it depends on the disk not reordering write requests
Bald-faced lie. The only requirement of Soft Updates is that writes reported as done by the disk driver have indeed safely landed in nonvolatile storage.
Posted Dec 22, 2011 11:32 UTC (Thu)
by nix (subscriber, #2304)
[Link]
Posted Dec 4, 2011 17:13 UTC (Sun)
by mjg59 (subscriber, #23239)
[Link]
Posted Dec 4, 2011 10:31 UTC (Sun)
by alankila (guest, #47141)
[Link] (8 responses)
Anyway, isn't btrfs going to give us journal-less but atomic filesystem modification behavior?
Posted Dec 4, 2011 11:43 UTC (Sun)
by tytso (subscriber, #9993)
[Link] (4 responses)
So if you modify a node at the bottom of the b-tree, you write a new copy of the leaf block, but then you need to write a copy of its parent node with a pointer to the new leaf block, and then you need to write a copy of its grandparent, with a pointer to the new parent node, all the way up to the root of the tree. This also implies that all of these nodes had better be in memory, or you will need to read them into memory before you can write them back out. Which is why CoW file systems tend to be very memory hungry; if you are under a lot of memory pressure because you're running a cloud server, and are trying to keep lots of VM's packed into a server (or are on an EC2 VM where extra memory costs $$$$), good luck to you.
At least in theory, CoW file systems will try to batch multiple file system operations into a single big transaction (just as ext3 will try to batch many file system operations into a single transaction, to try to minimize writes to the journal). But if you have a really fsync()-happy workload, there definitely could be situations where a CoW file system like btrfs or ZFS could end up needing to update more blocks on an SSD than a traditional update-in-place file system with journaling, such as ext3 or XFS.
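As a rough illustration of the path-copying cost described above, here is a small Python sketch using an in-memory binary search tree (not btrfs's on-disk b-tree; the node layout is invented). Updating one leaf allocates fresh copies of every node from the leaf up to a new root, while untouched subtrees are shared with the old version:

class Node:
    __slots__ = ("key", "value", "left", "right")
    def __init__(self, key, value, left=None, right=None):
        self.key, self.value, self.left, self.right = key, value, left, right

def cow_insert(node, key, value):
    # Return a new root; every node on the path to the change is copied,
    # everything else is shared with the previous version of the tree.
    if node is None:
        return Node(key, value)
    if key < node.key:
        return Node(node.key, node.value, cow_insert(node.left, key, value), node.right)
    if key > node.key:
        return Node(node.key, node.value, node.left, cow_insert(node.right, key, value))
    return Node(key, value, node.left, node.right)  # same key: fresh copy, new value

root_v1 = None
for k in (50, 30, 70, 20, 40):
    root_v1 = cow_insert(root_v1, k, "v%d" % k)
root_v2 = cow_insert(root_v1, 40, "updated")   # rewrites 40, 30, and the root
assert root_v2.right is root_v1.right          # the untouched subtree is shared
assert root_v2.left is not root_v1.left        # the modified path is not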
Posted Dec 12, 2011 12:13 UTC (Mon)
by jlokier (guest, #52227)
[Link] (3 responses)
Posted Dec 24, 2011 20:56 UTC (Sat)
by rich0 (guest, #55509)
[Link] (2 responses)
I believe Btrfs actually uses a journal, and then updates the tree every 30 seconds. This is a compromise between pure journal-less COW behavior and the memory-hungry behavior described above. So, the tree itself is always in a clean state (if the change propagates to the root then it points to an up-to-date clean tree, and if it doesn't propagate to the root then it points to a stale clean tree), and then the journal can be replayed to catch the last 30 seconds' worth of writes.
I believe that the Btrfs journal does effectively protect both data and metadata (equivalent to data=ordered). Since data is not overwritten in place, you end up with what appear to be atomic writes, I think (within a single file only).
Posted Dec 24, 2011 22:17 UTC (Sat)
by jlokier (guest, #52227)
[Link] (1 responses)
In fact you can. The simplest illustration: for every tree node currently, allocate 2 on storage, and replace every pointer in a current interior node format with 2 pointers, pointing to the 2 allocated storage nodes. Those 2 storage nodes both contain a 2-bit version number. The one with the larger version number (using wraparound comparison) is the "current node", and the other is the "potential node". To update a tree node in COW fashion, without writing all the way up the tree on every update, simply locate the tree node's "potential node" partner, and overwrite that in place with a version 1 higher than the existing tree node. The tree is thus updated. It is made atomic using the same methods as needed for a robust journal: if it's a single sector and the medium writes those atomically, or by using a node checksum, or by writing the version number at start and end if the medium is sure to write sequentially. Note I didn't say it made reading any faster :-) (Though with non-seeking media, speed might not be a problem.)

That method is clearly space inefficient and reads slowly (unless you can cache a lot of the node selections). It can be made more efficient in a variety of ways, such as sharing "potential node" space among multiple potential nodes, or having a few pre-allocated pools of "potential node" space which migrate into the explicit tree with a delay - very much like multiple classical journals. One extreme of that strategy is a classical journal, which can be viewed as every tree node having an implicit reference to the same range of locations, any of which might be regarded as containing that node's latest version overriding the explicit tree structure. You can imagine there a variety of structures with space and behaviour in between a single, flat journal and an explicitly replicated tree of micro-journalled nodes.

The "replay" employed by classical journals also has an analogue: preloading of node selections either on mount, or lazily as parts of the tree are first read in after mounting, potentially updating tree nodes at preload time to reduce the number of pointer traversals on future reads. The modern trick of "mounted dirty" bits for large block ranges in some filesystems to reduce fsck time also has a natural analogue: dirty subtree bits, indicating whether the "potential" pointers (implicit or explicit) need to be followed or can be ignored. Those bits must be set with a barrier in advance of using the pointers, but they don't have to be set again for new updates after that, and can be cleaned in a variety of ways; one of which is the preload mentioned above.
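Here is a minimal in-memory sketch of the two-slot scheme described in the comment above (names and layout invented for illustration; a real implementation would put the two slots on storage and protect each write with a checksum or single-sector atomicity, as the comment notes). Each logical node owns two slots carrying a 2-bit version; reads pick the newer slot by wraparound comparison, and an update overwrites only the older slot, so no parent pointers need rewriting:

def newer(a, b):
    # 2-bit wraparound comparison; the two slots always differ by one step.
    return ((a - b) & 3) == 1

class SlottedNode:
    def __init__(self, payload):
        # (version, payload) pairs; slot 0 starts out as the current one.
        self.slots = [(1, payload), (0, None)]

    def _current(self):
        return 0 if newer(self.slots[0][0], self.slots[1][0]) else 1

    def read(self):
        return self.slots[self._current()][1]

    def update(self, payload):
        cur = self._current()
        version = self.slots[cur][0]
        # Overwrite only the non-current slot; the old version stays intact
        # until this write completes, which is what makes the update atomic.
        self.slots[1 - cur] = ((version + 1) & 3, payload)

n = SlottedNode({"children": ["A", "B"]})
n.update({"children": ["A", "B", "C"]})      # no parent node is touched
assert n.read() == {"children": ["A", "B", "C"]}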
Posted May 29, 2012 8:49 UTC (Tue)
by marcH (subscriber, #57642)
[Link]
"You can implement a COW tree without writing all the way up the tree if your tree implements versioning".
Posted Dec 4, 2011 13:11 UTC (Sun)
by dlang (guest, #313)
[Link] (2 responses)
This is OK with large streaming writes, but horrible with many small writes to the same area of disk.
The journal is many small writes to the same area of disk, exactly the worst case for an SSD.
Also, with rotational media, writing all the blocks in place requires many seeks before the data can be considered safe, and if you need to write the blocks in a particular order, you may end up seeking back and forth across the disk. With an SSD, the order the blocks are written in doesn't affect how long it takes to write them.
By the way, I'm not the OP who said that all journaling filesystems are bad on SSDs; I'm just pointing out some reasons why this could be the case.
Posted Dec 4, 2011 17:39 UTC (Sun)
by tytso (subscriber, #9993)
[Link] (1 responses)
This might be the case for cheap MMC or SD cards that are designed for use in digital cameras, but an SSD which is meant for use in a computer will have a much more sophisticated FTL than that.
Posted Dec 4, 2011 19:38 UTC (Sun)
by dlang (guest, #313)
[Link]
Yes, in theory it could mark that 4KB of data as being obsolete and only write new data to a new eraseblock, but that would lead to fragmentation where the disk could have 256 1MB chunks, each with 4KB of obsolete data in them, and to regain any space it would then need to re-write 255MB of data.
Given the performance impact of stalling for this long on a write (not to mention the problems you would run into if you didn't have that many blank eraseblocks available), I would assume that if you re-write a 4KB chunk, when the drive writes that data it will re-write the rest of the eraseblock as well so that it can free up the old eraseblock.
The flash translation layer lets it mix logical blocks within eraseblocks, and the drives probably do something in between the two extremes I listed above (so they probably track a few holes, but not too many).
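A toy model of the two extremes sketched above, using the 1MB-eraseblock/4KB-page arithmetic from the comment (the functions are illustrative arithmetic, not a description of any real drive's FTL):

ERASEBLOCK_PAGES = 256   # a 1MB eraseblock holds 256 4KB pages

def cost_rewrite_whole_block():
    # Strategy 1: on a 4KB overwrite, copy the other 255 live pages into a
    # fresh eraseblock immediately and erase the old one.
    return ERASEBLOCK_PAGES

def cost_lazy_gc(fragmented_blocks, holes_per_block=1):
    # Strategy 2: just mark the old 4KB page obsolete. Cheap now, but to
    # reclaim space later the drive must copy all the live pages out of the
    # fragmented eraseblocks.
    live_pages = ERASEBLOCK_PAGES - holes_per_block
    return fragmented_blocks * live_pages

print("immediate rewrite:", cost_rewrite_whole_block(), "pages per 4KB update")
print("lazy GC of 256 one-hole blocks:", cost_lazy_gc(256),
      "pages copied (~255MB) to reclaim one eraseblock of space")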
Posted Dec 5, 2011 1:34 UTC (Mon)
by cmccabe (guest, #60281)
[Link]
alankila said:
Well, SSDs have a limited number of write cycles. With metadata journaling, you're effectively writing all the metadata changes twice instead of once. That will wear out the flash faster. I think a filesystem based on soft updates might do well on SSDs.
Of course the optimal thing would be if the hardware would just expose an actual MTD interface and let us use NilFS or UBIFS. But so far, that shows no signs of happening. The main reason seems to be that Windows is not able to use raw MTD devices, and most SSDs are sold into the traditional Windows desktop market.
Valerie Aurora also wrote an excellent article about the similarities between SSD block remapping layers and log structured filesystems here: http://lwn.net/Articles/353411/
Posted Nov 30, 2011 2:05 UTC (Wed)
by nix (subscriber, #2304)
[Link] (7 responses)
Posted Dec 1, 2011 10:46 UTC (Thu)
by trasz (guest, #45786)
[Link] (6 responses)
Posted Dec 1, 2011 16:36 UTC (Thu)
by tytso (subscriber, #9993)
[Link] (5 responses)
Posted Dec 1, 2011 20:03 UTC (Thu)
by trasz (guest, #45786)
[Link] (4 responses)
Posted Dec 2, 2011 23:43 UTC (Fri)
by walex (subscriber, #69836)
[Link] (3 responses)
«UFS as found in FreeBSD 10 uses 32kB/4kB» That is terrible, because it means that except for the tail the system enforces a fixed 32KiB read-ahead and write-behind, rather than an adaptive (or at least tunable) one.
Posted Dec 3, 2011 1:01 UTC (Sat)
by walex (subscriber, #69836)
[Link] (1 responses)
BTW many years ago I persuaded the original developer of ext to not implement in it the demented BSD FFS idea of large block/small fragment, arguing that adaptive read-ahead and write-behind would give better dynamic performance, and adaptive allocate-ahead (reservations) better contiguity, without the downsides. Not everything got implemented as I suggested, but at least all the absurd complications of large block/small fragment (for example the page mapping issues) were avoided in Linux, as well as the implied fixed ra/wb/aa.
Posted Dec 3, 2011 11:06 UTC (Sat)
by nix (subscriber, #2304)
[Link]
Posted Jan 3, 2012 17:38 UTC (Tue)
by jsdyson (guest, #71944)
[Link]
Posted Nov 30, 2011 12:36 UTC (Wed)
by tialaramex (subscriber, #21167)
[Link] (9 responses)
I have an application where what we really want is lots of RAM. But RAM is expensive. We can afford to buy 2TB of RAM, but not 20TB. However we can afford to go quite a lot slower than RAM sometimes so long as our averages are good enough, so our solution is to use SSDs plus RAM, via mmap()
When we're lucky, the page we want is in RAM, we update it, and the kernel lazily writes it back to an SSD whenever. When we're unlucky, the SSD has to retrieve the page we need, which takes longer and of course forces one of the other pages out of cache, in the worst case forcing it to wait for that page to be written first. We can arrange to be somewhat luckier than pure chance would dictate, on average, but we certainly can't make this into a nice linear operation.
Right now, with 4096 byte pages, the performance is... well, we're working on it but it's already surprisingly good. But if bigalloc clusters mean the unit of caching is larger, it seems like bad news for us.
Posted Nov 30, 2011 14:34 UTC (Wed)
by Seegras (guest, #20463)
[Link] (3 responses)
You're not supposed to make filesystems with bigalloc clusters if you don't want them or if it hampers your performance.
Posted Nov 30, 2011 18:22 UTC (Wed)
by tialaramex (subscriber, #21167)
[Link] (2 responses)
Posted Nov 30, 2011 19:19 UTC (Wed)
by jimparis (guest, #38647)
[Link] (1 responses)
Posted Nov 30, 2011 20:02 UTC (Wed)
by iabervon (subscriber, #722)
[Link] (1 responses)
Posted Nov 30, 2011 21:16 UTC (Wed)
by walex (subscriber, #69836)
[Link]
Note that 'ext4' supports extents, so files can get allocated with very large contiguous extents already; for example, for a 70MB file:
# du -sm /usr/share/icons/oxygen/icon-theme.cache
69 /usr/share/icons/oxygen/icon-theme.cache
# filefrag /usr/share/icons/oxygen/icon-theme.cache
/usr/share/icons/oxygen/icon-theme.cache: 1 extent found
# df -T /usr/share/icons/oxygen/icon-theme.cache
Filesystem Type 1M-blocks Used Available Use% Mounted on
/dev/sda3 ext4 25383 12558 11545 53% /
But so far the free space has been tracked in block-sized units, and the new thing seems to change the amount of free space accounted for by each bit in the free space bitmap. Which means that, as surmised, the granularity of allocation has changed (for example the minimum extent size).
Posted Nov 30, 2011 21:58 UTC (Wed)
by cmccabe (guest, #60281)
[Link] (2 responses)
mmap is such an elegant facility, but it lacks a few things. The first is a way to handle I/O errors reasonably. The second is a way to do nonblocking I/O. You can sort of fudge the second point by using mincore(), but it doesn't work that well.
As far as performance goes... SSDs are great at random reads, but small random writes are often not so good. Don't assume that you can write small chunks anywhere on the flash "for free." The firmware has to do a lot of write coalescing to even make that possible, let alone fast.
bigalloc might very well be slower for you if you have poor locality; for example, if most data structures are smaller than 4k and you never access two sequential data structures. If you have bigger data structures, bigalloc could very well end up being faster.
If you have poor locality, you should try reducing readahead in /sys/block/sda/queue/read_ahead_kb or wherever. There's no point reading bytes that you're not going to access.
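On the mincore() "fudge" mentioned above: a Linux-only sketch via ctypes (file name and size are invented) that asks the kernel which pages of a mapping are resident, so a caller could hand faults on cold pages to a prefetch thread instead of blocking on them:

import ctypes, ctypes.util, mmap, os

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)

def resident_map(mm):
    # One boolean per page: is that page of the mapping in the page cache?
    pages = (len(mm) + mmap.PAGESIZE - 1) // mmap.PAGESIZE
    vec = (ctypes.c_ubyte * pages)()
    addr = ctypes.addressof(ctypes.c_char.from_buffer(mm))
    if libc.mincore(ctypes.c_void_p(addr), ctypes.c_size_t(len(mm)), vec):
        errno = ctypes.get_errno()
        raise OSError(errno, os.strerror(errno))
    return [bool(b & 1) for b in vec]

with open("data.bin", "wb") as f:            # scratch file for the demo
    f.write(b"\0" * (16 * mmap.PAGESIZE))
f = open("data.bin", "r+b")
mm = mmap.mmap(f.fileno(), 0)
print(resident_map(mm))                      # freshly written pages are likely resident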
Posted Dec 1, 2011 16:56 UTC (Thu)
by tialaramex (subscriber, #21167)
[Link] (1 responses)
Yes, our locality is fairly poor such that readahead is actively bad news. The data structures which dominate are exactly page-sized. We may end up changing anything from a few bytes to a whole page (and even when we write a whole page we need the old contents to determine the new contents), but the chance we then move on to the linearly next (or previous) page is negligible.
My impression was that readahead would be disabled by suitable incantations of madvise(). Is that wrong? It didn't benchmark as wrong on toy systems, but I would have to check whether we actually re-tested on the big machines.
Posted Dec 1, 2011 20:36 UTC (Thu)
by cmccabe (guest, #60281)
[Link]
> a previously unallocated block on a full filesystem for example.
If I were you, I'd use posix_fallocate to de-sparsify (manifest?) all of the blocks. Then you don't have unpleasant surprises waiting for you later.
> My impression was that readahead would be disabled by suitable
> incantations of madvise(). Is that wrong? It didn't benchmark as wrong on
> toy systems, but I would have to check whether we actually re-tested on
> the big machines.
I looked at mm/filemap.c and found this:
> static void do_sync_mmap_readahead(...) {
>         ...
>         if (VM_RandomReadHint(vma))
>                 return;
>         ...
> }
So I'm guessing you're safe with MADV_RANDOM. But it might be wise to check the source of the kernel you're using in case something is different in that version.
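Both suggestions above (posix_fallocate() to de-sparsify the backing file, MADV_RANDOM to suppress readahead) are available from user space; here is a small sketch using Python's wrappers, os.posix_fallocate and mmap.madvise (Python 3.8 or later on Linux; the file name and size are invented):

import mmap, os

SIZE = 64 * 1024 * 1024                      # example 64MB working set

fd = os.open("store.bin", os.O_RDWR | os.O_CREAT, 0o600)
os.posix_fallocate(fd, 0, SIZE)              # de-sparsify: blocks exist up front
mm = mmap.mmap(fd, SIZE)
mm.madvise(mmap.MADV_RANDOM)                 # the hint the kernel code above checks

mm[4096:4100] = b"demo"                      # dirty one page; no readahead of neighbors
mm.flush()
mm.close()
os.close(fd)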
Posted Nov 30, 2011 16:00 UTC (Wed)
by corbet (editor, #1)
[Link] (2 responses)
Posted Nov 30, 2011 19:02 UTC (Wed)
by zuki (subscriber, #41808)
[Link]
Posted Dec 1, 2011 15:17 UTC (Thu)
by obi (guest, #5784)
[Link]
Posted Nov 30, 2011 21:22 UTC (Wed)
by mleu (guest, #73224)
[Link] (10 responses)
Posted Nov 30, 2011 22:11 UTC (Wed)
by jospoortvliet (guest, #33164)
[Link]
As a matter of fact, there is work going on to allow use of snapper with Ext4 so SUSE ain't jumping ship there.
Posted Dec 3, 2011 0:25 UTC (Sat)
by walex (subscriber, #69836)
[Link] (8 responses)
Don't worry about SLES. Reiser3, after some initial issues, was actually quite robust, and was designed for robustness. If there were issues after the initial shaking-down period it was because of the O_PONIES problem that causes so much distrust against ext4 itself, and previously against XFS; but not against JFS, because JFS has always had a rather twitchy flushing logic, sort of equivalent to the short flushout ext3 has always had. Indeed ext3 got a good reputation mostly just because, even when it did not support barriers, it had a very short flushing interval etc., which made it seemingly resilient in many cases to sudden power off, even for applications that did not issue fsync(2).
To some extent it is sad that SLES switched to the ext line, but I guess a large part of it was marketing (it is an industry standard) and the sad story with Namesys.
Posted Dec 3, 2011 11:30 UTC (Sat)
by mpr22 (subscriber, #60784)
[Link] (7 responses)
Posted Dec 3, 2011 17:57 UTC (Sat)
by walex (subscriber, #69836)
[Link] (3 responses)
That's an interesting case. ReiserFS was designed to be very robust in the face of partial data loss, allowing for a reconstruction of the file system metadata from recognizable copies embedded in the files themselves. Thus the contents of an embedded ReiserFS image will look like lost files from the containing filesystem, if the option to reconstruct metadata is enabled. Running man reiserfsck is advised before doing recovery on a damaged ReiserFS image. Paying particular attention to the various mentions of --rebuild-tree may be wise. In other words there is nothing to fix, except a lack of awareness of one of the better features of ReiserFS, or perhaps a lack of a specific warning.
Posted Dec 4, 2011 1:05 UTC (Sun)
by nix (subscriber, #2304)
[Link] (2 responses)
I don't think it was ever any part of reiserfsck's design to "reconstruct[] file system metadata from recognizable copies embedded in the files themselves" because nobody ever does that (how many copies of your inode tables do you have written to files in your filesystem for safety? None, that's right). It's more that reiserfsck --rebuild-tree simply scanned the whole partition for things that looked like btree nodes, and if it found them, it assumed they came from the same filesystem, and not from a completely different filesystem that happened to be merged into it -- there was no per-filesystem identifier in each node or anything like that, so they all got merged together.
This is plainly a design error, but equally plainly not one that would have been as obvious when reiserfs was designed as it is now, when disks were smaller than they are today and virtual machine images much rarer.
If you want some real fun, try a reiserfs filesystem with an ext3 filesystem inside it and another reiserfs filesystem embedded in that. To describe what reiserfsck --rebuild-tree on the outermost filesystem does to the innermost two would require using words insufficiently family-friendly for this site (though it is extremely amusing if you have nothing important on the fs).
Posted Dec 8, 2011 4:26 UTC (Thu)
by gmatht (subscriber, #58961)
[Link] (1 responses)
If someone has stored all their precious photos and media files on a disk, and the metadata is trashed, then rebuilding the tree should get them their files back where a regular fsck wouldn't. I wouldn't trust --rebuild-tree not to add random files at the best of times, for example, I understand that it restores deleted files [1] which you probably don't want to do in a routine fsck. If, on the other hand, you've just found out that all your backups are on write-only media, rebuilding a tree from leaves could save you from losing years of work. It would be even better if it didn't merge partitions, but is still better than nothing if used as a last resort.
I think it would also be better if it encouraged you to rebuild the tree onto an entirely new partition.
[1] http://www.linuxquestions.org/linux/answers/Hardware/Reis...
Posted Dec 8, 2011 5:35 UTC (Thu)
by tytso (subscriber, #9993)
[Link]
With ext2/3/4, we have a static inode table. This does have some disadvantages, but the advantage is that it's much more robust against file system damage, since the location of the metadata is much more predictable.
Posted Dec 12, 2011 16:15 UTC (Mon)
by nye (subscriber, #51576)
[Link] (2 responses)
The existing replies have basically answered this, but just to make it clear:
You could always do that.
Reiserfs *additionally* came with an *option* designed to make a last-ditch attempt at recovering a totally hosed filesystem by looking for any data on the disk that looked like Reiserfs data structures and making its best guess at rebuilding it based on that.
Somehow the FUD brigade latched on to the drawbacks of that feature and conveniently 'forgot' that it was neither the only, nor the default fsck method.
Posted Dec 12, 2011 16:46 UTC (Mon)
by jimparis (guest, #38647)
[Link] (1 responses)
Posted Dec 14, 2011 12:15 UTC (Wed)
by nye (subscriber, #51576)
[Link]
Maybe I did. Or maybe you got unlucky. Most of the people commenting on it though *never tried*; they just heard something bad via hearsay and parrotted it, and that just gets to me.
Posted Dec 4, 2011 13:14 UTC (Sun)
by dlang (guest, #313)
[Link]
A few wrong bits in a Vorbis stream seem likely to give you more than just "one wrong sample".
shotgun debugging
What you're trying to do with moving back to ext3 is what the Jargon File calls shotgun debugging: trying out some radical move in hopes that this will fix your problem.
Pretty far off-topic, but: it is a rare situation indeed where the removal of information will improve the fidelity of a signal. One might not be able to hear the difference, but I have a hard time imagining how conversion between lossy formats could do anything but degrade the quality. You can't put back something that the first lossy encoding took out, but you can certainly remove parts of the signal that the first encoding preserved.
Most well-done benchmarks I have seen show them delivering mostly equivalent performance, with XFS leading the group in scalability, JFS pretty good across the field, and 'ext4', just like the previous 'ext's, being good only on totally freshly loaded filesystems (as it packs newly created files pretty densely) and when there is ample caching (no use of 'O_DIRECT'); both fresh loading and caching mask its fundamental, BSD-FFS-derived, downsides.
It is very very easy to do meaningless filesystem benchmarks (the vast majority that I see on LWN and most others are worthless).
"..., for a long time (even after it was the default in Fedora), LVM did not"
I am rather sure at least ext4 and xfs do it that way.
jbd does something similar, but I don't want to look it up unless you're really interested.
I shouldn't respond to this troll-bait, but nonetheless...
> The big problem with 'ext4' is that its only reason to be is to allow Red Hat customers to upgrade in place existing systems, and what Red Hat wants, Red Hat gets (also because they usually pay for that and the community is very grateful).
Interesting. tytso wasn't working for RH when ext4 started up, and still isn't working for them now. So their influence must be more subtle.
> In particular JFS should have been the "default" Linux filesystem instead of ext[23] for a long time. Not making JFS the default was probably the single worst strategic decision for Linux (but it can be argued that letting GKH near the kernel was even worse).
Ah, yeah. Because stable kernels, USB support, mentoring newbies, the driver core, -staging... all these things were bad.
Also, uhm. Didn't he work for Suse?
$ ls -ld /dev/tty9
crw--w---- 1 root tty 4, 9 2011-11-28 14:03 /dev/tty9
$ cat /sys/class/tty/tty9/dev
4:9
> ext4 is in large part a new filesystem whose name just happens to be similar to what people are running
ext4 is ext3 with a bunch of new extensions (some incompatible): indeed, initially the ext4 work was going to be done to ext3, until Linus asked for it to be done in a newly-named clone of the code instead. It says a lot for the ext2 code and disk formats that they've been evolvable to this degree.
I wrote a trivial python script to generate a checksum file for each directory's files. If you run it, and it finds a checksum file, it checks that the files in the directory match the checksum file, and if they don't it reports that.
It is very cheap insurance.
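For readers who want something similar, here is a minimal sketch of that kind of per-directory checksum script (the .sha256sums file name and the layout are assumptions, not the commenter's actual script): on the first run it records a digest for every file in each directory, and on later runs it reports any file whose digest no longer matches:

import hashlib, os, sys

SUMFILE = ".sha256sums"

def file_hash(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def process(directory):
    sums_path = os.path.join(directory, SUMFILE)
    names = sorted(n for n in os.listdir(directory)
                   if n != SUMFILE and os.path.isfile(os.path.join(directory, n)))
    current = {n: file_hash(os.path.join(directory, n)) for n in names}
    if os.path.exists(sums_path):
        with open(sums_path) as f:
            recorded = dict(line.rstrip("\n").split("  ", 1)[::-1]
                            for line in f if line.strip())
        for name, digest in current.items():
            if name in recorded and recorded[name] != digest:
                print("MISMATCH:", os.path.join(directory, name))
    else:
        with open(sums_path, "w") as f:
            for name, digest in current.items():
                f.write("%s  %s\n" % (digest, name))

for root, dirs, files in os.walk(sys.argv[1] if len(sys.argv) > 1 else "."):
    process(root)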
> It is very cheap insurance.
Look at the price differential between the motherboards and CPUs that support ECC RAM and those that do not. Now add in the extra cost of the RAM.
Take a look here. Note the linux version number...
And this is relevant to ext4... exactly how?
That is the case only for fully undamaged filesystems, that is, the common case of a periodic filesystem check. I have never seen any reports that the new 'e2fsck' is faster on damaged filesystems too. And since a damaged 1.5TB 'ext3' filesystem was reported to take 2 months to 'fsck', even a factor of 10 is not going to help a lot.
A filesystem after an unclean shutdown is usually not that damaged; serious damage can however happen with a particularly bad unclean shutdown (lots of stuff in flight, for example on a wide RAID) or with RAM/disk errors. The report I saw was not for an "enterprise" system with battery, ECC and a redundant storage layer.
> We could fix things so that as you delete files from a full file system, we reduce the high watermark field for each block group's inode table
That seems hard to me. It's easy to tell if you need to increase the high watermark when adding a new file: but when you delete one, how can you tell what to reduce the high watermark to without doing a fairly expensive scan?
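To make the objection concrete, the obvious way to lower the watermark after a delete is to rescan that block group's inode bitmap for the highest bit still set, as in this toy sketch (8192 inodes per group is just an example figure). The scan itself is small, but it is extra work, and possibly an extra bitmap read, on every unlink, rather than something done only at mkfs or fsck time:

def new_watermark(inode_bitmap):
    # Highest in-use inode index in the group, or -1 if the group is empty.
    for i in range(len(inode_bitmap) - 1, -1, -1):
        if inode_bitmap[i]:
            return i
    return -1

bitmap = [False] * 8192
for used in (0, 1, 2, 7000):
    bitmap[used] = True
bitmap[7000] = False           # delete the last file in the group...
print(new_watermark(bitmap))   # ...and the watermark drops back to 2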
You can't implement a COW tree without writing all the way up the tree. You write a new node to the tree, so you have to have the tree point to it. You either copy an existing parent node and fix it, or you overwrite it in place. If you do the latter, then you aren't doing COW. If you copy the parent node, then its parent is pointing to the wrong place, all the way up to the root.
> > No journaled filesystem is good for SSDs
> Just to get the argument out in the open, what is the basis
> for making this claim?
> rather than allocate single blocks, a filesystem using clusters will allocate them in larger groups
Like FAT, only less forced-by-misdesign. Everything old is new again...
My usual luck holds...the upcoming e2fsprogs release mentioned in the article became official moments after my last look at the ext4 mailing list before posting.
As a SLES customer, reading these (great) LWN articles just gives me the feeling I'm once again on the wrong side of the filesystem situation.
Did anyone ever fix the ReiserFS tools to the point that you could safely fsck a ReiserFS volume that contained an uncompressed ReiserFS image?