ext4 and data loss

Posted Mar 12, 2009 18:12 UTC (Thu) by davecb (subscriber, #1574)
In reply to: ext4 and data loss by jimparis
Parent article: ext4 and data loss

On a system that predates POSIX and/or logging filesystems, you will get the behavior you expect: this is exactly the Unix V6 behavior. The data blocks will be written out, then the inode's length field will be updated, then the (atomic) rename will complete and the file will be replaced.
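For concreteness, here is that sequence as a minimal C sketch (the function name, temp-file suffix, and error handling are mine, purely illustrative). On a POSIX system the explicit fsync() is what buys back the ordering that V6 gave implicitly:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Minimal sketch of "write new contents, then atomically replace":
     * the data blocks go out first, then the rename swaps the file in. */
    int replace_file(const char *path, const char *data, size_t len)
    {
        char tmp[4096];
        snprintf(tmp, sizeof(tmp), "%s.tmp", path);

        int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return -1;

        /* Write the data blocks and force them to disk... */
        if (write(fd, data, len) != (ssize_t)len || fsync(fd) != 0) {
            close(fd);
            unlink(tmp);
            return -1;
        }
        if (close(fd) != 0) {
            unlink(tmp);
            return -1;
        }
        /* ...and only then replace the old file atomically. */
        return rename(tmp, path);
    }

Without the fsync(), a crash in the delayed-write window can commit the rename while the data blocks are still only in memory, which is exactly the failure under discussion.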

POSIX doesn't guarantee that: it deliberately leaves implementations free to delay or reorder writes for performance, which weakens those guarantees.

Research filesystems tried both, and found that one could get considerable performance advantages by reordering the writes into elevator order, and delaying them until there was enough data to coalesce adjacent writes. Some of this is now broadly available as SCSI's "tagged queueing". Alas, if a write failed, the on-disk data was now inconsistent, and one could end up with a disk of garbage.

A former colleague, then at UofT, found he could reorder and coalesce with great benefit, so long as he inserted "barriers" into the sequence wherever there were correctness-critical orderings. Those barriers had to remain, but most of the performance could be kept, with a write cache and a delay of a few seconds.
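As a rough illustration of that scheme (all names here are hypothetical; this is a toy, not any real scheduler's code): requests between barriers may be sorted into elevator order and coalesced, but nothing may cross a barrier.

    #include <stdlib.h>

    /* A pending write: starting block and length, or a barrier marker. */
    struct wreq {
        long block;
        long nblocks;
        int  barrier;   /* 1 = correctness-critical ordering point */
    };

    /* Stand-in for the real driver entry point. */
    void disk_write(long block, long nblocks);

    static int by_block(const void *a, const void *b)
    {
        const struct wreq *x = a, *y = b;
        return (x->block > y->block) - (x->block < y->block);
    }

    /* Sort one barrier-free run into elevator order, coalescing
     * writes that turn out to be adjacent on disk. */
    static void issue_run(struct wreq *q, size_t n)
    {
        if (n == 0)
            return;
        qsort(q, n, sizeof(*q), by_block);
        for (size_t i = 0; i < n; ) {
            struct wreq r = q[i++];
            while (i < n && q[i].block == r.block + r.nblocks)
                r.nblocks += q[i++].nblocks;
            disk_write(r.block, r.nblocks);
        }
    }

    /* Reorder and merge freely within a run, never across a barrier. */
    void flush_queue(struct wreq *q, size_t n)
    {
        size_t start = 0;
        for (size_t i = 0; i < n; i++) {
            if (q[i].barrier) {
                issue_run(q + start, i - start);
                start = i + 1;
            }
        }
        issue_run(q + start, n - start);
    }

The win comes from how rare the barriers are: most writes can still be delayed and merged, while the handful of correctness-critical orderings survives.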

Now we're working with journaled filesystems, which reduce the cost of preserving order even more, but have separated metadata from data updates. This introduced a new opportunity to inadvertently order updates in a way that broke the older, but unpublished, correctness criteria.

Some journaled filesystems guarantee that the sequence you (and I) use is correctness-preserving. ZFS is one of these. Others, including ext3 and ext4, leave a window in which a crash will render the filesystem inconsistent. Ext3 has a small window, and for unknown reasons, ext4 has one as wide as the delay period.

I'm of the opinion that both could have arbitrarily small risk periods, and that with a persistent write cache or journal both could avoid the risk entirely. However, changing the algorithm to one which is correctness-preserving would arguably be a better answer.

--dave

