
A NILFS2 score card

November 7, 2012

This article was contributed by Neil Brown

A recurring theme in the comments on various articles announcing the new f2fs "flash-friendly" filesystem was that surely some other filesystem might already meet the need, or could be adjusted to meet the need, rather than creating yet another new filesystem. This is certainly an interesting question, but not one that is easy to answer. The cost/benefit calculation for creating a new filesystem versus enhancing an existing one involves weighing many factors, including motivational, social, political, and even, to some extent, technical. It is always more fun building something from scratch; trying to enter and then influence an existing community is at best unpredictable.

Of the various factors, the only ones for which there is substantial visibility to the outside observer are the technical factors, so while they may not be the major factors, they are the ones that this article will explore. In particular, we will examine "NILFS2", a filesystem which has been part of Linux for several years and is — superficially at least — one of the best contenders as a suitable filesystem for modest-sized flash storage.

NILFS2 was not written primarily to be a flash-based filesystem, so comparing it head-to-head on that basis with f2fs (which was) might not be entirely fair. Instead we will examine it on its own merits, comparing it with f2fs occasionally to provide useful context, and ask "could this have made a suitable base for a flash-focused filesystem?"

NILFS2: what is it?

NILFS2 is the second iteration of the "New Implementation of a Log-structured File System". It is described as a "Continuous Snapshotting" filesystem, a feature which will be explored in more detail shortly.

NILFS2 appears to still be under development, with lots of core functionality present, but a number of important features still missing, such as extended attributes, quotas, and an fsck tool. As such, it is in a similar state to f2fs: well worth experimenting with, but not really ready for production usage yet.

In contrast with f2fs, NILFS2 uses 64 bits for all block addressing, 96 bits for time stamps (nanoseconds forever!), but only 16 bits for link counts (would you ever have more than 65535 links to a file, or sub-directories in one directory?). F2fs, in its initial release, uses 32 bits for each of these values.

While f2fs is a hybrid LFS (Log-structured Filesystem), using update-in-place in a number of cases, NILFS2 is a pure LFS. With the exception of the superblock (stored twice, once at either end of the device), everything is written in one continuous log. Data blocks are added to the log, then indexing information, then inodes, then indexes for the inodes, and so on. Occasionally a "super root" inode is written, from which all other blocks in the filesystem can be found. The address of the latest "super root" is stored in the superblock, along with static values for various parameters of the filesystem and a couple of other volatile values such as the number of free blocks.

Whenever a collection of blocks is written to the log, it is preceded by a segment summary which identifies all the blocks in the segment (similar to the Segment Summaries of f2fs which are stored in a separate area). Consecutive segment summaries are linked together so that, in the event of a crash, all the segment summaries since the most recent super root can be read and the state of the filesystem can be reconstructed.
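
To make this concrete, here is a rough C sketch of the kind of record a segment summary holds; the field names and layout are my own illustration, not the exact on-disk format:

    #include <stdint.h>

    struct segment_summary {
        uint32_t checksum;  /* covers this summary and its payload blocks */
        uint32_t magic;     /* identifies a valid summary block */
        uint64_t seq;       /* sequence number, used to order segments
                               during crash recovery */
        uint64_t create;    /* when this (partial) segment was written */
        uint64_t next;      /* address of the next segment, forming the
                               chain walked after a crash */
        uint32_t nblocks;   /* number of blocks covered by this summary */
        uint32_t nfinfo;    /* number of per-file block lists that follow */
    };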

The segment size can be chosen to be any number of blocks, which themselves must have a size that is a power of two up to the page size of the host system. The default block size is 4KB and the default device segment size is 8MB. Segments can easily be made to line up with erase blocks in a flash device, providing their size is known. While NILFS2 tries to write whole segments at a time, it is not always possible, so a number of consecutive partial segments might be written, each with their own segment summary block.

Being a pure LFS, NILFS2 will never write into the middle of an active segment — as f2fs does when space is tight. It insists on "cleaning" partially used segments (copying live data to a new segment) to make more space available, and does not even keep track of which particular blocks in a segment might be free. If there are no clean segments beyond those reserved for cleaning, the filesystem is considered to be full.

[Cleaning]

Everything is a file!

The statement "Everything is a file" is part of the Unix tradition and, like many such statements, it sounds good without tying you down to meaning very much. Each of "file", "everything" and even "is a" is open to some interpretation. If we understand "file" to be a collection of data and index blocks that provide some linearly addressed storage, "everything" to mean most data and metadata — excepting only the superblock and the segment summaries — and "is a" to mean "is stored inside a", then NILFS2 honors this Unix tradition.

For example, in a more traditional filesystem such as ext3, inodes (of which there is one per file) are stored at fixed locations in the device — usually a number of locations distributed across the address space, but fixed nonetheless. In f2fs, a hybrid approach is used where the addresses of the inodes are stored in fixed locations (in the Node Address Table — NAT), while the inodes themselves appear in the log, wherever is convenient. For NILFS2, the inode table is simply another file, with its own inode which describes the locations of the blocks.

This file (referred to as the ifile) also contains a bitmap allowing unused inodes to be found quickly, and a "block group descriptor" which allows non-empty bitmaps to be found quickly. With the default block size, every 2^25 inodes has 1024 blocks for bitmaps, and one descriptor block, which lists how many bits are set in each of those bitmaps. If you want more inodes than that, a second descriptor block will be automatically allocated.
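
The 2^25 figure falls straight out of the block size; a back-of-the-envelope calculation in C, assuming (as the sizes suggest) one 32-bit set-bit count per bitmap block in the descriptor:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t block_size    = 4096;            /* default block size */
        uint64_t bits_per_map  = block_size * 8;  /* 32768 inodes per bitmap block */
        uint64_t maps_per_desc = block_size / 4;  /* 1024 counts fit in one
                                                     descriptor block */

        /* 32768 * 1024 = 33554432 = 2^25 inodes per descriptor block */
        printf("%llu inodes\n",
               (unsigned long long)(bits_per_map * maps_per_desc));
        return 0;
    }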

[Ifile]

The inodes themselves are a modest 128 bytes in size, and here I must confess to an oversimplification in the article on f2fs. The statement that "Copy-on-write is rather awkward for objects that are smaller than the block size" holds a grain of truth, but isn't really true as it stands. The reality is more subtle.

The advantages of a small inode size are primarily in efficiency. Less space can be wasted, and fewer I/O requests are needed to load the same number of inodes. For a traditional filesystem with pre-allocated inode regions, the space wasted can be a significant issue. However, that does not really apply to an LFS which allocates the space on demand. The speed issue is slightly harder to reason about. Certainly if all the inodes for files in one directory live in one block, then the common task of running "ls -l" will be expedited. However if more information, such as extended attributes or file data for small files, is stored in a big inode, accessing that will require only one block to be read, not two.

The advantage of a block-sized inode — apart from the extra space, which is of uncertain value — is that inodes can be updated independently. OCFS2 (a cluster-based filesystem) uses this to simplify the locking overhead — a cluster node does not need to gain exclusive access to inodes in the same block as the one that it is interested in when it performs an update, because there aren't any. In an LFS, the main issue is reducing cleaning overhead. As we noted in the f2fs article, grouping data with similar life expectancy tends to reduce the expected cost of cleaning, so storing an inode together with the data that it refers to is a good idea. If there are several inodes in the one block, then the life expectancy will be the minimum for all the inodes, and so probably quite different from nearby data. This could impose some (hard to measure) extra cleaning cost.

On the whole, it would seem best for an LFS if the one-inode-per-block model were used, as there is minimal cost of wasted space and real opportunities for benefits. If ways are found to make maximal use of that extra space, possibly following some ideas that Arnd Bergmann recently suggested, then block-sized inodes would be even more convincing.

Small inodes might be seen as a reason not to choose NILFS2, though not a very strong reason. Adjusting NILFS2 to allow full-block inodes would not be a large technical problem, though it is unknown what sort of social problem it might be.

As with most filesystems, NILFS2 also stores each directory in a "file" so there are no surprises there. The surprise is that the format used is extremely simple. NILFS2 directories are nearly identical to ext2 directories, the only important difference being that they store a 64-bit inode number rather than just 32 bits. This means that any lookup requires a linear search through the directory. For directories up to about 400 entries (assuming fairly short names on average), this is no different from f2fs. For very large directories, the search time increases linearly for NILFS2, while it is logarithmic for f2fs. While f2fs is not particularly efficient at this task, NILFS2 clearly hasn't made any effort at effective support for large directories. There appears to be an intention to implement some sort of B-tree based directory structure in the future, but this has not yet happened.
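
For the curious, an entry looks roughly like the sketch below (not the kernel's exact definition), and lookup within a directory block is just a walk along the rec_len chain:

    #include <stdint.h>
    #include <string.h>

    struct dir_entry {
        uint64_t inode;     /* 64 bits, where ext2 has 32 */
        uint16_t rec_len;   /* distance to the next entry */
        uint8_t  name_len;
        uint8_t  file_type;
        char     name[];    /* name_len bytes follow */
    };

    /* Linear search: the only lookup strategy this format allows. */
    static uint64_t dir_lookup(const char *block, unsigned blksize,
                               const char *name, unsigned len)
    {
        unsigned off = 0;

        while (off < blksize) {
            const struct dir_entry *de =
                (const struct dir_entry *)(block + off);
            if (de->rec_len == 0)
                break;          /* corrupt block: stop */
            if (de->inode && de->name_len == len &&
                memcmp(de->name, name, len) == 0)
                return de->inode;
            off += de->rec_len;
        }
        return 0;               /* not found */
    }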

Use the right tree for the job

If everything is a file, then it is clearly important to know what a file is. It starts at the inode which contains space for seven 64-bit addresses. When the file is small (seven blocks or less) these contain pointers to all of the allocated blocks. When the file is larger, this space changes to become the root of a B-tree, with three keys (file addresses), three pointers to other B-tree nodes, and a small header.
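
Viewed as C, those seven slots serve double duty, something like the following (the names are illustrative, not the kernel's):

    #include <stdint.h>

    union inode_index {
        /* small file: up to seven direct block pointers */
        uint64_t direct[7];

        /* larger file: the same 56 bytes become a B-tree root */
        struct {
            uint64_t header;    /* tree level, key count, flags */
            uint64_t keys[3];   /* file block offsets */
            uint64_t ptrs[3];   /* addresses of child B-tree nodes */
        } btree_root;
    };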

The interesting thing about this B-tree is that the leaves do not contain extents describing contiguous ranges of blocks; instead they describe each block individually. This is interesting because it does not fit the typical use-case for a B-tree.

The particular value of a B-tree is that it remains balanced independently of the ordering or spacing of keys being inserted or removed. There is a cost that blocks may occasionally need to be split or merged, but this is more than compensated for by the ability to add an unpredictable sequence of keys. When extents are being stored in a tree, it is not possible to predict how long each extent will be, or when an extent might get broken, so the sequence of keys added will be unpredictable and a B-tree is ideal.

When the keys being added to the index are the offsets of consecutive blocks, then the sequence is entirely predictable and a different tree is likely to be preferred. A radix tree (where the path through the tree is a simple function of the key value) is much more compact than a B-tree (as there is no need to store keys) and much simpler to code. This is the sort of tree chosen for f2fs, the tree used for ext3, and generally the preferred choice when block extents are not being used to condense the index.
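
The coding simplicity is easy to see: in a radix tree, the key itself spells out the path, so a lookup needs no comparisons at all. A minimal sketch, assuming 4KB index blocks holding 512 eight-byte pointers, with read_node() standing in for reading a block from disk:

    #include <stdint.h>

    #define SHIFT 9                        /* log2(4096 / 8) slots per node */
    #define MASK  ((1u << SHIFT) - 1)

    struct node { uint64_t slots[1 << SHIFT]; };

    extern struct node *read_node(uint64_t addr);   /* assumed helper */

    static uint64_t radix_lookup(uint64_t root, uint64_t key, int height)
    {
        uint64_t addr = root;

        while (height-- > 0 && addr != 0) {
            struct node *n = read_node(addr);
            /* the slot is simply the key bits for this level */
            addr = n->slots[(key >> (height * SHIFT)) & MASK];
        }
        return addr;    /* zero means a hole */
    }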

The only case where a B-tree of individual blocks might be more efficient than a radix tree is where the file is very sparse, having just a few allocated blocks spread throughout a large area that is unallocated. Sparse files are simply not common enough among regular files to justify optimizing for them. Nor are many of the special files that NILFS2 uses likely to be sparse. The one exception is the checkpoint file (described later), and optimizing the indexing strategy for that one file is unlikely to have been a motivation.

So we might ask "why?". Why does NILFS2 use a B-tree, or why does it not use extents in its addressing? An early design document [PDF] suggests that B-trees were chosen due to their flexibility, and while it isn't yet clear that the flexibility is worth the cost, future developments might show otherwise. The lack of extent addressing can be explained with a much more concrete answer once we understand one more detail about file indexing.

Another layer of indirection

The headline feature for NILFS2 is "continuous snapshotting". This means that it takes a snapshot of the state of the filesystem "every few seconds". These are initially short-term snapshots (also called "checkpoints"), and can be converted to long-term snapshots, or purged, by a user-space process following a locally configurable policy. This means there are very likely to be lots of active snapshots at any one time.

As has been mentioned, the primary cost of an LFS is cleaning — the gathering of live data from nearly-empty segments to other segments so that more free space can be made available. When there is only one active filesystem, each block moved only requires one index to be updated. However, when there are tens or hundreds of snapshots, each block can be active in a fairly arbitrary sub-sequence of these, so relocating a block could turn into a lot of work in updating indices.

Following the maxim usually attributed to David Wheeler, this is solved by adding a level of indirection. One of the special files that NILFS2 uses is known as the "DAT" or "Disk Address Translation" file. It is primarily an array of 64-bit disk addresses, though the file also contains allocation bitmaps like the ifile, and each entry is actually 256 bits as there is also a record of which checkpoints the block is active in. The addresses in the leaves of the indexing trees for almost all files are not device addresses but are, instead, indexes into this array. The value found contains the actual device address. This allows a block to be relocated by simply moving the block and updating this file. All snapshots will immediately know where the new location is. Doing this with variable length extents would be impractical, which appears to be one of the main reasons that NILFS2 doesn't use them.
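
A DAT entry thus needs the translated address plus the range of checkpoints for which it is valid, which accounts for the 256 bits. Something like this sketch, with my own field names:

    #include <stdint.h>

    struct dat_entry {
        uint64_t blocknr;   /* current physical block address */
        uint64_t start;     /* first checkpoint the block is live in */
        uint64_t end;       /* last checkpoint (or "still live") */
        uint64_t reserved;  /* pads the entry to 256 bits */
    };

    extern struct dat_entry *dat_lookup(uint64_t vblocknr);  /* assumed helper */

    /* Every block lookup pays for one extra translation; the cleaner
     * relocates a block by rewriting blocknr alone, and all snapshots
     * see the new location at once. */
    static uint64_t virtual_to_physical(uint64_t vblocknr)
    {
        return dat_lookup(vblocknr)->blocknr;
    }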

It should be noted that while this DAT file is similar to the NAT used by f2fs, it is different in scale. The f2fs NAT is used only to provide indirection when looking up nodes — inodes, index blocks, etc. — not when looking up data blocks. The DAT file is used for lookup of all blocks. This indirection imposes some cost on every access.

[B-tree]

Estimating that cost is, again, not easy. Given a 4K block size, each block in the DAT file provides indexing for 128 other blocks. This imposes approximately a 1% overhead in storage space and at least a 1% overhead in throughput. If all the DAT entries for a given file are adjacent, the overhead will be just that 1%. If they are very widely spread out, it could be as much as 100% (if each DAT entry is in a different block of the DAT file). Files that are created quickly on a fresh filesystem will tend to the smaller number, files created slowly (like log files) on a well-aged filesystem will likely tend toward a larger number. An average of 3% or 4% probably wouldn't be very surprising, but that is little more than a wild guess.
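
The arithmetic behind the 1% baseline, for the record:

    #include <stdio.h>

    int main(void)
    {
        int block_size = 4096, entry_size = 32;    /* 256-bit DAT entries */
        int per_block  = block_size / entry_size;  /* 128 blocks covered by
                                                      one DAT block */

        /* best case: adjacent DAT entries, one extra read per 128 blocks */
        printf("%d entries/block -> ~%.1f%% overhead\n",
               per_block, 100.0 / per_block);      /* prints ~0.8 */
        return 0;
    }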

Against this cost we must weigh the benefit, which is high frequency snapshots. While I have no experience with this feature, I do have experience with the "undo" feature in text editors. In my younger days I used "ed" and don't recall being upset by the lack of an undo feature. Today I use emacs and use undo all the time — I don't know that I could go back to using an editor without this simple feature. I suspect continual snapshots are similar. I don't miss what I don't have, but I could quickly get used to them.

So: is the availability of filesystem-undo worth a few percent in performance? This is a question I'll have to leave to my readers to sort out. To make it easier to ponder, I'll relieve your curiosity and clarify why not all files use the DAT layer of indirection. The answer is of course that the DAT file itself cannot use indirection as it would then be impossible to find anything. Every other file does use the DAT and every lookup in those files will involve indirection.

Other miscellaneous metadata

NILFS2 has two other metadata files. Both of these are simple tables of data without the allocation bitmaps of the DAT file and the ifile.

The "sufile" records the usage of each segment of storage, counting the number of active blocks in the segment, and remembering when the segment was written. The former is used to allow the segment to be reused when it reaches zero. The latter is used to guide the choice of which segment to clean next. If a segment is very old, there is not much point waiting for more blocks to die of natural attrition. If it is young, it probably is worthwhile to wait a bit longer.

The "cpfile" records checkpoints and every time a checkpoint is created a new record is added to the end of this file. This record stores enough information to reconstruct the state of all files at the time of the checkpoint. In particular, this includes the inode for the ifile. Left to itself, this file would grow without bound. However, in normal operation a user-space program (nilfs_cleanerd) will monitor usage and delete old checkpoints as necessary. This results in the cpfile becoming a sparse file with lots of empty space for most of the length of the file, a dense collection of records at the end, and an arbitrary number of individual blocks sprinkled throughout (for the long-term snapshots). This is the file for which a radix-tree index may not be the optimal indexing scheme. It isn't clear that would matter though.

Pros and cons

So now we must return to the question: from a technical perspective, would NILFS2 make a suitable base for a flash-optimized filesystem? The principal property of flash, that the best writes are sequential writes aligned to the underlying erase block size, is easily met by NILFS2, making it a better contender than filesystems with lots of fixed locations, but can we gain any insight by looking at the details?

One of the several flash-focused features of f2fs is that it has several segments open for writing at a time. This allows data with different life expectancies to be kept separate, and also improves the utilization of those flash devices that allow a degree of parallelism in access. NILFS2 only has a single segment open at a time, as is probably appropriate for rotating media with a high seek cost, and makes no effort to sort blocks based on their life expectancy. Adding these to NILFS2 would be possible, but it is unlikely that it would be straightforward.

Looking at the more generally applicable features of NILFS2, the directory structure doesn't scale, the file indexing is less than optimal, and the addressing indirection imposes a cost of uncertain size. On the whole, there seems to be little to recommend it and a substantial amount of work that would be required to tune it to flash in the way that f2fs has been tuned. It gives the impression of being a one-big-idea filesystem. If you want continual snapshots, this is the filesystem for you. If not, it holds nothing of real interest.

On the other side, f2fs comes across as a one-big-idea filesystem too. It is designed to interface with the FTL (flash translation layer) found in today's flash devices, and provides little else of real interest: it offers no snapshots and does not work at all well with any storage device other than flash. Could this be a sign of a trend towards single-focus filesystems? And if so, is that a good thing, or a bad thing?

So, while NILFS2 could have been used as a starting point for a flash-focused filesystem, it is not at all clear that it would have been a good starting point, and it is hard to challenge the decision to create a new filesystem from scratch. Whether some other filesystem might have made a better start will have to be a question for another day.


Index entries for this article
Kernel: Filesystems/Flash
GuestArticles: Brown, Neil



A NILFS2 score card

Posted Nov 7, 2012 8:28 UTC (Wed) by malcolmt (guest, #65441)

I really enjoy these deep-dive articles. Particularly the compare-and-contrast approach following on from the f2fs article... very instructive.

Thanks for making the time to write it all up, Neil.

A NILFS2 score card

Posted Nov 7, 2012 10:57 UTC (Wed) by etienne (guest, #25256) (10 responses)

IMHO there is another feature that a flash filesystem would need: the 0xFF hole.
When you create a hole in a file (by seeking past the file's end and writing there), you get a hole that reads as 0x00 on standard Unix.
It might be a good idea to have (the concept of) a hole full of 0xFF, so the physical device can accept rewriting (a few non-page-aligned bytes in) that area.
Alternatively, all data of some/every file could be XORed with 0xFF.
But I am not sure how a hole is treated in a log filesystem...

A NILFS2 score card

Posted Nov 7, 2012 14:17 UTC (Wed) by mirabilos (subscriber, #84359) (7 responses)

Why 0xFF ?

A NILFS2 score card

Posted Nov 7, 2012 14:34 UTC (Wed) by mpr22 (subscriber, #60784) (4 responses)

Because the post-erasure state of a Flash cell signifies "all bits set", not "all bits clear".

A NILFS2 score card

Posted Nov 7, 2012 14:43 UTC (Wed) by mirabilos (subscriber, #84359) (3 responses)

Ah, is that so? And for any and all Flash types?
I wonder why the hardware doesn’t flip that then…

In that case, storing data NOT’d might make sense,
though no idea about the performance impact.

A NILFS2 score card

Posted Nov 7, 2012 16:47 UTC (Wed) by cmorgan (guest, #71980) (1 response)

I'm not sure it matters what the erased state is as long as the "erased" value is defined and handled correctly.

Also keep in mind that some flash chips may not support flipping single bits inside of bytes or flipping individual bytes etc and warn against it. I've personally done this and it worked (and we had error correction) but for some reason the manufacturer said specifically not to. Maybe this is a serial flash only issue though. Can't say I've had that much experience with a range of flash parts.

A NILFS2 score card

Posted Nov 8, 2012 16:30 UTC (Thu) by mirabilos (subscriber, #84359)

I was thinking of hardware engineers putting little NOT transistors on the D-bus of the flash chip, before attaching it to the system bus, so that a 0x00 on the system bus would be read as 0xFF by the chip, and vice versa. That would’ve been useful in this case.

But alas, if it isn’t… someone has to deal with it. (And the other comments suggest that the state may not be the same everywhere, so blindly XORing with FFh in the kernel will also not be a solution.)

A NILFS2 score card

Posted Nov 9, 2012 18:09 UTC (Fri) by BenHutchings (subscriber, #37955)

This is true for NOR flash, which normally has a direct mapping of bits to cells and supports byte writes (or maybe even bit writes). It's not true for NAND flash, even if it's SLC: the individual cells are substantially less reliable so it requires ECC and block writes. A block that has been erased most likely won't have valid contents at all.

A NILFS2 score card

Posted Nov 7, 2012 14:41 UTC (Wed) by hmh (subscriber, #3838) (1 response)

Writing to a FLASH-like device means transitioning bits from 1 to 0 or leaving them unchanged. Erasing a FLASH-like device transitions all bits to 1.

So, a page full of 0xff is ready for writing. One full of 0x00 has to be erased to receive any other data. Only, you must erase a full block at a time, and blocks are (much) larger than pages.

A NILFS2 score card

Posted Nov 7, 2012 18:46 UTC (Wed) by zlynx (guest, #2285)

But if the hardware has a FTL you have no real idea of how the actual hardware works. You only have the view through the FTL. Reading an erased block may return 0xFF. It may return 0x00. It might return the data that was in there before the "erase" operation. It might not read from the flash hardware at all because if the FTL has a range map of free/TRIM'd/secure-erased blocks it might check the range and return a purely synthetic block.

A NILFS2 score card

Posted Nov 8, 2012 21:38 UTC (Thu) by Russ.Dill@gmail.com (guest, #52805) (1 response)

While both NOR flash and NAND flash have this property, it isn't nearly as simple as you'd think.

For NAND flash, especially MLC NAND flash, each page is broken up into subpages that can be written individually, but must be written in order, and only a certain number of writes are allowed.

For both NAND and NOR, just because something looks like 0xFF doesn't mean it's actually 0xFF; it could be an incomplete erase, and it may not read as 0xFF the next time.

Also, since we are talking log structured file systems, you don't allocate space for files anyway, you allocate space for writes. In the case of a file hole, you just don't allocate space.

The more I think about this, the less sense it makes, 0x00's don't appear in a file with a hole because 0x00's exist on the disk anyway...

A NILFS2 score card

Posted Nov 9, 2012 9:51 UTC (Fri) by etienne (guest, #25256)

> since we are talking log structured file systems, you don't allocate space for files anyway, you allocate space for writes.

I have never gone into log-structured filesystems, so I do not know how they handle the pattern people use to create contiguous files on standard filesystems (write a byte at the beginning of the file, write a byte at the end of the file of known size, flush the file). Even the best filesystems cannot guess the final size of the file without a hint, and even with delayed allocation, at some point something has to be written to the disk.

> In the case of a file hole, you just don't allocate space.

But holes can appear at a random position, and usually the filesystem will not allocate blocks for the sector-aligned parts of a file hole, so it has to write the default value on the parts which are not aligned.
Then the user usually begins to fill the hole that he has just created, so the flash driver has just zeroed the flash and is forced to erase it again.

A NILFS2 score card

Posted Nov 7, 2012 19:48 UTC (Wed) by GhePeU (subscriber, #56133)

In contrast with f2fs, NILFS2 uses [...] 96 bits for time stamps (nanoseconds forever!)

Nanosecond timestamps were added in F2FS v2: http://thread.gmane.org/gmane.linux.kernel/1380001/

A NILFS2 score card

Posted Nov 7, 2012 23:25 UTC (Wed) by ikm (guest, #493) (1 response)

> would you ever have more than 65535 [..] sub-directories in one directory?

Easily. This looks like a real (rather than a hypothetical) limitation.

A NILFS2 score card

Posted Nov 8, 2012 16:31 UTC (Thu) by mirabilos (subscriber, #84359)

Actually, both people I know and we admins at my employer’s place have run into this issue with some filesystems. Mostly on mailservers and web caches, though.

NILFS2 isn't limited to flash storage... or small storage.

Posted Nov 10, 2012 21:43 UTC (Sat) by wazoox (subscriber, #69624) (1 response)

NILFS2 is perfectly suited to humongous filesystems. I've tested nilfs2 volumes of several tens of terabytes; I have a 5 TB NILFS2 volume on a RAID array with many Debian mirrors and other stuff in constant use for more than 3 years. NILFS2 is stable. NILFS2 is reliable. NILFS2 is great, go and use it.
Yes, the fact that it's missing an fsck is a bit annoying, but it's really fine nonetheless.

NILFS2 isn't limited to flash storage... or small storage.

Posted Nov 15, 2012 11:44 UTC (Thu) by the.wrong.christian (guest, #73127)

> NILFS2 is perfectly suited to humongous filesystems. I've tested nilfs2 volumes of several tens of terabytes; I have a 5 TB NILFS2 volume on a RAID array with many Debian mirrors and other stuff in constant use for more than 3 years. NILFS2 is stable. NILFS2 is reliable. NILFS2 is great, go and use it.

Might I add it is a fantastic FS for low end flash devices, like SD cards and USB thumb drives. These drives are mostly unusable with regular FS like ext4, but perfectly usable with NILFS2. Sure, you have cleaning overhead, but that can be scheduled during quiet times.

The lack of fsck is more of a problem with distribution integration. I find that on startup, a nilfs root fs needs to be remounted manually to start the cleaner, presumably because fsck is missing. Disabling fsck in fstab doesn't seem to help.

A NILFS2 score card

Posted Dec 7, 2012 22:29 UTC (Fri) by CurtBrune (subscriber, #88208)

I echo others' sentiments -- nicely written and informative. Keep them coming.


Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds