Accessing zoned block devices with zonefs
Zoned block devices are quite different from the block devices most people are used to. The concept came from shingled magnetic recording (SMR) devices, which allow much higher storage density, but that extra capacity comes at a price: less flexibility. Zoned devices have regions (zones) that can only be written sequentially; there is no random access for writes to those zones. Linux already supports these devices, and filesystems are adding support as well, but some applications may want a simpler, more straightforward interface; that's what a new filesystem, zonefs, is targeting.
Damien Le Moal posted an RFC patch series for zonefs to the linux-fsdevel mailing list in mid-July. He also spoke about zonefs at the Linux Storage, Filesystem, and Memory-Management Summit (LSFMM) back in May. It is a way for applications to use the POSIX file API, "rather than relying on direct block device file ioctls and read/write". Applications that use log-structured merge-trees (such as RocksDB and LevelDB) will be able to use zoned block devices more easily via zonefs, Le Moal said.
Zoned block devices typically have both conventional zones—those that allow normal random-access reads and writes—and sequential zones, which only allow writing to the end of the zone. Sequential zones each have a write pointer stored by the device that indicates where the next write operation will be done for that zone. Zonefs simply exposes the zones as files in its filesystem.
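The write-pointer bookkeeping can be seen from user space today with the kernel's zoned-device ioctls, the interface that zonefs aims to hide from applications. Here is a minimal sketch using the BLKREPORTZONE ioctl from <linux/blkzoned.h>; the device path is hypothetical and error handling is abbreviated:

    /* Minimal sketch: dump zone descriptors for a zoned block device
     * using the BLKREPORTZONE ioctl. The device path is hypothetical. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/blkzoned.h>

    int main(void)
    {
        int fd = open("/dev/sdb", O_RDONLY);    /* hypothetical zoned device */
        if (fd < 0) { perror("open"); return 1; }

        unsigned int nr = 16;                   /* ask for 16 zones at a time */
        struct blk_zone_report *rep =
            calloc(1, sizeof(*rep) + nr * sizeof(struct blk_zone));
        rep->sector = 0;                        /* start at the first zone */
        rep->nr_zones = nr;

        if (ioctl(fd, BLKREPORTZONE, rep) < 0) { perror("report"); return 1; }

        for (unsigned int i = 0; i < rep->nr_zones; i++) {
            struct blk_zone *z = &rep->zones[i];
            printf("zone %u: start %llu len %llu wp %llu %s\n", i,
                   (unsigned long long)z->start,
                   (unsigned long long)z->len,
                   (unsigned long long)z->wp,
                   z->type == BLK_ZONE_TYPE_CONVENTIONAL ?
                       "conventional" : "sequential");
        }
        free(rep);
        close(fd);
        return 0;
    }

Each reported zone carries its start sector, length, and current write pointer; that is essentially the information zonefs turns into files and file sizes.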
A mounted zonefs will have two top-level directories: cnv for conventional zones and seq for sequential zones. Those directories will contain a fixed set of files that correspond to the zones in the device. By default, those files will be named with consecutive integers representing the order of the zones reported by blkdev_report_zones() when the filesystem is mounted; zones will effectively be numbered based on the order of their starting sector. A mounted filesystem might look something like this:
    mnt/
    |
    |--- cnv/
    |       |--- 0
    |       |--- 1
    |       |--- 2
    |       ...
    |
    |--- seq/
            |--- 0
            |--- 1
            |--- 2
            |--- 3
            ...
The first zone is reserved for a superblock, so it does not appear in the hierarchy. The superblock has just a little bit of metadata: a magic number, a UUID, and some feature flags that were given as part of the filesystem create operation (which is done with mkzonefs). One of the feature flags will cause zonefs to aggregate all of the conventional zones into a single zone; conventional zones tend to be much smaller on these devices, so aggregation may well make sense. A normal Linux filesystem could be created on the aggregated zone, for example. The default file-name scheme can also be changed by a feature flag to have the file names reflect the sector number of the start of the zone instead. The other two flags will set the user and group IDs (root.root by default) or the file permissions (0640 by default).
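For illustration only, the superblock metadata described above could be pictured as a structure along these lines; the actual field names, sizes, and on-disk layout in the zonefs patches may well differ:

    /* Hypothetical sketch of the zonefs superblock contents described
     * above; the real layout in the patches may differ. */
    #include <stdint.h>

    struct zonefs_super_sketch {
        uint32_t magic;         /* filesystem magic number */
        uint8_t  uuid[16];      /* volume UUID */
        uint64_t features;      /* flags set by mkzonefs: aggregate
                                   conventional zones, sector-based
                                   file names, UID/GID, permissions */
        uint32_t uid, gid;      /* default: root.root */
        uint32_t perm;          /* default: 0640 */
    };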
The filesystem is very restrictive; no files or directories can be created on it, for example, nor can files have their owners or permissions changed. The conventional zones cannot be truncated and the sequential zones can only be truncated to zero, which allows them to be completely overwritten. Any read or write beyond the size of the underlying zone will result in an EFBIG ("File too large") error. The reported file size will be the full size of the conventional zone (or zones if they are aggregated); for sequential zones it will be the location of the write pointer.
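Put together, an application's use of a sequential zone file might look like the sketch below, which assumes a zonefs mount at /mnt with the default naming scheme; the actual zonefs code may impose requirements not shown here (such as direct I/O for sequential files):

    /* Sketch: append records to a sequential zone file, then reset it.
     * Assumes a zonefs mount at /mnt with default file naming. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        /* Sequential files only grow at the write pointer, so the
         * file is opened for appending. */
        int fd = open("/mnt/seq/0", O_WRONLY | O_APPEND);
        if (fd < 0) { perror("open"); return 1; }

        const char rec[] = "log record\n";
        if (write(fd, rec, strlen(rec)) < 0)
            perror("write");        /* EFBIG once past the zone size */

        /* Truncation to zero is the only truncation allowed; it
         * resets the zone so it can be rewritten from the start. */
        if (ftruncate(fd, 0) < 0)
            perror("ftruncate");

        close(fd);
        return 0;
    }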
Johannes Thumshirn, who contributed some code to zonefs (as did Christoph Hellwig), wondered if the UID/GID and permissions should be set via mount options, rather than only at filesystem-creation time; a filesystem feature flag could still govern the ability to change those attributes. Le Moal replied that he had implemented that feature along the way, but decided against keeping it.
Thumshirn agreed with Le Moal's thinking but had a different use case in mind. SMR drives could be formatted for zonefs, then handed out to various administrators who could determine the right UID/GID and permissions for their application. This is an area that requires some more thinking, Thumshirn said.
Jeff Moyer expressed concern that zonefs breaks most of the expectations that users have for what a filesystem is. He would rather see some other solution, such as a user-space library (which Le Moal said he had considered) or perhaps a device-mapper target that exposed each zone as a separate block device. Le Moal pointed out that handling each zone as a block device is problematic.
Dave Chinner agreed with that assessment. Le Moal said that he would rather point people at a regular filesystem that has zoned block device support, such as Btrfs, where the feature is in progress, or, eventually, XFS (which is planned), but some application developers want to dispense with most or all of what filesystems provide. The idea is that zonefs provides just enough of a filesystem for those developers: "zonefs fits in the middle ground here between removing the normal file system and going to raw block device".
No strong objections were heard in the thread (or in the LSFMM session, for that matter). It is a bit of a strange filesystem, but it would provide easy access to these zoned block devices from applications. The semantics of a "file" (especially in the seq directory) would be rather different from the usual POSIX semantics, but would be precisely what certain applications need. The next step would seemingly be to bring zonefs to the Linux kernel mailing list and from there, perhaps, into the mainline in a cycle or two.
Index entries for this article:
    Kernel: Block layer/Zoned devices
    Kernel: Filesystems
Posted Jul 23, 2019 23:44 UTC (Tue) by JohnVonNeumann (guest, #131609) (8 responses)
And how does atomicity work with this? The article states that there is no random access for writes to the zones; does this mean that an entire zone (like a block) has to be allocated for a single write, and that further writes to the zone would require zeroing the zone first? Or can a zone be used for multiple operations, provided that the previous blocks aren't touched?
Posted Jul 24, 2019 1:04 UTC (Wed) by epithumia (subscriber, #23370) (3 responses)
And yes, you must write an entire zone at once. But the devices generally have a region which accepts random writes (or are paired with another device which serves that function), so you can potentially accumulate a zone's worth of data over time and then move that to a zone all at once. There is plenty of room for smart filesystems here.
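That accumulate-then-flush pattern is easy to sketch; in the hedged example below, the zone size, paths, and record format are all illustrative:

    /* Sketch: stage writes in a random-write area (here just a memory
     * buffer) until a full zone's worth is ready, then write it to a
     * sequential zone file in one pass. Illustrative only. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define ZONE_SIZE (256UL * 1024 * 1024)   /* hypothetical zone size */

    static char *staging;   /* stand-in for the random-write region */
    static size_t staged;   /* bytes accumulated so far */

    static void append_record(int zone_fd, const void *rec, size_t len)
    {
        if (staged + len > ZONE_SIZE) {
            /* Flush the full buffer as one sequential write. */
            if (write(zone_fd, staging, staged) < 0)
                perror("zone write");
            staged = 0;
        }
        memcpy(staging + staged, rec, len);
        staged += len;
    }

    int main(void)
    {
        staging = malloc(ZONE_SIZE);
        if (!staging) return 1;

        int zone_fd = open("/mnt/seq/0", O_WRONLY | O_APPEND);
        if (zone_fd < 0) { perror("open"); return 1; }

        for (int i = 0; i < 1000; i++) {
            char rec[64];
            int n = snprintf(rec, sizeof(rec), "record %d\n", i);
            append_record(zone_fd, rec, n);
        }

        free(staging);
        close(zone_fd);
        return 0;
    }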
Posted Jul 24, 2019 4:48 UTC (Wed) by smurf (subscriber, #17840) (1 response)
Posted Jul 24, 2019 13:29 UTC (Wed) by willy (subscriber, #9762)
Posted Jul 24, 2019 10:24 UTC (Wed) by james (subscriber, #1325)
Commercial devices based on a zoned drive could have an embedded OS in the conventional zones to keep the bill of materials down.
Posted Jul 24, 2019 10:36 UTC (Wed) by Sesse (subscriber, #53779) (3 responses)
In particular, anything BigTable-like (generically called LSM) will never overwrite existing data on write; they'll just write a correction record (“whatever the address for user X used to be, now it's Y”, or “delete user X”), where the last value wins. When needed, they'll have garbage collection, where they read an entire chunk (or zone, in this case), prune out all the old values, and write a new tree. It fits perfectly well with the append-only nature of these zones, so if you know how the zones are structured, you take nearly no hit from using an SMR disk instead of a conventional disk.
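The correction-record idea is simple to demonstrate. The toy sketch below shows only the last-write-wins principle over an append-only log; it is not RocksDB's or LevelDB's actual format:

    /* Toy last-write-wins log: updates and deletions are appended,
     * never overwritten; readers scan backward for the newest record.
     * Compaction would copy just the live entries to a fresh zone. */
    #include <stdio.h>
    #include <string.h>

    struct record {
        char key[16];
        char value[16];             /* empty value marks a deletion */
    };

    static struct record rlog[64];  /* stand-in for a sequential zone */
    static int nrec;

    static void put(const char *k, const char *v)
    {
        if (nrec >= 64) return;     /* a real store would compact here */
        snprintf(rlog[nrec].key, sizeof(rlog[nrec].key), "%s", k);
        snprintf(rlog[nrec].value, sizeof(rlog[nrec].value), "%s", v);
        nrec++;
    }

    static const char *get(const char *k)
    {
        for (int i = nrec - 1; i >= 0; i--)     /* last write wins */
            if (!strcmp(rlog[i].key, k))
                return rlog[i].value[0] ? rlog[i].value : NULL;
        return NULL;
    }

    int main(void)
    {
        put("user X", "address 1");
        put("user X", "address 2");   /* correction record */
        put("user Y", "address 3");
        put("user Y", "");            /* deletion record */

        const char *x = get("user X"), *y = get("user Y");
        printf("user X -> %s\n", x ? x : "(deleted)");
        printf("user Y -> %s\n", y ? y : "(deleted)");
        return 0;
    }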
This does feel like a transient thing, though. Host-managed SMR disks are not in wide use yet, and as they ramp up, it is into a future where flash keeps getting cheaper and eventually looks like it will overtake rotating media in terms of cost per byte.
Posted Jul 24, 2019 13:28 UTC (Wed) by willy (subscriber, #9762) (2 responses)
One of the larger pieces of cost for an SSD is the RAM used to run the FTL. If you can shrink the FTL by disallowing random writes, you save a lot of money. So we're going to see zoned SSDs too.
Posted Jul 25, 2019 10:52 UTC (Thu) by ptman (subscriber, #57271) (1 response)
Posted Jul 25, 2019 21:04 UTC (Thu) by Sesse (subscriber, #53779)
Usually SLC is for speed and not endurance, by the way.
Posted Jul 24, 2019 14:40 UTC (Wed) by Freeaqingme (subscriber, #103259) (5 responses)
> The idea is that zonefs provides just enough of a filesystem for those developers [...]
In what way would those developers be hindered by using a regular FS?
Posted Jul 24, 2019 14:57 UTC (Wed) by nivedita76 (guest, #121790) (4 responses)
Posted Jul 24, 2019 22:36 UTC (Wed) by epa (subscriber, #39769)
Posted Jul 25, 2019 12:22 UTC (Thu) by Baughn (subscriber, #124425)
We might need a background process running something like a copying GC, which can't be retrofitted without significant work.
Posted Jul 27, 2019 5:33 UTC (Sat) by flussence (guest, #85566) (1 response)
Posted Jul 31, 2019 17:42 UTC (Wed) by anton (subscriber, #25547)
The article refers to SMR HDDs. What I have seen in the HDD market in the last few years is that SMR drives were only a little larger than conventional drives, and looking at the largest drives (16TB) today, all offers are for conventional drives. It seems that SMR is on the way out. NAND flash, on the other hand, only allows erasing big blocks, and despite announcements for years that we are going to get technologies without this restriction, NAND flash seems to be still going strong. However, this property of NAND flash is usually hidden behind a flash translation layer that makes it look like a normal block device.
Posted Jul 26, 2019 2:40 UTC (Fri) by gdt (subscriber, #6284)
Posted Jan 16, 2020 15:13 UTC (Thu) by lzap (guest, #73396)
Another use case is video surveillance, where you're continually recording video and want to keep as much as possible before you need to overwrite it.
Yes, classic log-structured file systems work with segments (i.e., zones) that are written sequentially. To reclaim space, the live data from some segments is copied to a new segment. COW file systems (e.g., Btrfs) generally use an organization similar to that of log-structured file systems, except for free-space management: COW file systems don't garbage-collect, but keep track of free blocks, and are therefore not so great on zoned devices.