Accessing zoned block devices with zonefs
Zoned block devices are quite different from the block devices most people are used to. The concept came from shingled magnetic recording (SMR) devices, which allow much higher storage density, but that extra capacity comes at a price: less flexibility. Zoned devices have regions (zones) that can only be written sequentially; there is no random access for writes to those zones. Linux already supports these devices, and filesystems are adding support as well, but some applications may want a simpler, more straightforward interface; that's what a new filesystem, zonefs, is targeting.
Damien Le Moal posted an RFC patch series for zonefs to the linux-fsdevel mailing list in mid-July. He also spoke about zonefs at the Linux Storage, Filesystem, and Memory-Management Summit (LSFMM) back in May. It is a way for applications to use the POSIX file API, "rather than relying on direct block device file ioctls and read/write". Applications that use log-structured merge-trees (such as RocksDB and LevelDB) will be able to use zoned block devices more easily via zonefs, Le Moal said.
Zoned block devices typically have both conventional zones—those that allow normal random-access reads and writes—and sequential zones, which only allow writing to the end of the zone. Sequential zones each have a write pointer stored by the device that indicates where the next write operation will be done for that zone. Zonefs simply exposes the zones as files in its filesystem.
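The write-pointer bookkeeping can be seen from user space today with the kernel's zoned-device ioctls, the interface that zonefs aims to hide from applications. Here is a minimal sketch using the BLKREPORTZONE ioctl from <linux/blkzoned.h>; the device path is hypothetical and error handling is abbreviated:

    /* Minimal sketch: dump zone descriptors for a zoned block device
     * using the BLKREPORTZONE ioctl. The device path is hypothetical. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/blkzoned.h>

    int main(void)
    {
        int fd = open("/dev/sdb", O_RDONLY);    /* hypothetical zoned device */
        if (fd < 0) { perror("open"); return 1; }

        unsigned int nr = 16;                   /* ask for 16 zones at a time */
        struct blk_zone_report *rep =
            calloc(1, sizeof(*rep) + nr * sizeof(struct blk_zone));
        rep->sector = 0;                        /* start at the first zone */
        rep->nr_zones = nr;

        if (ioctl(fd, BLKREPORTZONE, rep) < 0) { perror("report"); return 1; }

        for (unsigned int i = 0; i < rep->nr_zones; i++) {
            struct blk_zone *z = &rep->zones[i];
            printf("zone %u: start %llu len %llu wp %llu %s\n", i,
                   (unsigned long long)z->start,
                   (unsigned long long)z->len,
                   (unsigned long long)z->wp,
                   z->type == BLK_ZONE_TYPE_CONVENTIONAL ?
                       "conventional" : "sequential");
        }
        free(rep);
        close(fd);
        return 0;
    }

Each reported zone carries its start sector, length, and current write pointer; that is essentially the information zonefs turns into files and file sizes.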
A mounted zonefs will have two top-level directories: cnv for conventional zones and seq for sequential zones. Those directories will contain a fixed set of files that correspond to the zones in the device. By default, those files will be named with consecutive integers representing the order of the zones reported by blkdev_report_zones() when the filesystem is mounted; zones will effectively be numbered based on the order of their starting sector. A mounted filesystem might look something like this:
    mnt/
    |
    |--- cnv/
    |       |--- 0
    |       |--- 1
    |       |--- 2
    |       ...
    |
    |--- seq/
            |--- 0
            |--- 1
            |--- 2
            |--- 3
            ...
The first zone is reserved for a superblock, so it does not appear in the hierarchy. The superblock has just a little bit of metadata: a magic number, a UUID, and some feature flags that were given as part of the filesystem create operation (which is done with mkzonefs). One of the feature flags will cause zonefs to aggregate all of the conventional zones into a single zone; conventional zones tend to be much smaller on these devices, so aggregation may well make sense. A normal Linux filesystem could be created on the aggregated zone, for example. The default file-name scheme can also be changed by a feature flag to have the file names reflect the sector number of the start of the zone instead. The other two flags will set the user and group IDs (root.root by default) or the file permissions (0640 by default).
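For illustration only, the superblock metadata described above could be pictured as a structure along these lines; the actual field names, sizes, and on-disk layout in the zonefs patches may well differ:

    /* Hypothetical sketch of the zonefs superblock contents described
     * above; the real layout in the patches may differ. */
    #include <stdint.h>

    struct zonefs_super_sketch {
        uint32_t magic;         /* filesystem magic number */
        uint8_t  uuid[16];      /* volume UUID */
        uint64_t features;      /* flags set by mkzonefs: aggregate
                                   conventional zones, sector-based
                                   file names, UID/GID, permissions */
        uint32_t uid, gid;      /* default: root.root */
        uint32_t perm;          /* default: 0640 */
    };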
The filesystem is very restrictive; no files or directories can be created on it, for example, nor can files have their owners or permissions changed. The conventional zones cannot be truncated and the sequential zones can only be truncated to zero, which allows them to be completely overwritten. Any read or write beyond the size of the underlying zone will result in an EFBIG ("File too large") error. The reported file size will be the full size of the conventional zone (or zones if they are aggregated); for sequential zones it will be the location of the write pointer.
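Put together, an application's use of a sequential zone file might look like the sketch below, which assumes a zonefs mount at /mnt with the default naming scheme; the actual zonefs code may impose requirements not shown here (such as direct I/O for sequential files):

    /* Sketch: append records to a sequential zone file, then reset it.
     * Assumes a zonefs mount at /mnt with default file naming. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        /* Sequential files only grow at the write pointer, so the
         * file is opened for appending. */
        int fd = open("/mnt/seq/0", O_WRONLY | O_APPEND);
        if (fd < 0) { perror("open"); return 1; }

        const char rec[] = "log record\n";
        if (write(fd, rec, strlen(rec)) < 0)
            perror("write");        /* EFBIG once past the zone size */

        /* Truncation to zero is the only truncation allowed; it
         * resets the zone so it can be rewritten from the start. */
        if (ftruncate(fd, 0) < 0)
            perror("ftruncate");

        close(fd);
        return 0;
    }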
Johannes Thumshirn, who contributed some code to zonefs (as did Christoph Hellwig), wondered if the UID/GID and permissions should be set via mount options, rather than only at filesystem-creation time; a filesystem feature flag could still govern the ability to change those attributes. Le Moal replied that he had implemented that feature along the way, but decided against keeping it.
Thumshirn agreed with Le Moal's thinking but had a different use case in mind. SMR drives could be formatted for zonefs, then handed out to various administrators who could determine the right UID/GID and permissions for their application. This is an area that requires some more thinking, Thumshirn said.
Jeff Moyer expressed concern that zonefs breaks most of the expectations that users have for what a filesystem is. He would rather see some other solution, such as a user-space library (which Le Moal said he had considered) or perhaps a device-mapper target that exposed each zone as a separate block device. Le Moal pointed out that handling each zone as a block device is problematic.
Dave Chinner agreed with that assessment. Le Moal said that he would rather point people at a regular filesystem that has zoned block device support, such as Btrfs, where the feature is in progress, or, eventually, XFS (which is planned), but some application developers want to dispense with most or all of what filesystems provide. The idea is that zonefs provides just enough of a filesystem for those developers: "zonefs fits in the middle ground here between removing the normal file system and going to raw block device".
No strong objections were heard in the thread (or in the LSFMM session, for that matter). It is a bit of a strange filesystem, but it would provide easy access to these zoned block devices from applications. The semantics of a "file" (especially in the seq directory) would be rather different from the usual POSIX semantics, but would be precisely what certain applications need. The next step would seemingly be to bring zonefs to the Linux kernel mailing list and from there, perhaps, into the mainline in a cycle or two.
Index entries for this article:
    Kernel: Block layer/Zoned devices
    Kernel: Filesystems
Posted Jul 23, 2019 23:44 UTC (Tue) by JohnVonNeumann (guest, #131609) (8 responses)
And how does atomicity work with this? The article states that there is no random access for writes to the zones; does this mean that an entire zone (like a block) has to be allocated for a single write, and that further writes to the zone would require zeroing the zone first? Or can a zone be used for multiple operations, provided that the previous blocks aren't touched?
Posted Jul 24, 2019 1:04 UTC (Wed) by epithumia (subscriber, #23370) (3 responses)
And yes, you must write an entire zone at once. But the devices generally have a region which accepts random writes (or are paired with another device which serves that function), so you can potentially accumulate a zone's worth of data over time and then move that to a zone all at once. There is plenty of room for smart filesystems here.
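That accumulate-then-flush pattern is easy to sketch; in the hedged example below, the zone size, paths, and record format are all illustrative:

    /* Sketch: stage writes in a random-write area (here just a memory
     * buffer) until a full zone's worth is ready, then write it to a
     * sequential zone file in one pass. Illustrative only. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define ZONE_SIZE (256UL * 1024 * 1024)   /* hypothetical zone size */

    static char *staging;   /* stand-in for the random-write region */
    static size_t staged;   /* bytes accumulated so far */

    static void append_record(int zone_fd, const void *rec, size_t len)
    {
        if (staged + len > ZONE_SIZE) {
            /* Flush the full buffer as one sequential write. */
            if (write(zone_fd, staging, staged) < 0)
                perror("zone write");
            staged = 0;
        }
        memcpy(staging + staged, rec, len);
        staged += len;
    }

    int main(void)
    {
        staging = malloc(ZONE_SIZE);
        if (!staging) return 1;

        int zone_fd = open("/mnt/seq/0", O_WRONLY | O_APPEND);
        if (zone_fd < 0) { perror("open"); return 1; }

        for (int i = 0; i < 1000; i++) {
            char rec[64];
            int n = snprintf(rec, sizeof(rec), "record %d\n", i);
            append_record(zone_fd, rec, n);
        }

        free(staging);
        close(zone_fd);
        return 0;
    }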
Posted Jul 24, 2019 4:48 UTC (Wed) by smurf (subscriber, #17840) (1 response)
Posted Jul 24, 2019 13:29 UTC (Wed) by willy (subscriber, #9762)
Posted Jul 24, 2019 10:24 UTC (Wed) by james (subscriber, #1325)
Commercial devices based on a zoned drive could have an embedded OS in the conventional zones to keep the bill of materials down.
Posted Jul 24, 2019 10:36 UTC (Wed) by Sesse (subscriber, #53779) (3 responses)
In particular, anything BigTable-like (generically called LSM) will never overwrite existing data on write; they'll just write a correction record (“whatever the address for user X used to be, now it's Y”, or “delete user X”), where the last value wins. When needed, they'll have garbage collection, where they read an entire chunk (or zone, in this case), prune out all the old values, and write a new tree. It fits perfectly well with the append-only nature of these zones, so if you know how the zones are structured, you take nearly no hit from using an SMR disk instead of a conventional disk.
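The correction-record idea is simple to demonstrate. The toy sketch below shows only the last-write-wins principle over an append-only log; it is not RocksDB's or LevelDB's actual format:

    /* Toy last-write-wins log: updates and deletions are appended,
     * never overwritten; readers scan backward for the newest record.
     * Compaction would copy just the live entries to a fresh zone. */
    #include <stdio.h>
    #include <string.h>

    struct record {
        char key[16];
        char value[16];             /* empty value marks a deletion */
    };

    static struct record rlog[64];  /* stand-in for a sequential zone */
    static int nrec;

    static void put(const char *k, const char *v)
    {
        if (nrec >= 64) return;     /* a real store would compact here */
        snprintf(rlog[nrec].key, sizeof(rlog[nrec].key), "%s", k);
        snprintf(rlog[nrec].value, sizeof(rlog[nrec].value), "%s", v);
        nrec++;
    }

    static const char *get(const char *k)
    {
        for (int i = nrec - 1; i >= 0; i--)     /* last write wins */
            if (!strcmp(rlog[i].key, k))
                return rlog[i].value[0] ? rlog[i].value : NULL;
        return NULL;
    }

    int main(void)
    {
        put("user X", "address 1");
        put("user X", "address 2");   /* correction record */
        put("user Y", "address 3");
        put("user Y", "");            /* deletion record */

        const char *x = get("user X"), *y = get("user Y");
        printf("user X -> %s\n", x ? x : "(deleted)");
        printf("user Y -> %s\n", y ? y : "(deleted)");
        return 0;
    }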
This does feel like a transient thing, though. Host-managed SMR disks are not in wide use yet, and as they ramp up, it is into a future where flash keeps getting cheaper and eventually looks like it will overtake rotating media in terms of cost per byte.
Posted Jul 24, 2019 13:28 UTC (Wed) by willy (subscriber, #9762) (2 responses)
One of the larger pieces of cost for an SSD is the RAM used to run the FTL. If you can shrink the FTL by disallowing random writes, you save a lot of money. So we're going to see zoned SSDs too.
Posted Jul 25, 2019 10:52 UTC (Thu) by ptman (subscriber, #57271) (1 response)
Posted Jul 25, 2019 21:04 UTC (Thu) by Sesse (subscriber, #53779)
Usually SLC is for speed and not endurance, by the way.
Posted Jul 24, 2019 14:40 UTC (Wed) by Freeaqingme (subscriber, #103259) (5 responses)
> The idea is that zonefs provides just enough of a filesystem for those developers [...]
In what way would those developers be hindered by using a regular FS?
Posted Jul 24, 2019 14:57 UTC (Wed) by nivedita76 (guest, #121790) (4 responses)
Posted Jul 24, 2019 22:36 UTC (Wed) by epa (subscriber, #39769)
Posted Jul 25, 2019 12:22 UTC (Thu) by Baughn (subscriber, #124425)
We might need a background process running something like a copying GC, which can't be retrofitted without significant work.
Posted Jul 27, 2019 5:33 UTC (Sat) by flussence (guest, #85566) (1 response)
Posted Jul 31, 2019 17:42 UTC (Wed) by anton (subscriber, #25547)
The article refers to SMR HDDs. What I have seen in the HDD market in the last few years is that SMR drives were only a little larger than conventional drives, and looking at the largest drives (16TB) today, all offers are for conventional drives. It seems that SMR is on the way out. NAND flash, on the other hand, only allows erasing big blocks, and despite announcements for years that we are going to get technologies without this restriction, NAND flash seems to be still going strong. However, this property of NAND flash is usually hidden behind a flash translation layer that makes it look like a normal block device.
Posted Jul 26, 2019 2:40 UTC (Fri) by gdt (subscriber, #6284)
Posted Jan 16, 2020 15:13 UTC (Thu) by lzap (guest, #73396)
Another use case is video surveillance, where you're continually recording video and want to keep as much as possible before you need to overwrite it.
Yes, classic log-structured file systems work with segments (i.e., zones) that are written sequentially. To reclaim space, the live data from some segments is copied to a new segment. COW file systems (e.g., Btrfs) generally use an organization similar to that of log-structured file systems, except for free-space management: COW file systems don't garbage-collect, but keep track of free blocks, and are therefore not so great on zoned devices.