
Hi-ZNS: High Space Efficiency and Zero-Copy LSM-Tree Based Stores on ZNS SSDs

Renping Liu, Chongqing University of Posts and Telecommunications, China, liurp@cqupt.edu.cn
Junhua Chen, Chongqing University of Posts and Telecommunications, China, cheniujh@outlook.com
Peng Chen, Chongqing University of Posts and Telecommunications, China, chenpeng@cqupt.edu.cn
Linbo Long, Chongqing University of Posts and Telecommunications, China, longlb@cqupt.edu.cn
Anping Xiong, Chongqing University of Posts and Telecommunications, China, xiongap@cqupt.edu.cn
Duo Liu, Chongqing University, China, liuduo@cqu.edu.cn

The Zoned Namespace (ZNS) SSD is a newly introduced storage device that provides several new ZNS commands to upper-level applications. The zone-reset command is one of these commands and erases all the flash blocks within a zone. Since data is grouped and erased in zone units, ZNS SSDs are widely used in LSM-tree-based stores. However, the basic invalidation unit of an LSM-tree is an SST/WAL file, which mismatches the erase unit of a ZNS SSD. Because different SST/WAL files are placed in the same zone, LSM-trees on ZNS SSDs face dramatic space amplification and extensive data migration problems.

To solve these problems, Hi-ZNS aligns zones with varying-size SST/WAL files without changing the ZNS specification. The basic idea of Hi-ZNS is to place each SST/WAL file in its own zone. When an SST/WAL file is invalidated, upper-level applications with Hi-ZNS can simply send the zone-reset command to the ZNS SSD device, incurring no space amplification and no data migration. Specifically, Hi-ZNS allocates physical resources for a zone on demand and provides an infinite logical zone number to avoid wasting storage resources. Extensive evaluation demonstrates that Hi-ZNS substantially improves space efficiency and completely eliminates data migration. Compared with the existing ZenFS&RocksDB, Hi-ZNS increases the maximum number of completed requests by up to 2.97x and improves I/O performance by up to 7%.

Keywords: ZNS SSD, LSM-tree, Space Efficiency, Data Migration

ACM Reference Format:
Renping Liu, Junhua Chen, Peng Chen, Linbo Long, Anping Xiong, and Duo Liu. 2024. Hi-ZNS: High Space Efficiency and Zero-Copy LSM-Tree Based Stores on ZNS SSDs. In The 53rd International Conference on Parallel Processing (ICPP '24), August 12--15, 2024, Gotland, Sweden. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3673038.3673096

1 INTRODUCTION

ZNS SSDs divide the logical address space into fixed-size zones, and each zone must be written sequentially for flash-memory-friendly access [7] [17] [8] [19]. This coarse-grained management reduces the required over-provisioning capacity and on-board DRAM and avoids in-device garbage collection [2]. Currently, ZNS SSDs are considered LSM-tree-friendly storage devices [2]. However, it has been observed that placing SST/WAL files with different lifetimes in one zone leaves LSM-tree-based stores on ZNS SSDs facing dramatic space amplification and extensive data migration problems. Therefore, space management within a ZNS SSD device should be carefully considered.

LSM-tree-based stores, such as RocksDB, provide lifetime hints for SST/WAL files so that data with the same lifetime can be grouped in a zone. Unfortunately, SST/WAL files with the same lifetime hint do not always have closely aligned lifetimes, which is the primary cause of space amplification and data migration. Therefore, several studies [16] [13] [4] analyze the features of the LSM-tree in depth and combine them with ZNS characteristics to better ensure that data with similar lifetimes is grouped in the same zone. ZoneKV [16] proposes a lifetime-based zone storage model and a level-specific zone allocation algorithm to store SSTs with similar lifetimes in the same zone. CAZA [12] leverages the regularity of LSM-tree compaction strategies to predict the lifetimes of SST files and introduces a compaction-aware zone allocation strategy. LL-Compaction [11] alters the compaction strategy to minimize the presence of long-lived SST files and thereby accelerate the triggering of zone resets.

Figure 1
Figure 1: ZNS SSD and LSM-tree Background

Although these methods deliver commendable performance improvements, they are constrained by a fundamental limitation: the basic invalidation unit of an LSM-tree is an SST/WAL file, while the erase unit of a ZNS SSD is a zone. Therefore, placing multiple SST/WAL files in a single zone inevitably postpones zone resets and can even result in data migration. Bae et al. [1] advocate configuring zones to be as small as possible, called small zones, to address the long delay of zone resets. SplitZNS [9] introduces small zones by tweaking the zone-to-chip mapping to maximize garbage collection efficiency for LSM-trees on ZNS SSDs. However, small zones still fail to completely eliminate data migration and sacrifice the parallelism of a ZNS SSD device, which degrades I/O performance.

In this paper, we propose Hi-ZNS, a high space efficiency and zero-copy mechanism for LSM-trees on ZNS SSDs that reduces space amplification and completely eliminates data migration. The basic idea of Hi-ZNS is to place only one SST/WAL file in a zone. Specifically, Hi-ZNS contains two key techniques. (1) Construct a zone at runtime: based on the varying-size SST/WAL files, Hi-ZNS allocates a different number of flash blocks to each zone to hold its SST/WAL file. Note that, following the ZNS specification, Hi-ZNS keeps the logical zone size fixed and only adjusts the number of flash blocks allocated to a zone. (2) Provide an infinite logical zone space: since the logical zone size is not equal to the allocated flash resources in our design, Hi-ZNS keeps track of the available flash resources in the ZNS SSD device. As long as resources remain, Hi-ZNS can allocate a logical zone (by increasing the zone number) to fully utilize the flash resources. As a result, when an SST/WAL file is invalidated, upper-level applications with Hi-ZNS can simply send the zone-reset command to the ZNS SSD device. Extensive experimental results show that Hi-ZNS effectively improves space efficiency and I/O throughput, and completely eliminates data migration.

In summary, this paper makes the following contributions:

  • We identify the space amplification and data migration problems on ZNS SSDs, and point out that limiting the SST/WAL file size at the application level (RocksDB) does not work.
  • We propose Hi-ZNS, a high space efficiency and zero-copy mechanism for LSM-tree-based stores on ZNS SSDs, featuring on-demand resource allocation and an infinite logical zone number.
  • We implement and evaluate Hi-ZNS to demonstrate that it outperforms the existing method in space utilization, I/O throughput, and data migration.

The remainder of this paper is organized as follows. Section 2 and Section 3 present the background and the motivation of this paper, respectively. In Section 4, we describe the details of Hi-ZNS. Section 5 presents the experiments and evaluations. Finally, we discuss related work in Section 6 and conclude this paper in Section 7.

2 BACKGROUND

In this section, we first introduce the internal organization of ZNS SSDs. We then give the background on LSM-tree-based stores. Finally, we describe the LSM-tree on ZNS SSDs.

Figure 2
Figure 2: Space utilization and data migration under six workloads

2.1 ZNS SSDs

The Zoned Namespace (ZNS) interface is a novel storage interface tailored to enhance the performance and lifespan of flash-based SSD devices by organizing data into zones instead of traditional blocks. The ZNS interface provides several new commands to upper-level applications for managing zones and accessing data within ZNS SSDs, including management commands and data transfer commands. As illustrated in Figure 1(a), the zone-open command permits an application to explicitly open a zone and indicates to the device that the resources necessary for writing the zone should remain available. The zone-close command transitions an open zone to the closed state, releasing the resources held for writing it. The zone-finish command allows an application to move a zone's write pointer to the end of the zone, preventing any further write operations to the zone until it is reset. The zone-reset command is provided to host software to erase all the data in a zone.

For data transfer commands, the write command writes data to the currently open zone at its write pointer. The append command is designed to improve write performance by allowing a host to submit several zone write operations simultaneously and let the device process them in any order. The read command reads data from a specific zone in random order. The host interface logic divides data transfer commands into one or more transactions at flash-page granularity and inserts the transactions into the device-level queue.
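
For reference, the sketch below shows how a host can issue the zone management commands described above through the Linux kernel's generic zoned-block-device ioctls in <linux/blkzoned.h>, which map onto the corresponding ZNS commands. This is a minimal illustration, not part of Hi-ZNS; the device path and the 512MB zone size are assumed example values.

```cpp
// Minimal sketch: issuing zone-management commands to a Linux zoned block
// device via the kernel ioctls in <linux/blkzoned.h>.
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/blkzoned.h>
#include <cstdio>

int main() {
    int fd = open("/dev/nvme0n2", O_RDWR);        // hypothetical ZNS block device
    if (fd < 0) { perror("open"); return 1; }

    struct blk_zone_range zr;
    zr.sector = 0;                                // start sector of zone #0
    zr.nr_sectors = (512ULL << 20) / 512;         // zone size in 512B sectors

    ioctl(fd, BLKOPENZONE, &zr);    // zone-open: keep write resources available
    // ... sequential writes at the zone's write pointer would happen here ...
    ioctl(fd, BLKFINISHZONE, &zr);  // zone-finish: move the write pointer to the end
    ioctl(fd, BLKRESETZONE, &zr);   // zone-reset: erase all blocks in the zone

    close(fd);
    return 0;
}
```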

Compared with traditional block-interface-based SSDs, the ZNS interface divides the logical address space into several fixed-size zones, as shown in Figure 1(a). Each zone is equipped with a write pointer that tracks the position of the next write, and sequential writes are enforced within the zone [18] [17] [3] [20]. The zone capacity denotes the number of writable logical blocks within a zone, which is always smaller than or equal to the zone size.

Figure 1(a) also shows the physical view of a zone from the flash media. Generally, each ZNS SSD device contains several flash chips, with each chip comprising multiple flash blocks. Each block consists of multiple pages; the page is the basic unit for storing and fetching data [10] [5] [22]. In ZNS SSDs, a zone consists of multiple flash blocks that span multiple flash chips to take advantage of chip-level parallelism. For example, as shown in Figure 1(a), zone #0 consists of flash blocks (block offsets 0 to 4) from all flash chips, and zone #1 consists of flash blocks (block offsets 5 to 9) from all flash chips. Currently, the ZNS specification places no restriction on how flash blocks are used to construct a zone, which opens up design space for different configurations [7] [14]. In this paper, Hi-ZNS utilizes a novel method to construct a zone inside the device to improve the space efficiency of ZNS SSDs.

Figure 3
Figure 3: Actual sizes of SST/WAL files. We expected limiting the SST/WAL file size to match the zone size, but this fails.

2.2 LSM-tree based Stores

A log-structured merge tree (LSM-tree) is a data structure designed to efficiently store key-value pairs for retrieval in disk- or flash-based storage systems. The LSM-tree optimizes both read and write operations through a blend of in-memory and on-disk structures. As shown in Figure 1(b), incoming data is initially stored in an in-memory structure known as a memtable, which serves as a temporary sorting area. When the memtable reaches capacity, its contents are flushed to disk as batches of sorted data structures referred to as Sorted String Tables (SSTables/SSTs). SSTs form the core of the LSM-tree, organizing data in sorted order to facilitate efficient queries and range scans. Each SST file represents a snapshot of data at a specific moment in time. To optimize data management, SST files are arranged into levels (L0, L1, L2, ..., Ln). Lower levels contain more recent data, while higher levels house compacted data. This hierarchical approach balances read and write performance, averting potential bottlenecks.

There are numerous well-known examples of the LSM-tree, including Apache Cassandra, RocksDB, LevelDB, HBase, Couchbase, and more. We choose RocksDB as the experimental subject to investigate the performance of LSM-tree on ZNS SSDs. RocksDB [6] is a high-performance, persistent key-value storage engine optimized for the specific traits of flash media, catering to large-scale (distributed) applications. In this paper, Hi-ZNS mainly focuses on enhancing performance during flushing SST files to ZNS SSDs.

2.3 LSM-tree on ZNS SSDs

RocksDB is a good fit for ZNS SSDs [9] [23] [15]. ZenFS [2] is a file system plugin that utilizes RocksDB's filesystem interface to place files into zones on a raw zoned block device. As shown in Figure 1(b), ZenFS builds a bridge between LSM-tree-based stores and ZNS SSDs. By employing lifetime hints to co-locate SST files with similar lifetimes in the same zone, ZenFS significantly mitigates system write amplification compared to conventional block-interface-based SSDs [2]. Furthermore, ZenFS eliminates background garbage collection within the file system and on the ZNS SSD device, thereby enhancing throughput, tail latencies, and device endurance. In this paper, Hi-ZNS extensively investigates the zone allocation strategy of ZenFS and integrates it with the characteristics of ZNS SSDs to enhance space utilization and minimize data migration for LSM-tree-based stores (such as RocksDB) on ZNS SSDs.

3 MOTIVATION

Figure 4
Figure 4: The problem: data migration and space amplification.

3.1 The Problem: Low Space Efficiency and Extensive Data Migration

The ZNS interface provides a new zone-reset command for upper-level applications to erase all the flash blocks within a zone, which reclaims the zone space for writing new data. That is to say, compared with traditional block-interface SSDs, the erase unit of ZNS SSDs is a zone, not a flash block. Therefore, ZNS SSDs aim to group data with similar lifetimes into a single zone. Once all the data in a zone is invalid, the zone can be reset directly without the need to move and rearrange data.

Figure 4(a) shows the expected situation, in which all data within a zone is invalid. In reality, however, invalid data and valid data are always mixed together in a zone, as shown in Figure 4(b), which prevents the zone from being reset directly. As invalid data increases and free space decreases, upper applications start garbage collection (GC) to reclaim zone space for writing the next SST/WAL files. Figure 4(c) shows the GC process, which migrates valid data (L1) from zone #1 to zone #n and then resets zone #1; the next SST/WAL files can then be written to zone #1. Furthermore, when a zone contains a large amount of valid data (as shown in Figure 4(d)), data migration causes huge overhead. If invalid data resides in a zone for a long time, the occupied space cannot be reused until the zone is reset, which leads to space amplification.
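
To make the GC cost concrete, the following pseudocode-style sketch captures the migrate-then-reset flow of Figure 4(c): valid extents are relocated to another zone before the victim can be erased. The types and helper functions are illustrative stand-ins, not ZenFS's actual API.

```cpp
// Sketch of the migrate-then-reset GC flow in Figure 4(c).
#include <cstdint>
#include <vector>

struct Extent { uint64_t offset; uint64_t length; bool valid; };
struct Zone   { int id; std::vector<Extent> extents; };

// Stand-ins for real zoned-device I/O (no-ops in this sketch).
static void copy_extent(const Zone&, Zone&, const Extent&) { /* read + rewrite */ }
static void reset_zone(Zone&)                              { /* issue zone-reset */ }

// Returns the number of bytes migrated, i.e., the GC overhead Hi-ZNS avoids.
uint64_t garbage_collect(Zone& victim, Zone& target) {
    uint64_t migrated = 0;
    for (const Extent& e : victim.extents) {
        if (e.valid) {                       // valid data must be relocated first
            copy_extent(victim, target, e);
            migrated += e.length;
        }
    }
    reset_zone(victim);                      // only now can the whole zone be erased
    victim.extents.clear();
    return migrated;
}
```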

3.2 Quantifying Space Amplification and Data Migration

To illustrate the data migration and space amplification of LSM-trees on ZNS SSDs, we evaluate six different workloads, as shown in Figure 2. The detailed configuration of the ZNS SSD and the experimental method are described in Section 5. In Figure 2, ValidData means the total volume of valid data in all zones, InvalidData means the total volume of invalid data that still resides in all zones, and Data Migration means the total volume of data migrated across all GC processes. Because multiple SST or WAL files are placed within the same zone, these files cannot be invalidated simultaneously; garbage data therefore occupies zone space and lowers space efficiency. Furthermore, the accumulation of garbage data results in extensive data migration during ZenFS garbage collection, which reduces the I/O throughput of RocksDB.

Although RocksDB provides the lifetime hints of SST/WAL files to ZenFS for grouping data with the same lifetime in a zone, ZenFS still faces the following problems in practice. (1) Coarse-grained lifetime hints separate data inaccurately. For instance, at high LSM-tree levels (above L3), SST files share the same lifetime hint but do not have the same lifetime. Mixing data with different lifetimes within a zone postpones the zone reset, leading to low space efficiency and extensive data migration during the compaction process. (2) ZenFS employs a lifetime-hint-based strategy that permits a zone to store data with a lower lifetime-hint value instead of strictly requiring the lifetime hints to be equal. Write-ahead-log (WAL) files, which are considered the hottest data and carry a low lifetime-hint value, are therefore frequently mixed with SST files in the same zone, further exacerbating the space amplification and data migration problems.

3.3 Limiting SST File Size Does Not Work

An effective approach for maintaining a consistent data lifetime within a zone is to place only one SST/WAL file per zone while ensuring that the sizes of these files match the zone size, maximizing space utilization. RocksDB offers two parameters to manage the SST/WAL file size: "TargetFileSizeBase" limits the base file size, while "TargetFileSizeMultiple" scales the file size across levels.

We conducted experiments with these two parameters, using the detailed settings in Table 1. With a zone size of 512MB and an expected SST/WAL file size of 512MB (TargetFileSizeBase=512MB, TargetFileSizeMultiple=1), the results unfortunately indicate that only a fraction of the created SST and WAL files adhere to the intended size, with numerous SST files deviating significantly from the desired 512MB, as illustrated in Figure 3.
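
For reference, a minimal sketch of how these two parameters map onto the RocksDB C++ API, where they are exposed as target_file_size_base and target_file_size_multiplier; the database path is illustrative and this is only an assumed restatement of the Table 1 setting, not the paper's benchmark harness.

```cpp
// Sketch: requesting ~512MB SST files via rocksdb::Options.
#include <rocksdb/db.h>
#include <rocksdb/options.h>

int main() {
    rocksdb::Options options;
    options.create_if_missing = true;
    options.target_file_size_base = 512ull << 20;  // target ~512MB SST files
    options.target_file_size_multiplier = 1;       // same target at every level

    rocksdb::DB* db = nullptr;
    rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/hi-zns-demo", &db);
    if (!s.ok()) return 1;
    delete db;
    return 0;
}
```

Even with such a configuration, the file sizes observed in Figure 3 deviate widely from the 512MB target.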

Therefore, we find that the zone size is always much larger than the SST/WAL file size, which allows multiple SST/WAL files to be stored in the same zone. Over time, this mismatch leaves zones partially invalidated and fragmented, causing low space efficiency and extensive data migration within a ZNS SSD device. In this paper, we propose a novel mechanism to solve these two problems.

4 DESIGN OF HI-ZNS

The basic idea of Hi-ZNS is to place a single SST/WAL file in a single zone. When the SST/WAL file is invalidated, the zone can be reset directly without any data migration or space amplification. The key technical problem is how to fully utilize the flash resources of a ZNS SSD device while still following the ZNS specification. In this section, we first illustrate the basic idea of Hi-ZNS; we then give an overview and introduce its details.

Figure 5
Figure 5: The basic idea of Hi-ZNS

Figure 5 shows the basic idea of Hi-ZNS. The ZNS specification requires that the zone size be fixed for a single ZNS SSD device. Hi-ZNS obeys the specification and introduces a physical zone space to self-manage the flash resources. Specifically, Hi-ZNS maintains a fixed zone size in the logical space. From a physical perspective, however, Hi-ZNS dynamically allocates flash blocks (grouped into strips, see the next subsection) on demand to accommodate the SST/WAL file. In other words, the logical zone size is not equal to the physical space mapped to specific flash blocks. When an SST/WAL file becomes invalid, the allocated flash blocks are erased directly without any data migration.

4.1 Hi-ZNS Overview

Figure 6 shows an overview of Hi-ZNS. Upper-level applications send Get, Put, and Delete requests to LSM-tree-based stores (such as RocksDB and LevelDB) to read, write, and delete key-value pairs on ZNS SSDs. Hi-ZNS spans ZenFS and the ZNS SSD to ensure efficient data access.

At the ZNS SSD device level, Hi-ZNS utilizes strips to manage the flash resources. A strip consists of one flash block from each flash chip (e.g., the blocks at block offset 0 across all chips), which is a more fine-grained resource management unit than the traditional zone. The Strip Allocator resides in the ZNS SSD device and records the valid/invalid status of all strips. Because Hi-ZNS maps strips to a zone at runtime, the zone mapping table records the mapping between logical zone numbers and strips.
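
A minimal sketch of this device-side bookkeeping is shown below. The class names, the bitmap-based allocator, and the table layout are our illustrative assumptions, not the exact Hi-ZNS implementation.

```cpp
// Device-side bookkeeping sketch: a strip allocator tracking free/used strips,
// and a zone mapping table recording which strips back each logical zone.
#include <cstdint>
#include <optional>
#include <unordered_map>
#include <vector>

class StripAllocator {
public:
    explicit StripAllocator(uint32_t total_strips) : used_(total_strips, false) {}

    // Circularly allocate the next free strip, if any remains.
    std::optional<uint32_t> allocate() {
        for (uint32_t n = 0; n < used_.size(); ++n) {
            uint32_t s = (next_ + n) % used_.size();
            if (!used_[s]) { used_[s] = true; next_ = s + 1; return s; }
        }
        return std::nullopt;                  // device full
    }

    void release(uint32_t strip) { used_[strip] = false; }  // on zone-reset

private:
    std::vector<bool> used_;
    uint32_t next_ = 0;
};

// Zone mapping table: logical zone number -> strips mapped to it at runtime.
using ZoneMappingTable = std::unordered_map<uint32_t, std::vector<uint32_t>>;
```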

At the ZenFS level, since the logical zone size is not equal to the mapped physical space in Hi-ZNS, the Device Capacity Maintainer records the actual available capacity of the ZNS SSD device. Zone Management is a modified ZenFS module that provides an infinite logical zone number to ensure that all the strips can be used.

Figure 6
Figure 6: Overview of Hi-ZNS

4.2 Allocating Strips on Demand

Hi-ZNS dynamically constructs a physical zone on demand. As shown in Figure 7(a), a logical zone is initially in the empty state and is not mapped to any strips within the device. When an SST/WAL file targets the zone, Hi-ZNS circularly allocates strips to the zone one by one at runtime. Once the SST/WAL file has been written, the zone has been constructed from several strips, and the corresponding mapping information is recorded in the zone mapping table, as shown in Figure 6.
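
The write path can be sketched as follows, reusing the illustrative StripAllocator above. The 16MB strip size matches the full-parallelism configuration evaluated in Section 5, and append_to_zone is a hypothetical helper, not Hi-ZNS's actual interface.

```cpp
// Sketch: a strip is mapped to the logical zone only when the strips already
// mapped to it are full, so physical resources grow with the file.
#include <cstdint>
#include <vector>

constexpr uint64_t kStripBytes = 16ull << 20;     // full-parallelism strip (16MB)

struct LogicalZone {
    uint32_t id;
    std::vector<uint32_t> strips;                 // filled lazily at runtime
    uint64_t write_pointer = 0;                   // bytes written so far
};

// Returns false if the device has no free strip left.
template <class Allocator>
bool append_to_zone(LogicalZone& z, uint64_t nbytes, Allocator& alloc) {
    uint64_t end = z.write_pointer + nbytes;
    while (z.strips.size() * kStripBytes < end) { // need one more strip
        auto s = alloc.allocate();
        if (!s) return false;
        z.strips.push_back(*s);                   // recorded in the zone mapping table
    }
    z.write_pointer = end;                        // the data itself goes to flash
    return true;
}
```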

Figure 7
Figure 7: Writing process and fragment

The strip is the allocation unit of Hi-ZNS. However, the SST/WAL file size is not always aligned with the strip size. Consequently, when an SST/WAL file is written to a zone, a small portion of the last strip remains unused, as illustrated in Figure 7(c). We call this portion a "fragment". We discuss fragments in Section 4.5 and evaluate them in Section 5.

4.3 Allocating Zones in ZenFS

Hi-ZNS discards the lifetime hints provided by RocksDB. Upon writing an SST/WAL file, Hi-ZNS allocates a new empty logical zone to hold the file. When the SST/WAL file has been fully written, the zone state transitions to full, a state defined by the ZNS specification that indicates the zone has been finished by the host using the zone-finish command. This ensures that a zone contains only one SST/WAL file at the ZenFS level, so the zone can be reset immediately without any data migration once the SST/WAL file is invalidated. Other types of files are small, so Hi-ZNS simply places them together in the same zone.
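
The policy can be summarized in the following sketch. The ZonedDevice interface and its members are placeholders standing in for the real zone operations, not ZenFS's actual classes.

```cpp
// Sketch of the Hi-ZNS allocation policy at the ZenFS level: every SST/WAL
// file gets its own fresh logical zone (lifetime hints are ignored); small
// auxiliary files (MANIFEST, CURRENT, ...) share a common zone.
#include <cstdint>

struct ZonedDevice {                                    // placeholder device interface
    uint32_t next_zone = 1;
    uint32_t open_new_zone()  { return next_zone++; }   // would allocate an empty zone
    void finish_zone(uint32_t) {}                       // would issue zone-finish
    void reset_zone(uint32_t)  {}                       // would issue zone-reset
    uint32_t shared_small_file_zone() { return 0; }     // zone reserved for tiny files
};

enum class FileKind { kSST, kWAL, kOther };

uint32_t allocate_zone_for_file(ZonedDevice& dev, FileKind kind) {
    if (kind == FileKind::kSST || kind == FileKind::kWAL)
        return dev.open_new_zone();          // one file per zone
    return dev.shared_small_file_zone();     // tiny files are grouped together
}

void on_file_deleted(ZonedDevice& dev, FileKind kind, uint32_t zone) {
    if (kind == FileKind::kSST || kind == FileKind::kWAL)
        dev.reset_zone(zone);                // immediate reset, no migration
}
```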

4.4 Infinite Logical Zone Number

Hi-ZNS uses one zone to store each SST/WAL file and builds the actual zone mapping table at runtime according to the SST/WAL file size. However, the SST/WAL file size is uncertain, as illustrated in Figure 3. When these files of uncertain size are placed into zones, each zone may be allocated a different number of strips. Therefore, Hi-ZNS cannot determine the exact number of strips for each zone in advance.

To ensure full use of the strips in a ZNS SSD device, Hi-ZNS regards the logical zone number as infinite. As long as strips remain, the logical zone number can keep increasing to use them. Hi-ZNS maintains the number of remaining strips in the Device Capacity Maintainer. At initialization, Hi-ZNS sends the "zbd report" command to the ZNS SSD device to obtain two parameters: (1) the total capacity of the ZNS SSD device and (2) the strip size. After that, Hi-ZNS tracks the number of available strips through the zone-finish and zone-reset commands defined by the ZNS specification.
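
A compact sketch of this accounting is shown below, with illustrative names; it only restates the bookkeeping rule described above (initialize from the reported capacity and strip size, decrement on zone-finish, increment on zone-reset).

```cpp
// Sketch of the Device Capacity Maintainer bookkeeping.
#include <cstdint>

class DeviceCapacityMaintainer {
public:
    // Initialized from the two parameters returned by the device report.
    DeviceCapacityMaintainer(uint64_t total_bytes, uint64_t strip_bytes)
        : strip_bytes_(strip_bytes), free_strips_(total_bytes / strip_bytes) {}

    // A zone was finished: the strips it consumed are no longer available.
    void on_zone_finish(uint64_t strips_used) { free_strips_ -= strips_used; }

    // A zone was reset: all of its strips return to the free pool.
    void on_zone_reset(uint64_t strips_used)  { free_strips_ += strips_used; }

    // A new logical zone may be handed out as long as strips remain.
    bool can_allocate_zone() const { return free_strips_ > 0; }

private:
    uint64_t strip_bytes_;
    uint64_t free_strips_;
};
```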

Algorithm 1

4.5 Parallelism and fragment

Hi-ZNS uses the strip as the basic unit to construct a zone. To leverage the internal parallelism of ZNS SSDs, a strip is composed of multiple flash blocks from different flash chips. In this paper, we mainly consider chip-level parallelism within a ZNS SSD device. Each strip has a "width" attribute, which corresponds to the number of flash blocks it contains. As shown in Figure 6, the strip width is M, and the full chip-level parallelism is also M. Hi-ZNS has two typical strip configurations, the full-parallelism strip and the partial-parallelism strip, as shown in Figure 8.

Figure 8
Figure 8: Full-parallelism strip and partial-parallelism strip

Full-parallelism strip. The strip width equals the total number of flash chips in the device, allowing each strip to utilize the full parallelism of the device. A full-parallelism strip has a larger capacity but tends to generate a larger fragment, because Hi-ZNS allocates strips on demand at runtime and a smaller strip matches SST/WAL file sizes more closely. Although full-parallelism strips increase the total fragment size, they still boost space efficiency without decreasing I/O performance (Section 5).

Partial-parallelism strip. The strip width is less than the total number of flash chips in the device, so each strip can exploit only a portion of the device's parallelism. As shown in Figure 8, partial-parallelism strips are smaller and can reduce the fragment size because they match SST/WAL file sizes more closely. However, partial-parallelism strips sacrifice parallelism and decrease the I/O performance of a ZNS SSD device. Therefore, Hi-ZNS uses the full-parallelism strip to construct zones by default; we discuss the partial-parallelism strip in Section 5.
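
As a concrete illustration of this trade-off, consider a hypothetical 70MB SST file under the strip sizes evaluated in Section 5.6 (16MB full-parallelism vs. 4MB one-fourth-parallelism); the file size is an assumed example, not a measured value:

```latex
\text{16MB strips: } \lceil 70/16 \rceil = 5 \ \Rightarrow\ 80\,\text{MB allocated},\ 10\,\text{MB fragment};
\qquad
\text{4MB strips: } \lceil 70/4 \rceil = 18 \ \Rightarrow\ 72\,\text{MB allocated},\ 2\,\text{MB fragment}.
```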

Figure 9
Figure 9: Comparison of Hi-ZNS and ZenFS (GC Enabled) in space utilization and data migration

5 EVALUATION

In this section, we first describe the experimental setup. Second, we analyze the performance of Hi-ZNS, including zero data migration, space utilization, and the maximum number of completed requests under several well-known workloads. Then, we evaluate the I/O performance of Hi-ZNS with full-parallelism strips. Finally, we discuss the fragment rate, the throughput under different strip types, and the mapping table overhead.

5.1 Experimental Setup

Table 1: Configuration

FEMU Configuration      FEMU version: 8.0; Linux kernel: 5.11
ZNS SSD Configuration   #channels: 8; #chips/channel: 2; #dies/chip: 1; #planes/die: 16;
                        #blocks/plane: 512; #pages/block: 256; page capacity: 4KB; total capacity: 128GB
Zone Information        Zone size: 512MB; total zone number: 256
Flash Latency           Read: 40us; program: 200us; erase: 2000us
Application             RocksDB version: 8.8; ZenFS version: 2.1.4
db_bench                TargetFileSizeBase: 512MB; TargetFileSizeMultiple: 1; KeySize: 128B; ValueSize: 8192B;
                        BackgroundFlushThreads: 8; BackgroundCompactThreads: 8; I/O mode: Direct; Seed: 1695295409170318
Host                    16 vCPU cores (2.40GHz); 192GB DRAM

We implement Hi-ZNS based on FEMU, a full-I/O-stack emulator. Table 1 shows the configuration of the target ZNS SSD device. The benchmark tool db_bench in RocksDB is used to generate different workloads for evaluating the effectiveness of Hi-ZNS; the parameters used by db_bench are detailed in Table 1. For the combined workloads, fillseq and fillrandom set the "num" parameter to 6 million and 12 million, respectively. These two workloads initialize the data on the device before running the overwrite and updaterandom workloads. For instance, the fillseq+overwrite workload first executes fillseq with 6 million requests, followed by the overwrite workload with a "num" parameter of 50 million.

To simplify the notation, we adopt abbreviated forms. Specifically, "fillseq+overwrite" is shortened to "seq+ow", "fillseq+updaterandom" is shortened to "seq+up", "fillrandom+overwrite" is shortened to "ran+ow", and "fillrandom+updaterandom" is shortened to "ran+up". Additionally, we add a garbage collection (GC) thread to observe whether data migration exists in Hi-ZNS. It is worth noting that all previous approaches incur data migration. Therefore, we mainly evaluate RocksDB over ZenFS with and without GC.

5.2 Zero Data Migration

As shown in Figure 9, Hi-ZNS achieves the highest space efficiency under the six different workloads. This efficiency comes from resetting a zone immediately upon deletion of its SST/WAL file, ensuring the timely release of space previously occupied by garbage data. While running the six workloads, the amount of garbage data retained within Hi-ZNS is negligible (at most 1.7MB); it comes from small files such as DBTMP, MANIFEST, and CURRENT, which reside together in the same zone and do not directly trigger a zone reset upon deletion. Therefore, the GC thread in Hi-ZNS never migrates any data.

5.3 Space Utilization

Figure 9 also demonstrates that Hi-ZNS can store more valid data in a ZNS SSD device of identical storage capacity. Furthermore, as shown in Figure 9, ZenFS (GC enabled) always stops running early because much of the garbage data cannot be collected, whereas Hi-ZNS extends the running time and writes more data into the ZNS SSD device. This is because Hi-ZNS resets a zone immediately after its SST/WAL file is deleted, thereby freeing space for new SST/WAL files more quickly. Therefore, Hi-ZNS substantially improves space efficiency.

5.4 Maximum Completed Requests

Figure 10
Figure 10: The number of completed requests

Figure 10 shows the number of completed requests. Thanks to its high space utilization, Hi-ZNS allows the ZNS SSD device to accommodate a larger volume of write operations, significantly increasing the total number of requests that RocksDB can process. Compared to ZenFS without garbage collection, Hi-ZNS substantially improves the number of completed requests: under the fillseq, fillrandom, seq+ow, seq+up, ran+ow, and ran+up workloads, the respective improvements are 2.16x, 1.76x, 2.97x, 2.96x, 2.28x, and 2.84x. Furthermore, when ZenFS enables garbage collection, it incurs data migration as a trade-off to complete more write requests; the volume of data migrated by ZenFS with GC is 17GB, 108GB, 126GB, 166GB, 104GB, and 118GB under the six workloads. Note that although ZenFS with GC improves the number of completed requests, Hi-ZNS still significantly outperforms it on all workloads.

Figure 11
Figure 11: I/O performance

5.5 I/O Performance

To evaluate the I/O performance of Hi-ZNS, we conduct experiments comparing it against the existing ZenFS, both without GC and with GC. Because the number of completed requests varies among Hi-ZNS, ZenFS, and ZenFS with GC, we use a segmented method to keep the key range and key distribution consistent. For instance, in the fillrandom workload, the "num" parameter in db_bench is consistently set to 50 million, but ZenFS without GC can only complete up to 19.5 million requests. Therefore, we evaluate the throughput averaged over the first 19 million requests. The other workloads follow the same testing method. The I/O performance results are illustrated in Figure 11.

In the comparison of Hi-ZNS and ZenFS without GC (Figure 11(a)), the I/O performance of the two is nearly identical across most workloads. This is because Hi-ZNS fully leverages the device's parallelism by using full-parallelism strips to construct a zone, ensuring no decrease in I/O performance.

In the comparison of Hi-ZNS and ZenFS with GC (Figure 11(b)), Hi-ZNS exhibits marginal improvements in each workload. Specifically, it shows improvements of 2.5% (fillseq), 4.1% (fillrandom), 4.3% (seq+ow), 5.4% (seq+up), 4.7% (ran+ow), and 7.0% (ran+up) compared to the baseline performance. The performance improvement is due to ZenFS's GC thread conducting data migration. Although Hi-ZNS also engages a GC thread, it performs no data migration, resulting in superior I/O performance.

Figure 12
Figure 12: Fragment rate under the different strip sizes

5.6 Fragment Discussion

In our emulated ZNS SSD device (Table 1), there are 8 channels with 2 chips per channel (16 flash chips in total). In this study, the full-parallelism strip comprises 16 flash blocks, one from each of the 16 flash chips. Each flash block has a capacity of 1MB, obtained by multiplying the page capacity (4KB) by the number of pages per block (256). We compare the performance of the full-parallelism strip (16MB), the semi-parallelism strip (8MB), and the one-fourth-parallelism strip (4MB).

Figure 12 shows the fragment rate of these three strip sizes under the six workloads. As described in Section 4, a smaller strip can decrease the fragment rate because it matches SST/WAL file sizes more closely. We also find that the fragment rate of Hi-ZNS remains low (2.5%-3.9%) across most workloads even with the full-parallelism strip (16MB). However, fillseq shows an interesting phenomenon: it has the highest fragment rate (27.4%) with the full-parallelism strip. In practice, the fillseq workload is unlikely to occur because it is a completely sequential access pattern. Furthermore, even though fillseq faces a higher fragment rate with the full-parallelism strip, the one-fourth-parallelism strip can lower its fragment rate to 3.2%.

Figure 13
Figure 13: Throughput under the different strip sizes
Figure 14
Figure 14: The number of the completed requests under the different strip sizes

Figure 13 shows the throughput of these three strip sizes under the six workloads. The full-parallelism strip utilizes the full parallelism of the device and achieves the highest throughput. Compared with the semi-parallelism strip and the one-fourth-parallelism strip, the full-parallelism strip improves I/O performance by 12.9% and 40.1% (fillseq), 11.9% and 25.6% (fillrandom), 7.3% and 27.7% (seq+ow), 2.2% and 14.9% (seq+up), 15.3% and 28.1% (ran+ow), and 8.5% and 22.6% (ran+up), respectively. Similarly, fillseq has the highest throughput due to its completely sequential access pattern.

Figure 14 shows the number of completed requests under the six workloads. Most workloads complete a similar number of requests, except fillseq. One result is counter-intuitive: we expected smaller fragments to allow more completed requests, yet for seq+ow the semi-parallelism strip (8MB) completes more requests than the one-fourth-parallelism strip (4MB). We attribute this to two factors: (1) both 8MB and 4MB strips already achieve high space utilization, and the further reduction in fragment rate is very small (about 0.9%); (2) it is hard to guarantee that the running environment is completely consistent from software to hardware (e.g., different seeds). Since these strip sizes complete a similar number of requests, we recommend using the full-parallelism strip whenever possible from the perspective of both performance and fragmentation.

5.7 Mapping Table Overhead

Consider an extreme scenario in which the host continually increases the logical zone number and writes only a small SST/WAL file into each zone. Each zone is then mapped to exactly one strip until the device is filled. Eventually, the number of logical zones equals the total number of strips in the ZNS SSD device, and the mapping table grows to its maximum size. Each zone entry in the mapping table has a base size of 8 bytes, and each strip requires an 8-byte pointer to locate it. For the 128GB ZNS SSD emulated in this paper, the mapping table is at most 128KB (8192 * (8B + 8B)), which occupies a mere 0.00009% of the total device capacity. In practice, the mapping table overhead is typically much lower, since such extreme scenarios are highly unlikely to occur.
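
This bound can be restated compactly from the Table 1 geometry and the 16MB full-parallelism strip of Section 5.6 (a worked restatement of the arithmetic above, not an additional result):

```latex
\#\text{strips} = \frac{128\,\text{GB}}{16\,\text{MB}} = 8192,
\qquad
\text{max table size} = 8192 \times (8\,\text{B} + 8\,\text{B}) = 128\,\text{KB}
\approx 10^{-6} \times \text{device capacity}.
```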

6 RELATED WORK

There are various previous works related to Hi-ZNS. Most of the related works lie in the following several areas: RocksDB on ZNS SSDs, space management at the application level, and small zones for reducing data migration. In this section, we discuss these related works.

6.1 RocksDB on ZNS SSDs

The emerging ZNS interface divides the logical address space into multiple zones and provides a series of new commands to upper-level applications for operating these zones. Compared with the traditional block interface, the ZNS interface is friendlier to flash media. Several recent studies deploy efficient space management at the application level to optimize the performance of LSM-tree-based stores (such as RocksDB [6]) using these new ZNS commands. Furthermore, ZenFS [2] is a file system plugin that utilizes RocksDB's file system interface to place files into zones on a raw zoned block device. In ZenFS, the zone allocation strategy is based on the lifetime hints provided by RocksDB: files with the same lifetime hint are preferentially placed within the same zone. Compared to random file placement, this strategy ensures that data within a zone has a more uniform lifetime, accelerating zone reclamation and garbage cleanup. In this paper, we use db_bench in RocksDB to generate different workloads to evaluate the effectiveness of Hi-ZNS.

6.2 Space Management at The Application Level

Since files with the same lifetime hint do not always have closely aligned lifetimes [9], ZenFS still encounters space amplification and frequent garbage collection issues. To alleviate these issues, several studies concentrate on distinguishing the lifetimes of SSTs in a more fine-grained manner [11] [16] [12], aiming to allocate data with similar lifetimes to the same zone. ZoneKV [16] utilizes a lifetime-based zone storage model and a level-specific zone allocation algorithm to reduce space amplification and maintain higher throughput. Lee et al. [12] propose a novel compaction-aware zone allocation algorithm that co-locates data with overlapping key ranges in the same zone, thus providing a more precise prediction of SSTable lifetimes. LL-Compaction [11] proposes a lifetime-leveling compaction algorithm for LSM-trees on ZNS SSDs, which allocates dedicated zones to each level, since different levels have SSTs with different lifetimes, and avoids mixing long-lived and short-lived SSTs in the same zone. Unfortunately, it is hard to place SST/WAL files with exactly the same lifetime into a zone, and these approaches fail to completely eliminate data migration.

6.3 Small Zones for Reducing Data Migration

A small zone configuration means the zone size is made as small as possible. With a smaller zone size, a file can easily fill an entire zone, and the data within a small zone is more likely to be invalidated at the same time. When zone-reset commands arrive, ZNS SSD devices can directly erase all the flash blocks within the small zones without redundant data migration. Small zone configurations also give upper-level applications more freedom in data placement than large zone configurations. Bae et al. [1] analyze the challenges of ZNS with small zones and study a simple but efficient scheduling mechanism that improves parallelism by being aware of inter-zone interference. SplitZNS [9] introduces small zones by adjusting the zone-to-chip mapping to enhance garbage collection efficiency for LSM-trees on ZNS SSDs. Capitalizing on the multi-level nature of LSM-trees and the inherently parallel architecture of ZNS SSDs, SplitZNS proposes several techniques to harness and accelerate small zones, thereby mitigating the performance impact of underutilized parallelism. FlexZNS [21] provides reliable zoned storage that allows host software to configure the zone size flexibly, as well as multiple zone sizes, to reduce the overhead of data migration during zone garbage collection. Han et al. [8] propose a new checkpoint scheme based on ZNS SSDs with small zones, which reduces interference when multiple Docker containers perform checkpoints concurrently, via workload separation and performance isolation.

However, a small zone configuration fails to fully exploit the multi-level parallelism inherent in flash media, leading to diminished read and write performance of ZNS SSDs. Furthermore, the small zone configuration still cannot completely eliminate data migration, because the zone size remains fixed and never aligns with the varying SST/WAL file sizes. In this paper, Hi-ZNS leverages the full parallelism of flash media to construct a zone and allocates resources on demand, which guarantees both high access performance and high space efficiency.

7 CONCLUSION

This paper presents Hi-ZNS, a high space efficiency and zero-copy mechanism for LSM-tree-based stores on ZNS SSDs. The basic idea of Hi-ZNS is to allocate a single zone for each SST/WAL file. By allocating physical resources on demand and providing an infinite logical zone number, Hi-ZNS avoids storage resource wastage. Extensive evaluations have shown that Hi-ZNS significantly enhances space efficiency and completely eliminates data migration.

ACKNOWLEDGMENTS

We would like to thank the anonymous reviewers for their valuable feedback and improvements to this paper. This work was supported by the Science and Technology Research Program of Chongqing Municipal Education Commission (Grant Nos. KJQN202300641, KJQN202300643) and the Natural Science Foundation of Chongqing (2022NSCQ-MSX2502).

REFERENCES

  • Hanyeoreum Bae, Jiseon Kim, Miryeong Kwon, and Myoungsoo Jung. 2022. What you can't forget: exploiting parallelism for zoned namespaces. In Proceedings of the 14th ACM Workshop on Hot Topics in Storage and File Systems. 79–85.
  • Matias Bjørling, Abutalib Aghayev, Hans Holmberg, Aravind Ramesh, Damien Le Moal, Gregory R Ganger, and George Amvrosiadis. 2021. ZNS: Avoiding the block interface tax for flash-based SSDs. In 2021 USENIX Annual Technical Conference (USENIX ATC’21). 689–703.
  • Sungjin Byeon, Joseph Ro, Safdar Jamil, Jeong-Uk Kang, and Youngjae Kim. 2023. A free-space adaptive runtime zone-reset algorithm for enhanced ZNS efficiency. In Proceedings of the 15th ACM Workshop on Hot Topics in Storage and File Systems. 109–115.
  • Gunhee Choi, Kwanghee Lee, Myunghoon Oh, Jongmoo Choi, Jhuyeong Jhin, and Yongseok Oh. 2020. A New LSM-style Garbage Collection Scheme for ZNS SSDs. In 12th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage 20).
  • Jinhua Cui, Youtao Zhang, Liang Shi, Chun Jason Xue, Weiguo Wu, and Jun Yang. 2017. Approxftl: On the performance and lifetime improvement of 3-d nand flash-based ssds. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD) 37, 10 (2017), 1957–1970.
  • Siying Dong, Andrew Kryczka, Yanqin Jin, and Michael Stumm. 2021. Rocksdb: Evolution of development priorities in a key-value store serving large-scale applications. ACM Transactions on Storage (TOS) 17, 4 (2021), 1–32.
  • Kyuhwa Han, Hyunho Gwak, Dongkun Shin, and Jooyoung Hwang. 2021. ZNS+: Advanced zoned namespace interface for supporting in-storage zone compaction. In 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI’21). 147–162.
  • Yejin Han, Myunghoon Oh, Jaedong Lee, Seehwan Yoo, Bryan S Kim, and Jongmoo Choi. 2023. Achieving Performance Isolation in Docker Environments with ZNS SSDs. In 2023 IEEE 12th Non-Volatile Memory Systems and Applications Symposium (NVMSA). IEEE, 25–31.
  • Dong Huang, Dan Feng, Qiankun Liu, Bo Ding, Wei Zhao, Xueliang Wei, and Wei Tong. 2023. SplitZNS: Towards an efficient LSM-tree on zoned namespace SSDs. ACM Transactions on Architecture and Code Optimization 20, 3 (2023), 1–26.
  • Cheng Ji, Li-Pin Chang, Liang Shi, Chao Wu, Qiao Li, and Chun Jason Xue. 2016. An empirical study of File-System fragmentation in mobile storage systems. In 8th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage 16).
  • Jeeyoon Jung and Dongkun Shin. 2022. Lifetime-leveling LSM-tree compaction for ZNS SSD. In Proceedings of the 14th ACM Workshop on Hot Topics in Storage and File Systems (HotStorage’22). 100–105.
  • Hee-Rock Lee, Chang-Gyu Lee, Seungjin Lee, and Youngjae Kim. 2022. Compaction-aware zone allocation for LSM based key-value store on ZNS SSDs. In Proceedings of the 14th ACM Workshop on Hot Topics in Storage and File Systems. 93–99.
  • Biyong Liu, Yuan Xia, Xueliang Wei, and Wei Tong. 2023. LifetimeKV: Narrowing the Lifetime Gap of SSTs in LSMT-based KV Stores for ZNS SSDs. In 2023 IEEE 41st International Conference on Computer Design (ICCD). IEEE, 300–307.
  • Renping Liu, Zhenhua Tan, Yan Shen, Linbo Long, and Duo Liu. 2022. Fair-ZNS: Enhancing fairness in ZNS SSDs through self-balancing I/O scheduling. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (2022).
  • Linbo Long, Shuiyong He, Jingcheng Shen, Renping Liu, Zhenhua Tan, Congming Gao, Duo Liu, Kan Zhong, and Yi Jiang. 2024. WA-Zone: Wear-Aware Zone Management Optimization for LSM-Tree on ZNS SSDs. ACM Transactions on Architecture and Code Optimization 21, 1 (2024), 1–23.
  • Mingchen Lu, Peiquan Jin, Xiaoliang Wang, Yongping Luo, and Kuankuan Guo. 2023. ZoneKV: A Space-Efficient Key-Value Store for ZNS SSDs. In 2023 60th ACM/IEEE Design Automation Conference (DAC’23). IEEE, 1–6.
  • Jaehong Min, Chenxingyu Zhao, Ming Liu, and Arvind Krishnamurthy. 2023. eZNS: An Elastic Zoned Namespace for Commodity ZNS SSDs. In 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI’23). 461–477.
  • Devashish Purandare, Pete Wilcox, Heiner Litz, and Shel Finkelstein. 2022. Append is near: Log-based data management on ZNS SSDs. In 12th Annual Conference on Innovative Data Systems Research (CIDR’22).
  • Dongjoo Seo, Ping-Xiang Chen, Huaicheng Li, Matias Bjørling, and Nikil Dutt. 2023. Is garbage collection overhead gone? case study of F2FS on ZNS SSDs. In Proceedings of the 15th ACM Workshop on Hot Topics in Storage and File Systems. 102–108.
  • Zhenhua Tan, Linbo Long, Renping Liu, Congming Gao, Yi Jiang, and Yan Liu. 2023. Optimizing Data Migration for Garbage Collection in ZNS SSDs. In 2023 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 1–2.
  • Yu Wang, You Zhou, Zhonghai Lu, Xiaoyi Zhang, Kun Wang, Feng Zhu, Shu Li, Changsheng Xie, and Fei Wu. 2023. FlexZNS: Building High-Performance ZNS SSDs with Size-Flexible and Parity-Protected Zones. In 2023 IEEE 41st International Conference on Computer Design (ICCD). IEEE, 291–299.
  • Chao Wu, Cheng Ji, Qiao Li, Congming Gao, Riwei Pan, Chenchen Fu, Liang Shi, and Chun Jason Xue. 2019. Maximizing I/O throughput and minimizing performance variation via reinforcement learning based I/O merging for SSDs. IEEE Trans. Comput. 69, 1 (2019), 72–86.
  • Denghui Wu, Biyong Liu, Wei Zhao, and Wei Tong. 2022. ZNSKV: Reducing data migration in LSMT-based KV stores on ZNS SSDs. In 2022 IEEE 40th International Conference on Computer Design (ICCD). IEEE, 411–414.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).

ICPP '24, August 12–15, 2024, Gotland, Sweden

© 2024 Copyright held by the owner/author(s).
ACM ISBN 979-8-4007-1793-2/24/08.
DOI: https://doi.org/10.1145/3673038.3673096