
Achieving Tunable Erasure Coding with Cluster-Aware Redundancy Transitioning

Published: 14 September 2024

Abstract

Erasure coding has been demonstrated as a storage-efficient means against failures, yet its tunability remains a challenging issue in data centers, where redundancy transitioning is prone to induce substantial cross-cluster traffic. In this article, we present ClusterRT, a cluster-aware redundancy transitioning approach that can dynamically tailor the redundancy degree of erasure coding in data centers. ClusterRT formulates the data relocation as the maximum flow problem to reduce cross-cluster data transfers. It then designs a parity-coordinated update algorithm, which gathers the parity chunks within the same cluster and leverages encoding dependency to further decrease the cross-cluster update traffic. ClusterRT finally rotates the parity chunks to balance the cross-cluster transitioning traffic across the data center. Large-scale simulation and Alibaba Cloud ECS experiments show that ClusterRT reduces the transitioning traffic by 94.0% to 96.2% and the transitioning time by 70.4% to 88.4%.

1 Introduction

Modern storage systems are usually deployed across thousands of storage nodes to support a variety of upper-layer applications, making failures, once considered exceptional, now commonplace [10, 18, 42]. To ensure data reliability against unexpected failures, today’s storage systems [1, 2, 3, 15, 24, 25] extensively rely on erasure coding [6], which can achieve higher data reliability than conventional replication under the same storage overhead [36]. Generally, erasure coding can be configured via two integers, namely \(k\) and \(m\). A \((k,m)\) erasure code takes \(k\) data chunks each time to generate \(m\) additional redundant chunks (called parity chunks), such that the resulting \(k+m\) chunks collectively form a stripe (i.e., the coding group for fault tolerance). The \((k,m)\) erasure code ensures that the \(k\) original (uncoded) data chunks are always recoverable from any \(k\) chunks within the same stripe.
In addition to guaranteeing data reliability, erasure coding is also required to be highly scalable, implying that it should support efficient switching between different erasure codes. Specifically, existing storage systems often deploy different erasure codes to balance the access performance [35, 41] (or the reliability guarantee [19, 20]) and the storage overhead. For example, disk reliability varies as disks age (e.g., the annual failure rate follows a bathtub curve); in such cases, storage systems usually have to adjust the encoding parameters to sustain the same level of data reliability.
In this article, we study the problem of redundancy transitioning [19, 20, 35, 38, 39, 40, 41], which stretches the stripes by adding more data chunks without compromising the number of tolerable failures (i.e., increasing \(k\) to \(k^{\prime }\) and keeping \(m\) untouched, where \(k^{\prime }\gt k\)). Redundancy transitioning is prone to introducing substantial transitioning traffic (defined as the amount of data transferred over the network in redundancy transitioning), as it comprises two I/O-intensive operations: (1) data relocation, which decomposes some stripes and moves their data chunks to the associated nodes to complement the stretched stripes, and (2) parity update, which transmits the newly added data chunks of each stretched stripe across nodes to update the corresponding parity chunks, so as to maintain the fault tolerance of the new erasure code.
Redundancy transitioning becomes more complicated in data centers, in which nodes are first organized into clusters and multiple clusters are further interconnected via the network core. Such a hierarchical network structure results in the bandwidth diversity phenomenon [31], where the cross-cluster network bandwidth is usually oversubscribed and is much more stringent than the intra-cluster bandwidth (see Section 2.1). Hence, when examining the performance of existing redundancy transitioning approaches in data centers, we uncover that they still suffer from three key limitations: (L1) they impose tight restrictions on practical deployment, (L2) they are prone to introducing substantial cross-cluster transitioning traffic (defined as the amount of data transmitted across clusters for redundancy transitioning), and (L3) they lead to unbalanced cross-cluster transitioning traffic. How to achieve efficient and tunable erasure coding in data centers is unfortunately largely overlooked by existing studies.
We design ClusterRT, a Cluster-aware Redundancy Transitioning approach that aims to minimize and balance the cross-cluster transitioning traffic. ClusterRT first carefully establishes a network based on the stripe distribution. It then formulates the data relocation in redundancy transitioning as the maximum flow problem, which guides the data relocation to minimize the cross-cluster data transfers. ClusterRT also proves that the least cross-cluster update traffic of a single stripe can be achieved by gathering all the associated parity chunks within the same cluster. It then designs a parity-coordinated update approach, which leverages the encoding dependency to further reduce the cross-cluster update traffic for multiple stripes. ClusterRT finally proposes to rotate the identity of the parity cluster (i.e., the cluster that solely stores parity chunks of a transitioning group) to balance the cross-cluster download traffic. Our major contributions can be summarized as follows:
We conduct experiments in real-world data centers and uncover that existing redundancy transitioning schemes are prone to introducing substantial and unbalanced cross-cluster transitioning traffic (Section 2.3).
We present ClusterRT, a cluster-aware redundancy transitioning approach for data centers. ClusterRT formulates the data relocation as a maximum flow problem and designs a parity-coordinated update algorithm, so as to reduce the cross-cluster transitioning traffic. ClusterRT further rotates the identity of the parity cluster to balance the cross-cluster transitioning traffic (Section 3).
We implement a ClusterRT prototype in C++ with around 2,000 lines of code (Section 4).
We evaluate ClusterRT via large-scale simulation and Alibaba Cloud Elastic Compute Service (ECS) [4] experiments. We show that ClusterRT reduces the transitioning traffic by 94.0% to 96.2% and shortens the transitioning time by 70.4% to 88.4% (Section 5).
The source code of the ClusterRT prototype is available for download at https://github.com/zhangfeng406/ClusterRT.

2 Background and Motivation

2.1 Data Centers and Erasure Coding

Data centers: We consider a data center with a two-layer hierarchical architecture: (1) in the first layer, nodes are grouped into clusters, and the nodes of each cluster are interconnected via a common switch; (2) in the second layer, the clusters are further connected via a network core. The hierarchical architecture results in the bandwidth diversity phenomenon, where the cross-cluster network bandwidth is often oversubscribed (e.g., caused by replication writes [8] and shuffles in MapReduce jobs [5]) and is shown to be much scarcer than the intra-cluster network bandwidth. It is reported that the oversubscription ratio, calculated as the ratio of the intra-cluster network bandwidth to the cross-cluster network bandwidth, usually ranges from 5 to 20 [12], and can even reach 240 in some extreme cases [12]. Figure 1 shows an example of a data center with four clusters, where each cluster comprises four nodes.
Fig. 1. Example of a data center with a \((6,3)\) code. Each cluster can store at most three chunks for cluster-level fault tolerance.
Erasure coding: Erasure coding is usually defined via two positive integers, namely \(k\) and \(m\), which configure the storage overhead and the fault tolerance degree. A \((k,m)\) erasure code encodes \(k\) equal-sized data chunks \(\lbrace D_1, D_2, \ldots , D_{k}\rbrace\) at a time to generate \(m\) additional parity chunks \(\lbrace P_1, P_2, \ldots , P_{m}\rbrace\), where \(P_i=\sum _{j=1}^{k} \alpha _{i,j}\cdot D_j\) and \(\alpha _{i,j}\) is the coefficient assigned to \(D_j\) in the calculation of \(P_i\) (\(1\le i\le m\)); that is, erasure coding creates an encoding dependency among the \(k\) data chunks and \(m\) parity chunks. Such \(k+m\) dependent data and parity chunks form a stripe, promising that any \(k\) out of the \(k+m\) chunks within the same stripe always suffice to decode the original \(k\) data chunks (a.k.a. the maximum distance separable (MDS) property [21]). In other words, a \((k,m)\) erasure code can resist any \(m\) chunk failures at the storage overhead of \(\frac{k+m}{k}\) (i.e., \(\frac{k+m}{k}\) times the original uncoded data). Thus, by distributing the \(k+m\) chunks of each stripe across \(k+m\) nodes, the \((k,m)\) code can ensure data reliability against any \(m\) node failures; furthermore, if we place at most \(m\) chunks of every stripe in a cluster, we can tolerate any single cluster failure, as we can always find \(k\) surviving chunks in the other available clusters for repair (aside from the failed one). This distribution promises single-cluster fault tolerance with optimized repair and update operations, and has been extensively used in prior studies [11, 13, 14, 29, 30, 31, 38, 40]. Some studies also prove that this distribution can achieve higher data reliability than the flat placement (i.e., placing the \(k+m\) chunks of a stripe in \(k+m\) different clusters) [14] and the random placement [29, 30]. Figure 1 shows the placement of a stripe (encoded with the \((6,3)\) code) in the data center, which can tolerate any single cluster failure, as it stores at most three (i.e., \(m\)) chunks within a cluster.
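To make the placement rule concrete, the following minimal sketch (not taken from the ClusterRT prototype) checks whether a stripe placement tolerates any single cluster failure under a \((k,m)\) code, i.e., whether no cluster holds more than \(m\) of the stripe's \(k+m\) chunks; the cluster assignment in main is an illustrative one that satisfies Figure 1's constraint:

#include <cstdio>
#include <map>
#include <vector>

// chunk_to_cluster[c] gives the cluster id storing the c-th chunk of one stripe.
bool tolerates_single_cluster_failure(const std::vector<int>& chunk_to_cluster, int m) {
    std::map<int, int> per_cluster;                       // cluster id -> number of chunks stored there
    for (int cluster : chunk_to_cluster) ++per_cluster[cluster];
    for (const auto& [cluster, count] : per_cluster)
        if (count > m) return false;                      // losing this cluster would leave fewer than k chunks
    return true;
}

int main() {
    // A (6,3) stripe spread over four clusters with at most three chunks per cluster (illustrative).
    std::vector<int> placement = {0, 0, 0, 1, 1, 1, 2, 2, 3};
    std::printf("%s\n", tolerates_single_cluster_failure(placement, 3) ? "safe" : "unsafe");
}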

2.2 Redundancy Transitioning

Redundancy transitioning can balance the fault tolerance and the storage overhead by adjusting the values of \(k\) and \(m\) in the presence of reliability variance and workload changes [39]. Like many existing transitioning studies [22, 23, 32, 40, 43], we mainly consider the redundancy transitioning that stretches the stripes by increasing the value of \(k\) to reduce the storage overhead (i.e., increasing \(k\) to \(k^{\prime }\), where \(k^{\prime }\gt k\)), without affecting the number of tolerable failures (i.e., keeping \(m\) unchanged). We pose the shortening of the stripes (i.e., removing data chunks from stripes to make them shorter) as our future work.
Definition of redundancy transitioning: We first formulate the \((k,m,k^{\prime })\)-transitioning problem, where \(k^{\prime }\gt k\). The \((k,m,k^{\prime })\)-transitioning transforms the \((k,m)\) code into the \((k^{\prime },m)\) code by adding another \(k^{\prime }-k\) data chunks to each stretched stripe; it then stores at most \(m\) chunks of each stretched stripe in any cluster, thereby maintaining the single-cluster fault tolerance and reducing the storage overhead from \(\frac{k+m}{k}\) (before transitioning) to \(\frac{k^{\prime }+m}{k^{\prime }}\) (after transitioning).
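For instance, under an illustrative \((6,3,12)\)-transitioning, the storage overhead drops from \(\frac{6+3}{6}=1.5\times\) to \(\frac{12+3}{12}=1.25\times\) the size of the original data.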
To stretch the stripes, the \((k,m,k^{\prime })\)-transitioning needs to perform two I/O-intensive operations, namely data relocation and parity update. In the data relocation, it decomposes some stripes encoded by the \((k,m)\) code (called decomposed stripes) and relocates the decoupled data chunks to stretch the other stripes (called stretched stripes), making them comprise \(k^{\prime }\) data chunks after stretching. Notice that the data relocation has to ensure that at most \(m\) of the \(k^{\prime }+m\) chunks of a stretched stripe are still stored in any cluster, so as to tolerate a single cluster failure after transitioning.
Figure 2 depicts the \((2,2,3)\)-transitioning with three stripes encoded by the \((2,2)\) code, where each cluster can store at most two chunks of a stripe. We decompose the stripe \(\overline{S}_1\) to stretch \(S_1\) and \(S_2\). We first add \(D_6\) (stored in the cluster \(L_3\)) to complement \(S_1\) without any cross-cluster data transmission. For the data chunk \(D_5\), as \(S_2\) already has two chunks stored in the cluster \(L_1\), we choose to migrate \(D_5\) to \(L_2\), thereby ensuring cluster-level fault tolerance after transitioning. The data relocation therefore transmits one chunk across clusters.
Fig. 2. Example of the (2,2,3)-transitioning. This transitioning transmits three chunks in total across clusters.
After data relocation, we must promptly calculate the new parity chunks of each stretched stripe to establish the encoding dependency between the parity chunks and the \(k^{\prime }\) data chunks, which can be given by
\begin{equation} P_i^{\prime } = \sum _{j=1}^{k^{\prime }} \alpha _{i,j}\cdot D_j = P_i + \sum _{j=k+1}^{k^{\prime }} \alpha _{i,j}\cdot D_j = P_i + \Delta P_i. \end{equation}
(1)
Equation (1) implies that we can calculate a new parity chunk (i.e., \(P_i^{\prime }\), where \(1\le i\le m\)) based on the old one (i.e., \(P_i\)) and the corresponding parity delta chunk (i.e., \(\Delta P_i=\sum _{j=k+1}^{k^{\prime }} \alpha _{i,j}D_j\)), rather than retrieving all the \(k^{\prime }\) data chunks from remote nodes. Figure 2(b) shows the parity update in \((2,2,3)\)-transitioning: (1) for the stretched stripe \(S_1\), we update the associated parity chunks \(P_1\) and \(P_2\) using two parity delta chunks generated from \(D_6\), and (2) for \(S_2\), we update \(P_3\) and \(P_4\) using the parity delta chunks calculated from \(D_5\). The parity update finally transmits two chunks across clusters.
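As an illustration of Equation (1), the sketch below applies a parity delta in GF(\(2^8\)) with the primitive polynomial \(x^8+x^4+x^3+x^2+1\); the coefficients \(\alpha_{i,j}\) would come from the code's generator matrix (the prototype in Section 4 relies on Jerasure for this arithmetic), so the coefficient and chunk contents below are placeholders:

#include <cstdint>
#include <cstdio>
#include <vector>

// Multiplication in GF(2^8) modulo x^8 + x^4 + x^3 + x^2 + 1 (0x11d); addition is XOR.
uint8_t gf_mul(uint8_t a, uint8_t b) {
    uint8_t p = 0;
    while (b) {
        if (b & 1) p ^= a;
        uint8_t hi = a & 0x80;
        a <<= 1;
        if (hi) a ^= 0x1d;     // modular reduction
        b >>= 1;
    }
    return p;
}

// P'_i = P_i + ΔP_i, where ΔP_i = Σ_j α_{i,j} · D_j over the newly added data chunks.
void apply_parity_delta(std::vector<uint8_t>& parity_chunk,
                        const std::vector<std::vector<uint8_t>>& new_data_chunks,
                        const std::vector<uint8_t>& alpha) {
    for (size_t j = 0; j < new_data_chunks.size(); ++j)
        for (size_t b = 0; b < parity_chunk.size(); ++b)
            parity_chunk[b] ^= gf_mul(alpha[j], new_data_chunks[j][b]);
}

int main() {
    std::vector<uint8_t> P(4, 0x00);                       // old parity chunk (4 bytes for brevity)
    std::vector<std::vector<uint8_t>> D = {{1, 2, 3, 4}};  // one newly added data chunk
    std::vector<uint8_t> alpha = {0x03};                   // placeholder coefficient
    apply_parity_delta(P, D, alpha);
    for (uint8_t byte : P) std::printf("%02x ", byte);     // prints: 03 06 05 0c
    std::printf("\n");
}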

2.3 Limitations of Existing Efforts

To study the performance of state-of-the-art redundancy transitioning approaches in data centers, we conduct experiments on Alibaba Cloud [4]. We choose the \((6,3)\) code [3] as the coding scheme before transitioning and gradually increase the value of \(k\) from 6 to 26. We set the chunk size to 64 MB [3]. More configurations are shown in Section 5.2. We uncover that existing redundancy transitioning approaches still have three key limitations, in terms of the restricted configurations (L1), substantial cross-cluster transitioning traffic (L2), and unbalanced cross-cluster transitioning traffic (L3).
Limitation 1 (L1): Having restrictions in practical deployments. Most existing redundancy transitioning approaches have strict requirements on the selection of the transitioning parameters \(k\) and \(k^{\prime }\) (in terms of both their values and when they must be configured) and hence cannot support dynamic redundancy transitioning in practice. For example, stripe merging [43] combines \(x\) (where \(x\) is an integer and \(x\gt 1\)) shorter stripes encoded by the \((k,m)\) code into a larger one encoded by the \((xk,m)\) code; that is, it requires \(k^{\prime }\) to be a multiple of \(k\). In addition, SRS [32] and ERS [39] require fixing the values of \(k\) and \(k^{\prime }\) before data storage, which may no longer be the best option for the scenarios encountered after deployment.
Limitation 2 (L2): Substantial cross-cluster transitioning traffic. Redundancy transitioning needs to relocate data chunks and update parity chunks (Section 2.2). While some redundancy transitioning approaches (e.g., SRS [32] and ERS [39]) eliminate the data relocation traffic, they still incur substantial cross-cluster traffic under successive redundancy transitioning operations for two fundamental reasons. First, both SRS and ERS propose a dedicated placement for given transitioning parameters (i.e., \(k\), \(m\), and \(k^{\prime }\)). When the parameter \(k^{\prime }\) continues to increase, the placement must be rearranged accordingly, thereby incurring additional placement adjustment traffic. Second, most existing redundancy transitioning approaches [32, 39, 43] overlook the bandwidth diversity property in data centers; directly deploying them in data centers easily introduces massive cross-cluster transitioning traffic.
Figure 3 shows the average cross-cluster traffic caused by SRS and ERS when stretching a stripe, which delivers two pieces of information. First, both SRS and ERS experience a significant increase in cross-cluster transitioning traffic, starting from the second redundancy transitioning operation. Second, the cross-cluster transitioning traffic of SRS is mainly attributed to the parity updates, while for ERS, it is primarily dominated by the data relocation.
Fig. 3. L2 (substantial cross-cluster transitioning traffic).
In the (12,3,15)-transitioning, we observe that ERS and SRS introduce redundancy transitioning traffic that approaches the optimum. This is primarily due to the assumption of single-cluster fault tolerance, under which a cluster is allowed to store at most \(m\) chunks per stripe. When both \(k\) and \(k^{\prime }\) are multiples of \(m\), the contiguous data layout of ERS and SRS further results in a substantial reduction in the layout adjustment traffic.
Limitation 3 (L3): Unbalanced cross-cluster traffic. In practice, redundancy transitioning has to manipulate multiple stripes distributed across different clusters. Existing redundancy transitioning approaches [32, 39, 43] strive to reduce the transitioning traffic, yet they may introduce unbalanced transitioning traffic across clusters. We study the load balancing ratio (the ratio of the cross-cluster traffic on the heaviest cluster to the average across clusters) of SRS and ERS in data centers. Figure 4 shows that the average load balancing ratio of SRS and ERS ranges from 1.50 to 2.41, which far exceeds the optimum (i.e., one).
Fig. 4. L3 (unbalanced cross-cluster traffic).
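For reference, the load balancing ratio can be computed as sketched below (the per-cluster traffic values are hypothetical); a ratio of one means perfectly balanced cross-cluster traffic:

#include <algorithm>
#include <cstdio>
#include <numeric>
#include <vector>

double load_balancing_ratio(const std::vector<double>& traffic_per_cluster) {
    double max = *std::max_element(traffic_per_cluster.begin(), traffic_per_cluster.end());
    double avg = std::accumulate(traffic_per_cluster.begin(), traffic_per_cluster.end(), 0.0) /
                 traffic_per_cluster.size();
    return max / avg;   // traffic on the heaviest cluster over the average across clusters
}

int main() {
    std::printf("%.2f\n", load_balancing_ratio({4.0, 1.0, 1.0, 2.0}));  // 2.00: unbalanced
    std::printf("%.2f\n", load_balancing_ratio({2.0, 2.0, 2.0, 2.0}));  // 1.00: optimal
}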

3 ClusterRT Design

We present ClusterRT, a cluster-aware redundancy transitioning approach to achieve dynamic and tunable erasure coding in data centers. ClusterRT supports general configurations after the deployment of erasure coding (i.e., L1 addressed). It reduces cross-cluster relocation and update traffic, hence accelerating the transitioning process (i.e., L2 addressed). It finally rotates the distributions of parity chunks to balance the cross-cluster transitioning traffic (i.e., L3 addressed).

3.1 Preliminary

To facilitate the understanding of the ClusterRT designs, we first elaborate on the preliminaries of ClusterRT. Table 1 lists the major symbols used throughout this article and their descriptions.
Notation: Description
Defined in Section 2
\(k\): number of data chunks in a stripe
\(m\): number of parity chunks in a stripe
\(k^{\prime }\): number of data chunks in a stripe after transitioning
\(S_i\): the \(i\)th stretched stripe
\(\overline{S}_i\): the \(i\)th decomposed stripe
\(D_i\): the \(i\)th data chunk
\(P_i\): the \(i\)th parity chunk
\(L_{i}\): the \(i\)th storage cluster
Defined in Section 3
\(t\): number of stretched stripes
\(e\): number of decomposed stripes
\(r\): number of clusters
\(l\): number of local parity chunks in a stripe
\(g\): number of global parity chunks in a stripe
\(L^*\): a cluster containing parity chunks
\(O_i\): the \(i\)th local parity chunk
\(G_i\): the \(i\)th global parity chunk
\(\mathcal {G}\): the transitioning group
\(\mathcal {T}\): a set of \(t\) stripes to be stretched
\(\mathcal {E}\): a set of \(e\) stripes to be decomposed
Table 1. Major Notations Used in This Article
Assumptions: ClusterRT builds on the following assumptions. First, we assume that a cluster can store either the data chunks or parity chunks of a stripe, rather than a combination of them. Second, the placement of each stripe must promise the cluster-level fault tolerance [14, 31]. Third, we mainly focus on Reed–Solomon codes [28] and locally repairable codes [15, 26], which are the most popular erasure codes in production systems.
Transitioning group: We first define the transitioning group (Tgroup), which is a basic unit to perform redundancy transitioning. A Tgroup \(\mathcal {G}\) consists of \(t\) stretched stripes (denoted by \(\lbrace S_1, S_2,\ldots , S_t\rbrace\)) and another \(e\) decomposed stripes (denoted by \(\lbrace \overline{S}_1, \overline{S}_2, \ldots , \overline{S}_e\rbrace\)), satisfying the following equation:
\begin{equation} \frac{t}{e}=\frac{k}{k^{\prime }-k}. \end{equation}
(2)
Equation (2) ensures that the \(e\) decomposed stripes in a Tgroup can provide \(e\cdot k\) data chunks to precisely complement the \(t\) stretched stripes, each of which receives exactly \(k^{\prime }-k\) data chunks, hence generating another \(t\) new stripes \(\lbrace S_1^{\prime }, S_2^{\prime }, \ldots , S_t^{\prime }\rbrace\) encoded by the \((k^{\prime },m)\) code.
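A minimal sketch of how the smallest Tgroup satisfying Equation (2) can be derived (any integer multiple of this pair also forms a valid Tgroup); the \((3,2,5)\) setting below matches the example used in Figures 5 to 7:

#include <cstdio>
#include <numeric>

// Reduce t/e = k/(k'-k) by gcd(k, k'-k) to obtain the smallest valid Tgroup.
void smallest_tgroup(int k, int k_prime, int& t, int& e) {
    int g = std::gcd(k, k_prime - k);
    t = k / g;              // stretched stripes
    e = (k_prime - k) / g;  // decomposed stripes
}

int main() {
    int t, e;
    smallest_tgroup(3, 5, t, e);            // the (3,2,5)-transitioning
    std::printf("t = %d, e = %d\n", t, e);  // t = 3, e = 2
}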

3.2 Relocation Scheduling

In view of the bandwidth diversity in data centers, we can formulate the relocation problem as a maximum flow problem [33] and thereby minimize the cross-cluster data transmission during relocation. Specifically, given a Tgroup and a data center with \(r\) clusters (\(r\gt 1\)), ClusterRT establishes a network over \(2+t+r+e\) vertices, with a source, a sink, \(t\) stripe vertices \(\lbrace S_{1}, S_{2},\ldots , S_{t}\rbrace\) representing the stretched stripes, \(r\) cluster vertices \(\lbrace L_{1}, L_{2},\ldots , L_{r}\rbrace\), and another \(e\) stripe vertices \(\lbrace \overline{S}_{1}, \overline{S}_{2},\ldots , \overline{S}_{e}\rbrace\) representing the decomposed stripes. ClusterRT then adds the edges and the associated capacities over the network based on the following rules:
It first connects each decomposed stripe \(\overline{S}_{i}\) (where \(1\le i\le e\)) to the source with the capacity of \(k\), indicating that each decomposed stripe will supply at most \(k\) data chunks in the transitioning. Hence, we get \(e\) edges in this phase.
ClusterRT then adds an edge between a decomposed stripe \(\overline{S}_{i}\) (where \(1\le i\le e\)) and a cluster \(L_{j}\) (where \(1\le j\le r\)) with the capacity of \(c_{i,j}\) (where \(1\le c_{i,j}\le m\)), where \(c_{i,j}\) denotes the number of data chunks that the decomposed stripe \(\overline{S}_{i}\) stores in the cluster \(L_j\). For example, in Figure 5, since the decomposed stripe \(\overline{S}_1\) stores one data chunk (i.e., \(D_{10}\)) in \(L_2\) and two data chunks (i.e., \(D_{11}\) and \(D_{12}\)) in \(L_4\), we connect \(\overline{S}_1\) with the clusters \(L_2\) and \(L_4\) with the capacities of \(c_{1,2}=1\) and \(c_{1,4}=2\), respectively. Hence, we generate \(e\cdot r\) edges in this phase.
Fig. 5. Example of the stripe organization before transitioning in (3,2,5)-transitioning (i.e., \(k=3\), \(m=2\), and \(k^{\prime }=5\)).
ClusterRT further needs to identify the data chunks (from the decomposed stripes) that a cluster can supply in the transitioning without introducing any cross-cluster data transfer. It connects the cluster \(L_{j}\) (where \(1\le j\le r\)) with the stretched stripe \(S_{h}\) (where \(1\le h\le t\)) with the capacity of \(c_{j,h}\) once \(L_{j}\) can potentially supply \(c_{j,h}\) data chunks to complement \(S_{h}\) without any cross-cluster data transfer (where \(c_{j,h}\gt 0\)). For example, in Figure 5, we connect \(L_{2}\) with \(S_{1}\) and \(S_{2}\), since the data chunk that \(L_{2}\) can supply (i.e., \(D_{10}\)) can be directly added to either \(S_{1}\) or \(S_{2}\) without performing cross-cluster transmission. Consequently, we have \(t\cdot r\) edges in this phase.
ClusterRT finally connects all the stretched stripes to the sink with the capacity of \(k^{\prime }-k\), indicating that each stretched stripe needs an additional \(k^{\prime }-k\) data chunks to accomplish the redundancy transitioning (see Figure 6). Therefore, we get \(t\) edges in this phase.
Fig. 6. Example of establishing a maximum flow in (3,2,5)-transitioning (i.e., \(k=3\), \(m=2\), and \(k^{\prime }=5\)).
Given the network, we can establish a maximum flow based on the edges and corresponding capacities, which indicates the assignment of the data chunks from the decomposed stripes to the stretched stripes, so as to minimize the cross-cluster relocation traffic. The graph comprises \(2+t+e+r\) vertices and \((r+1)\cdot (e+t)\) edges in total. Hence, the computational complexity is \(O((t+e+r)^2\cdot r\cdot (e+t))\) if using Dinic’s algorithm [17] to establish the maximum flow.
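The sketch below (illustrative, not the prototype's code) builds this network following the four rules above and computes the maximum flow with a simple augmenting-path solver; the prototype's choice of Dinic's algorithm only changes the running time, not the resulting flow. The placement matrices in main are hypothetical:

#include <algorithm>
#include <cstdio>
#include <queue>
#include <vector>

struct MaxFlow {
    struct Edge { int to, cap, rev; };
    std::vector<std::vector<Edge>> g;
    explicit MaxFlow(int n) : g(n) {}
    void add_edge(int u, int v, int cap) {
        g[u].push_back({v, cap, (int)g[v].size()});
        g[v].push_back({u, 0, (int)g[u].size() - 1});
    }
    int augment(int s, int t) {  // one BFS augmentation (Edmonds-Karp)
        std::vector<int> prev_v(g.size(), -1), prev_e(g.size(), -1);
        std::queue<int> q;
        q.push(s);
        prev_v[s] = s;
        while (!q.empty() && prev_v[t] == -1) {
            int u = q.front(); q.pop();
            for (int i = 0; i < (int)g[u].size(); ++i) {
                const Edge& e = g[u][i];
                if (e.cap > 0 && prev_v[e.to] == -1) { prev_v[e.to] = u; prev_e[e.to] = i; q.push(e.to); }
            }
        }
        if (prev_v[t] == -1) return 0;
        int f = 1 << 30;
        for (int v = t; v != s; v = prev_v[v]) f = std::min(f, g[prev_v[v]][prev_e[v]].cap);
        for (int v = t; v != s; v = prev_v[v]) {
            Edge& e = g[prev_v[v]][prev_e[v]];
            e.cap -= f;
            g[e.to][e.rev].cap += f;
        }
        return f;
    }
    int run(int s, int t) { int total = 0, f; while ((f = augment(s, t)) > 0) total += f; return total; }
};

// c[i][j]: data chunks of decomposed stripe i stored in cluster j (rule 2).
// s[j][h]: data chunks cluster j can supply to stretched stripe h without cross-cluster transfer (rule 3).
int relocation_max_flow(int k, int k_prime, int e, int t, int r,
                        const std::vector<std::vector<int>>& c,
                        const std::vector<std::vector<int>>& s) {
    const int src = 0, snk = 1;
    auto dec = [&](int i) { return 2 + i; };          // decomposed stripe vertices
    auto clu = [&](int j) { return 2 + e + j; };      // cluster vertices
    auto str = [&](int h) { return 2 + e + r + h; };  // stretched stripe vertices
    MaxFlow mf(2 + e + r + t);
    for (int i = 0; i < e; ++i) mf.add_edge(src, dec(i), k);                   // rule 1
    for (int i = 0; i < e; ++i)
        for (int j = 0; j < r; ++j)
            if (c[i][j] > 0) mf.add_edge(dec(i), clu(j), c[i][j]);             // rule 2
    for (int j = 0; j < r; ++j)
        for (int h = 0; h < t; ++h)
            if (s[j][h] > 0) mf.add_edge(clu(j), str(h), s[j][h]);             // rule 3
    for (int h = 0; h < t; ++h) mf.add_edge(str(h), snk, k_prime - k);         // rule 4
    return mf.run(src, snk);    // chunks that can be relocated without cross-cluster transfer
}

int main() {
    // Hypothetical (3,2,5)-transitioning layout: e = 2 decomposed and t = 3 stretched stripes
    // over r = 5 clusters (not the exact placement of Figure 5).
    std::vector<std::vector<int>> c = {{0, 1, 0, 2, 0},
                                       {0, 0, 2, 0, 1}};
    std::vector<std::vector<int>> s = {{0, 0, 0}, {1, 1, 0}, {0, 2, 2}, {2, 0, 2}, {1, 1, 1}};
    int flow = relocation_max_flow(3, 5, 2, 3, 5, c, s);
    std::printf("%d of %d required chunks relocated without cross-cluster transfer\n", flow, 3 * (5 - 3));
}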
Example: Figure 6 shows the resulting maximum flow established based on the stripe placement in Figure 5. Figure 7 further depicts the data relocation guided by the maximum flow of Figure 6 and the resulting placement of the stretched stripes (i.e., \(\lbrace S_{1}, S_{2}, S_{3}\rbrace\)). For example, based on the assignment (marked with the thick arrows) in Figure 6, the cluster \(L_{3}\) will supply two chunks (i.e., \(D_{13}\) and \(D_{14}\)) from the decomposed stripe \(\overline{S}_{2}\) to stretch the stripes \(S_{3}\) and \(S_{2}\), respectively, without any cross-cluster transmission. We can see that ClusterRT does not induce any cross-cluster relocation traffic in this transitioning. We also perform large-scale simulations and demonstrate that ClusterRT can eliminate the cross-cluster relocation traffic in most redundancy transitioning operations (see Experiment A.2 in Section 5.1).
Fig. 7. Example of the stripe organization after transitioning in (3,2,5)-transitioning (i.e., \(k=3\), \(m=2\), and \(k^{\prime }=5\)).

3.3 Parity-coordinated Updates

After establishing the data relocation strategy, we must promptly update the parity chunks of the stretched stripes to preserve the reliability of the newly added data chunks. We first deduce the minimum parity update traffic for a single stretched stripe and then explore the optimization across multiple stripes.
Parity update in a single stripe: We first investigate the impact of the parity placement on the cross-cluster update traffic. Lemma 1 proves that by gathering all the parity chunks of a stretched stripe within the same cluster, we can attain the minimum cross-cluster update traffic.
Lemma 1.
Without loss of generality, suppose that a stretched stripe \(S^{\prime }\) comprises \(m\) parity chunks that are distributed across \(y\) clusters with the parity distribution of \(\lbrace p_1, p_2, \ldots , p_y\rbrace\), satisfying that \(0\lt p_1\le p_2\le \cdots \le p_y\le m\) (for cluster-level fault tolerance) and \(\sum _{i=1}^{y}p_i=m\). The least cross-cluster update traffic is attained when \(y=1\) and \(p_1=m\) (i.e., the \(m\) parity chunks are gathered within a single cluster).
Proof.
Without loss of generality, suppose that the \(k^{\prime }-k\) newly added data chunks used to enlarge the stretched stripe \(S^{\prime }\) are also stored across \(x\) clusters (\(x\ge 1\)) with the data distribution of \(\lbrace d_1, d_2, \ldots , d_x\rbrace\), satisfying that \(0\lt d_1 \le d_2 \le \cdots \le d_x \le m\) (for cluster-level fault tolerance) and \(\sum _{i=1}^{x} d_i=k^{\prime }-k\). Hence, by using the selective parity update [11, 29], we can calculate the number of chunks transmitted across clusters for parity update in a single stripe. Specifically, for the cluster \(L_i\) that stores \(d_i\) data chunks and the cluster \(L_j\) that stores \(p_j\) parity chunks (where \(i\ne j\)), there are two options to update the \(p_j\) parity chunks based on the \(d_i\) data chunks: (1) if \(d_i\le p_j\), then the selective parity update will transmit the \(d_i\) data chunks from \(L_i\) to \(L_j\) for parity update, and (2) if \(d_i\gt p_j\), then the selective parity update will calculate \(p_j\) parity delta chunks (see Equation (1)) in the cluster \(L_i\) and transmit them across clusters to \(L_j\). Finally, the cross-cluster parity update traffic can be calculated as \(T_{\text{update_single}}=\sum _{i=1}^{x}\sum _{j=1}^{y}\min \lbrace d_i,p_j\rbrace\).
We now deduce the lower bound of \(\sum _{j=1}^{y}\min \lbrace d_i,p_j\rbrace\) for a given \(d_i\). Given the number of data chunks \(d_i\) in a cluster (where \(1\le i\le x\)), we can choose an integer \(l\) that partitions the parity distribution \(\lbrace p_1,p_2,\ldots ,p_y\rbrace\) (where \(0\lt p_1\le p_2\le \cdots \le p_y\le m\)), satisfying that \(p_j\lt d_i\) for \(1\le j\le l\) and \(p_j\ge d_i\) for \(l+1\le j\le y\). Hence, we can deduce that
\begin{equation} \begin{split}\sum _{j=1}^{y}\min \lbrace d_i,p_j\rbrace = & \underbrace{\sum _{j=1}^{l}\min \lbrace d_i,p_j\rbrace }_{\text{where }p_j\lt d_i} + \underbrace{\sum _{j=l+1}^{y}\min \lbrace d_i,p_j\rbrace }_{\text{where }p_j\ge d_i} \\ = & \sum _{j=1}^{l}p_j + \sum _{j=l+1}^{y} d_i. \end{split} \end{equation}
(3)
There are two possibilities. If \(l=y\), then we have \(\sum _{j=1}^{y}\min \lbrace d_i,p_j\rbrace = \sum _{j=1}^{l}p_j + \sum _{j=l+1}^{y} d_i\ge \sum _{j=1}^{y}p_j\). On the other hand, if \(0\le l\lt y\), then we can deduce that \(\sum _{j=1}^{y}\min \lbrace d_i,p_j\rbrace = \sum _{j=1}^{l}p_j + \sum _{j=l+1}^{y} d_i\ge d_i\). Hence, we can have
\begin{equation} \sum _{j=1}^{y}\min \lbrace d_i,p_j\rbrace = \sum _{j=1}^{l}p_j + \sum _{j=l+1}^{y} d_i \ge \min \lbrace d_i,\sum _{j=1}^{y}p_j\rbrace . \end{equation}
(4)
Since \(\sum _{j=1}^y p_j=m\) and \(d_i\le m\) for \(1\le i\le x\), we have \(T_{\text{update_single}}=\sum _{i=1}^{x}\sum _{j=1}^{y}\min \lbrace d_i,p_j\rbrace \ge \sum _{i=1}^{x} d_i=k^{\prime }-k\). Equality is attained when we simply place all the \(m\) parity chunks of the stretched stripe within the same cluster (i.e., \(y=1\)). □
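The cost expression in the proof can be evaluated directly, as sketched below; the data distribution assumes the two newly added chunks of Figure 8 reside in two different clusters, which is consistent with the four cross-cluster transfers reported for Figure 8(a):

#include <algorithm>
#include <cstdio>
#include <vector>

// Cross-cluster update traffic of a single stretched stripe under selective parity update:
// T = sum_i sum_j min{d_i, p_j}, with per-cluster data distribution d and parity distribution p.
int update_traffic(const std::vector<int>& d, const std::vector<int>& p) {
    int t = 0;
    for (int di : d)
        for (int pj : p)
            t += std::min(di, pj);
    return t;
}

int main() {
    std::vector<int> d = {1, 1};                                              // two new data chunks, two clusters
    std::printf("scattered parities {1,1}: %d chunks\n", update_traffic(d, {1, 1}));  // 4, as in Figure 8(a)
    std::printf("gathered parities  {2}:   %d chunks\n", update_traffic(d, {2}));     // 2, as in Figure 8(b)
}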
Example: Figure 8 shows an example of the parity update in a single stripe during the (3,2,5)-transitioning. In this example, two data chunks (i.e., \(\lbrace D_4, D_5\rbrace\)) are used to stretch the stripe. By dispersing the parity chunks of the stripe across two clusters (i.e., \(\lbrace L_1, L_5\rbrace\), see Figure 8(a)), we need to transmit the two data chunks (i.e., \(\lbrace D_4, D_5\rbrace\)) to both \(L_1\) and \(L_5\) for parity updates, transmitting four chunks across clusters in total. As a comparison, by gathering the parity chunks of the stripe within one cluster (i.e., \(L_1\), see Figure 8(b)), we only need to transmit the two data chunks (i.e., \(\lbrace D_4, D_5\rbrace\)) to \(L_1\) for parity updates. This implies that gathering the \(m\) parity chunks in a single cluster reduces the number of chunks transferred across clusters during parity updates (from four to two), since we only need to transmit each newly added chunk to the parity cluster (i.e., \(L_1\) in Figure 8(b)) to update all the stored parity chunks.
Fig. 8. Example of the single stripe parity update in (3,2,5)-transitioning.
In addition, the parity clustering also benefits the regular update path. By placing all parity chunks of a stripe in a single cluster, updating a data chunk of the stripe only needs to transmit a single data delta chunk to the parity cluster, which significantly reduces the cross-cluster parity update traffic compared to the approach that scatters the parity chunks across multiple clusters (see Experiment A.5 in Section 5.1).
Parity-independent update: Based on Lemma 1, we can derive a parity-independent update approach, which simply updates the parity chunks of each stretched stripe independently, using the associated \(k^{\prime }-k\) newly added data chunks. Hence, the number of data chunks transmitted across clusters by the parity-independent updates for a Tgroup can be calculated as
\begin{equation} T_{\text{update_idpdnt}} = t\cdot T_{\text{update_single}}=t\cdot (k^{\prime }-k). \end{equation}
(5)
Parity-coordinated update: To further reduce the cross-cluster parity update traffic in the multi-stripe setting, ClusterRT designs a parity-coordinated update approach. The main idea is to (1) gather all the parity chunks of the \(t+e\) stripes of the same Tgroup in a cluster (called the parity cluster of this Tgroup) and (2) exploit the encoding dependency between the data and parity chunks of the same stripe. By doing so, we only need to transmit \(k-m\) data chunks for each decomposed stripe, hence reducing the number of data chunks transmitted across clusters for the parity update. Algorithm 1 describes the procedure of the parity-coordinated updates.
Given a Tgroup, we first identify the parity cluster \(L^*\) that stores all the parity chunks of this Tgroup (Line 2). For each decomposed stripe, we transmit the first \(k-m\) data chunks to \(L^*\) without loss of generality (Line 5). The \(k-m\) data chunks of the stripe \(\overline{S}_i\), together with the associated \(m\) parity chunks in \(L^*\), can be used to decode the original \(k\) data chunks (Line 6), which are added to the set of data chunks (denoted by \(\mathbb {D}\)) decoupled from the decomposed stripes (Line 7). For each stretched stripe, ClusterRT fetches the corresponding \(k^{\prime }-k\) newly added data chunks as guided by the relocation scheduling (Section 3.2) and then updates the associated \(m\) parity chunks (Lines 10 to 15). The sketch below mirrors these steps.
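A minimal sketch of the resulting transfer plan (illustrative, not the prototype's code); only the first \(k-m\) data chunks of each decomposed stripe cross clusters, and the cluster ids below are hypothetical:

#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

struct Transfer { std::string chunk; int from_cluster; int to_cluster; };

// data_location[i][j]: cluster holding the j-th data chunk of the i-th decomposed stripe.
std::vector<Transfer> coordinated_update_transfers(int k, int m, int parity_cluster,
                                                   const std::vector<std::vector<int>>& data_location) {
    std::vector<Transfer> plan;
    for (std::size_t i = 0; i < data_location.size(); ++i)
        for (int j = 0; j < k - m; ++j)  // ship only the first k - m data chunks of each decomposed stripe
            plan.push_back({"D(" + std::to_string(i) + "," + std::to_string(j) + ")",
                            data_location[i][j], parity_cluster});
    return plan;  // the remaining m data chunks per stripe are decoded inside the parity cluster
}

int main() {
    // (3,2,5)-transitioning with e = 2 decomposed stripes; cluster ids are hypothetical.
    std::vector<std::vector<int>> data_location = {{2, 4, 4}, {3, 3, 5}};
    for (const Transfer& tr : coordinated_update_transfers(3, 2, 1, data_location))
        std::printf("%s: cluster %d -> cluster %d\n", tr.chunk.c_str(), tr.from_cluster, tr.to_cluster);
    // Two cross-cluster transfers in total, matching e * (k - m) = 2.
}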
Example: Figure 9 shows an example of the parity-coordinated update approach in the \((3,2,5)\)-transitioning. We use the decomposed stripe \(\overline{S}_1\) as an example. We transmit a chunk (i.e., \(D_{10}\)) from the cluster \(L_2\) to \(L_1\) (i.e., \(L^*\) here, Step ❶). At this point, \(L_1\) holds three (i.e., \(k\)) chunks of \(\overline{S}_1\), and we can perform data decoding (Step ❷) to recover the three original data chunks (i.e., \(D_{10}\), \(D_{11}\), and \(D_{12}\)). As these data chunks are added to three stretched stripes (i.e., \(S_1\), \(S_2\), and \(S_3\)), they will then be used to update the associated parity chunks (Step ❸). Consequently, we only need to transmit two data chunks to accomplish the parity update.
Fig. 9. Example of the parity-coordinated update in (3,2,5)-transitioning (i.e., \(k=3\), \(m=2\), and \(k^{\prime }=5\)).
We can also have the following corollary, which provides a quantitative comparison between the parity-independent update and the parity-coordinated update approaches.
Corollary 1.
The parity-coordinated update needs less cross-cluster update traffic than the parity-independent update.
Proof.
Since each decomposed stripe sends \(k-m\) data chunks to the cluster \(L^*\) for data decoding, the parity-coordinated updates should transmit \(T_{\text{update_cordnt}}=e\cdot (k-m)\) data chunks across clusters for a Tgroup in total. Hence, based on Equation (2) and Equation (5), we have
\begin{equation} \frac{T_{\text{update_cordnt}}}{T_{\text{update_idpdnt}}} = \frac{e\cdot (k-m)}{t\cdot (k^{\prime }-k)}=\frac{k-m}{k} \lt 1. \end{equation}
(6)
The corollary holds. □
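As a concrete instance, in the \((3,2,5)\)-transitioning of Figure 9 (with \(t=3\) stretched and \(e=2\) decomposed stripes), the parity-independent update transmits \(t\cdot (k^{\prime }-k)=3\cdot 2=6\) chunks across clusters, whereas the parity-coordinated update transmits only \(e\cdot (k-m)=2\cdot 1=2\) chunks, matching the ratio \(\frac{k-m}{k}=\frac{1}{3}\).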
Parity rotation for load balancing: Since the parity cluster receives (downloads) all the data chunks transmitted for the parity update, it becomes the heaviest cluster, bearing the most cross-cluster transitioning traffic. Our breakdown analysis (Experiment A.2) also shows that ClusterRT does not introduce any cross-cluster data relocation traffic in most transitioning operations (see Figure 13), so the cross-cluster parity update traffic dominates the redundancy transitioning traffic. Hence, ClusterRT gradually rotates the identity of the parity cluster across Tgroups, so as to evenly disperse the cross-cluster download traffic across the whole data center. As the parity chunks are not relocated during the transitioning, this rotation can effectively balance the cross-cluster transitioning traffic across multiple transitioning operations. Figure 10 shows an example with three Tgroups. We rotate the parity clusters, ensuring that each cluster stores the same number of parity chunks.
Fig. 10. Example of the parity rotation for load balancing with a \((3,2)\) code.
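A minimal sketch of the rotation, assuming a simple round-robin assignment of parity clusters to Tgroups (one policy consistent with Figure 10, not necessarily the prototype's exact one) and the \((3,2,5)\)-transitioning's per-Tgroup download of \(e\cdot (k-m)\) chunks:

#include <cstdio>
#include <vector>

int parity_cluster_of(int tgroup_id, int num_clusters) { return tgroup_id % num_clusters; }

int main() {
    const int num_clusters = 3, num_tgroups = 3;   // three Tgroups, as in Figure 10
    const int k = 3, m = 2, e = 2;                 // (3,2,5)-transitioning with e = 2 decomposed stripes
    std::vector<int> downloaded_chunks(num_clusters, 0);
    for (int g = 0; g < num_tgroups; ++g)
        downloaded_chunks[parity_cluster_of(g, num_clusters)] += e * (k - m);  // download at this Tgroup's L*
    for (int c = 0; c < num_clusters; ++c)
        std::printf("cluster %d downloads %d chunks\n", c, downloaded_chunks[c]);  // 2 each: balanced
}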

3.4 Discussions

Extension for LRC codes: We define locally repairable codes (LRCs) using the notation LRC\((k,l,g)\), which specifies a coding scheme that encodes \(k\) data chunks, denoted as \(\lbrace D_1, D_2, \ldots , D_k\rbrace\), into \(l\) local parity chunks \(\lbrace O_1, O_2, \ldots , O_l\rbrace\) and \(g\) global parity chunks \(\lbrace G_1, G_2, \ldots , G_g\rbrace\). This process generates a stripe comprising \(k + l + g\) data and parity chunks, which are then distributed across \(k+l+g\) nodes. We mainly consider the redundancy transitioning for LRCs that stretches the stripe by increasing the value of \(k\) to reduce the storage overhead (i.e., increasing \(k\) to \(k^{\prime }\), where \(k^{\prime }\gt k\)), without affecting the number of tolerable failures (i.e., keeping \(l\) and \(g\) unchanged). Similarly, we also consider the single-cluster fault tolerance, which indicates that we can place no more than \(g+i\) chunks that span \(i\) local groups into a single cluster [40], where \(1\le i \le l\). Therefore, we can place all parity chunks (i.e., the \(l\) local parity chunks and the \(g\) global parity chunks) in the same cluster to reduce the cross-cluster parity update traffic during redundancy transitioning (see Section 3.3). In addition, we can also formulate the relocation problem as a maximum flow problem. With Reed–Solomon codes, ClusterRT establishes the maximum flow based on the number of chunks of each stripe in each cluster (see Section 3.2). Similarly, under the encoding of LRCs, ClusterRT formulates the maximum flow by considering both the number of data chunks in each cluster and the number of local groups within each cluster. This methodology ensures that the stretched stripe satisfies the constraint that the number of data chunks in each cluster does not exceed \(g+i\).
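A minimal sketch of the data-chunk placement check implied by this constraint (illustrative; the grouping input is hypothetical):

#include <cstdio>
#include <set>
#include <vector>

// local_group_of[c] is the local group id of the c-th data chunk a cluster holds from one stripe.
// The cluster is valid if it holds at most g + i data chunks, where i is the number of groups spanned.
bool lrc_data_placement_ok(const std::vector<int>& local_group_of, int g) {
    std::set<int> spanned(local_group_of.begin(), local_group_of.end());
    return local_group_of.size() <= g + spanned.size();
}

int main() {
    // With g = 2: three chunks from one local group are allowed (3 <= 2 + 1), four are not.
    std::printf("%s\n", lrc_data_placement_ok({0, 0, 0}, 2) ? "ok" : "violates g+i");
    std::printf("%s\n", lrc_data_placement_ok({0, 0, 0, 0}, 2) ? "ok" : "violates g+i");
}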
Integration with Hadoop HDFS: When integrating ClusterRT into Hadoop HDFS, we can deploy the ClusterRT coordinator in the NameNode and run the ClusterRT agents in the DataNodes. Upon receiving a transitioning request from the client, the ClusterRT coordinator within the NameNode generates a redundancy transitioning scheme based on the metadata information provided by the NameNode (e.g., the stripe placement) and dispatches the scheme to each DataNode. Subsequently, upon receiving the transitioning scheme, the ClusterRT agents within the DataNodes update the parity chunks of the stretched stripes and remove the parity chunks of the decomposed stripes. After the transitioning completes, the ClusterRT coordinator updates the metadata of the stripes.

4 Implementation

We implement a ClusterRT prototype in C++ with around 2,000 lines of code (LoC). We use Jerasure [27] to realize the encoding and decoding functionalities over the Galois Field and employ Linux raw socket interfaces to realize data transmissions across nodes.
System architecture: Figure 11 depicts the system architecture of the ClusterRT prototype, which comprises a centralized coordinator and multiple agents. The coordinator sits in the metadata server and is in charge of generating transitioning decisions based on the data distributions and transitioning parameters (i.e., \(k\), \(m\), and \(k^{\prime }\)) and instructing the redundancy transitioning with the metadata information (e.g., the locations of the data and parity chunks of each stripe). The transitioning decision can be represented by a customized data structure, which specifies the chunks to be transmitted as well as their destination nodes. The agent listens for transitioning decisions from the coordinator; once a transitioning decision is received, it reads the requested chunks, relocates them to the specified destination nodes, and encodes the newly added data chunks to update the associated parity chunks.
Fig. 11. System architecture of ClusterRT.
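For illustration, a transitioning decision could be represented roughly as follows; the field names and the address format are assumptions, not the prototype's actual data structure:

#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

enum class ChunkRole { RelocatedData, ParityUpdateInput };

struct ChunkTransfer {
    uint64_t stripe_id;
    uint64_t chunk_id;
    std::string destination_node;   // e.g., "10.0.3.17:9000" (hypothetical address format)
    ChunkRole role;
};

struct TransitioningDecision {
    uint32_t k, m, k_prime;                 // transitioning parameters
    std::vector<ChunkTransfer> transfers;   // work assigned to one agent
};

int main() {
    TransitioningDecision d{6, 3, 9, {{42, 7, "10.0.3.17:9000", ChunkRole::RelocatedData}}};
    std::printf("decision with %zu transfer(s), k: %u -> %u\n", d.transfers.size(), d.k, d.k_prime);
}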
Operation flow: We elaborate on the transitioning procedure in ClusterRT. When a transitioning request reaches the metadata server, the coordinator first generates the transitioning decisions, which are then dispatched to the corresponding agents of the nodes participating in the transitioning (Step ❶). After receiving a transitioning decision, each agent parses the decision to understand its role in the transitioning, including which chunks should be migrated in the data relocation and which should be sent for parity chunk updates (Step ❷), and then carries out the assigned relocation and parity update tasks (Step ❸). An agent notifies the coordinator of the completion of its tasks by returning an ACK response (Step ❹). The coordinator learns that the transitioning operation has completed once all the ACKs from the participating nodes are collected.

5 Evaluation

We conduct both numerical simulation and Alibaba Cloud ECS experiments to measure the transitioning traffic and the transitioning time of ClusterRT. We summarize our major findings as follows: compared to the state-of-the-art transitioning approaches [32, 39], (1) ClusterRT reduces the transitioning traffic by 94.0% to 96.2% and effectively balances the transitioning traffic (Section 5.1); (2) ClusterRT shortens the transitioning time by 70.4% to 88.4% in successive redundancy transitioning (Section 5.2).

5.1 Numerical Simulation

Comparison approaches: We compare ClusterRT to two state-of-the-art approaches: SRS [32] and ERS [39], both of which aim to accelerate the redundancy transitioning for RS codes. We elaborate on the main ideas of SRS and ERS below:
SRS [32]: It establishes the distributions of stripes based on the pre-fixed coding parameters \((k,m, k^{\prime })\) and eliminates the data relocation for the first transitioning operation.
ERS [39]: It also eliminates the data relocation for the first transitioning operation and reduces parity update traffic via enlarging the encoding matrix in advance.
Experimental setup: We first carry out numerical simulation with the network transmission and storage functionalities disabled. We mainly consider general erasure coding configurations: we deploy a \((k,m)\) code onto a storage system and increase the value of \(k\), while keeping the value of \(m\) constant. We also adopt the following default configurations unless otherwise specified. We set the number of stripes to 100,000 and distribute them across 40 clusters. We set the chunk size to 64 MB. We then measure the average transitioning traffic induced to stretch a stripe in a transitioning operation.
Experiment A.1 (reduction in the cross-cluster transitioning traffic): We first evaluate the cross-cluster transitioning traffic generated by each approach under successive transitioning. We perform 15 transitioning operations by increasing the value of \(k\) from 6 to 96 and measure the cross-cluster transitioning traffic of the three approaches. Figure 12 shows that ClusterRT greatly reduces the cross-cluster transitioning traffic compared with SRS and ERS.
Fig. 12. Experiment A.1 (reduction in the cross-cluster transitioning traffic).
As shown in Figure 12, ClusterRT maintains a small cross-cluster transitioning traffic throughout the successive transitioning operations, while SRS and ERS only achieve a small cross-cluster transitioning traffic in the first transitioning operation. This is mainly because both SRS and ERS prepare an enlarged stripe layout before the first transitioning, thus eliminating the data chunk relocation traffic during the first transitioning. In the successive transitioning operations, however, SRS and ERS have to relocate chunks across clusters to adjust the stripe layout, thus introducing additional traffic. Furthermore, compared with SRS, ERS reduces the traffic of parity chunk updates through its coding algorithm.
As a comparison, ClusterRT utilizes the maximum flow algorithm to guide data chunk relocation and minimize cross-cluster relocation traffic. Additionally, to minimize the cross-cluster update traffic, ClusterRT places the parity chunks of decomposed stripes and stretched stripes within the same cluster in advance, which reduces the number of data chunks to be read across clusters. Consequently, compared to ERS and SRS, ClusterRT reduces the transitioning traffic by 94.0% and 96.2% on average, respectively.
Experiment A.2 (breakdown analysis of cross-cluster transitioning traffic): We then break down the cross-cluster transitioning traffic to understand the impact of the different operations performed in redundancy transitioning. From Figure 13, we make two observations. First, ClusterRT eliminates the cross-cluster relocation traffic across all redundancy transitioning operations by establishing maximum flows to allocate data chunks. Second, compared to ERS and SRS, ClusterRT always incurs lower parity update traffic, with reductions of up to 75.0% and 96.1%, respectively, indicating that the parity-coordinated update approach can effectively keep the parity update traffic low. In summary, the two techniques in ClusterRT are both effective and complementary, without compromising the effectiveness of each other.
Fig. 13. Experiment A.2 (breakdown analysis of cross-cluster transitioning traffic).
Experiment A.3 (impact of number of clusters): We increase the number of clusters from 10 to 50 and measure the cross-cluster transitioning traffic under different erasure coding paradigms.
Figure 14 shows that the cross-cluster transitioning traffic of ClusterRT stays around 0.1 as the number of clusters changes. The reason is that ClusterRT can eliminate cross-cluster data chunk relocation using the maximum flow algorithm in the vast majority of cases, and the parity-coordinated update approach only requires a few data chunks to be transmitted across clusters. Thus, ClusterRT keeps a relatively low cross-cluster transitioning traffic during successive redundancy transitioning. On the contrary, the traffic of ERS and SRS stays high because the cost of layout adjustment increases with the number of clusters. To conclude, ClusterRT achieves up to 97.6% and 97.5% lower cross-cluster traffic compared with ERS and SRS, respectively.
Fig. 14. Experiment A.3 (impact of number of clusters).
Experiment A.4 (load balance): We evaluate the load balancing ratio of the three approaches. Figure 15 shows that ClusterRT effectively balances the transitioning traffic loaded across clusters, since it adopts a download-balanced layout in which each cluster downloads the same number of data chunks. Statistically, the average load balancing ratio of both ERS and SRS is 2.4 across all the successive transitioning operations, while ClusterRT achieves a load balancing ratio of 1, which is optimal.
Fig. 15. Experiment A.4 (load balancing ratio).
Experiment A.5 (update performance): We study the load balancing ratio and cross-cluster update traffic for two data placement approaches: ClusterRT’s placement (see Section 3.3) and the random placement. This experiment aims to validate that ClusterRT’s strategy of placing all parity chunks of a stripe within the same cluster does not degrade performance in normal operation. We use YCSB [9] to generate workloads, which consist of 1 million transactions with a read/update ratio of 50:50. These workloads include both Zipfian (\(\alpha =0.99\)) and Uniform distributions to cover a diverse range of access patterns.
Figures 16(a) and 16(b) show that both ClusterRT and the random placement exhibit similar load balancing ratios across all erasure coding parameters, regardless of whether the workload follows a Zipfian or Uniform distribution. This indicates that ClusterRT’s parity placement does not concentrate the update traffic in any specific cluster. Moreover, by placing all parity chunks of a stripe within a cluster, ClusterRT can effectively reduce the cross-cluster update traffic, because ClusterRT only needs to transmit the data delta chunks across clusters to the parity cluster, where the parity chunks can be updated internally. Figures 16(c) and 16(d) show that compared to the random placement, ClusterRT achieves an average reduction of 64.2% in cross-cluster traffic for update operations.
Fig. 16. Experiment A.5 (update performance).
Experiment A.6 (performance for LRC codes): We measure the cross-cluster transitioning traffic generated by ClusterRT and the random placement method for LRCs, where the random placement distributes the chunks of every stripe at random under the requirement of cluster-level fault tolerance. We perform 12 transitioning operations by increasing the value of \(k\) from 4 to 60 while keeping the values of \(l\) and \(g\) unchanged, and measure the resulting cross-cluster transitioning traffic.
Figure 17 shows that ClusterRT greatly reduces the cross-cluster transitioning traffic compared with the random placement, introducing only a small amount of cross-cluster transitioning traffic during successive transitioning operations. This is because ClusterRT employs the maximum flow to minimize the cross-cluster relocation traffic. Furthermore, by placing both global and local parity chunks of the same transitioning group within the same cluster, ClusterRT significantly reduces the cross-cluster traffic when updating parity chunks. Overall, compared to the random placement, ClusterRT reduces the cross-cluster traffic by 77.2% on average during transitioning.
Fig. 17. Experiment A.6 (reduction in the cross-cluster transitioning traffic with LRC codes).
Experiment A.7 (comparison with convertible codes): We finally measure the cross-cluster transitioning traffic introduced by ClusterRT and convertible codes [22, 23]. We realize stripe merging to meet the conversion requirements of convertible codes by setting the parameter \(k^{\prime }\) in the \((k, m, k^{\prime })\)-transitioning to be a multiple of \(k\). Figure 18 shows the cross-cluster transitioning traffic of ClusterRT and convertible codes under different \(k\) and \(k^{\prime }\). We observe that ClusterRT incurs lower cross-cluster transitioning traffic than convertible codes, mainly because it uses the maximum flow algorithm to reduce the cross-cluster relocation traffic. Besides, ClusterRT pre-places the parity chunks of decomposed stripes and stretched stripes within the same cluster to reduce the cross-cluster update traffic. Moreover, convertible codes only optimize the access costs of the parity update, without considering the optimization of cross-cluster network traffic and data relocation. In summary, ClusterRT reduces the cross-cluster transitioning traffic by 32.4% to 66.1% compared to convertible codes.
Fig. 18. Experiment A.7 (compared with convertible codes).

5.2 Testbed Experiments

We conduct further evaluation of ClusterRT on Alibaba Cloud ECS [4] to uncover its performance in a real-world cloud data center. Specifically, we allocate 19 virtual machine instances (of type ecs.g7.large), each of which is equipped with 2 vCPUs (2.7 GHz 3rd Generation Intel Xeon Scalable processors) and 8 GB of memory. The operating system is Ubuntu 18.04, and we measure via iperf that the network bandwidth between any two instances is around 10 Gb/s.
Experimental setup: We deploy the ClusterRT coordinator in one instance acting as the metadata server. We choose RS(6,3) as the default erasure coding paradigm and use the other 18 instances as storage nodes running ClusterRT agents. We also adopt the following default configurations unless otherwise specified. The chunk size is set to 64 MB and the value of \(k\) is successively increased from 6 to 15 according to the state-of-the-art erasure coding deployments [13]. Before the experiment, we deploy 300 stripes, and each cluster contains three nodes. We then use the Linux tool tc to throttle the cross-cluster network bandwidth between any two clusters from 1 Gb/s to 2 Gb/s (the cross-cluster network bandwidth is set to 1 Gb/s by default) and measure the average transitioning time per stripe (defined as the time needed to stretch a stripe on average). We repeat each experiment for five runs and plot the average results, with error bars indicating the maximum and minimum values across the runs (some error bars may be invisible as they are very small).
Experiment B.1 (transitioning time): We first measure the transitioning time when the value of \(k\) increases from 6 to 15. We fix the intra-cluster network bandwidth to 10 Gb/s and the cross-cluster network bandwidth to 1 Gb/s (the ratio of the intra-cluster transfer speed to the cross-cluster transfer speed is around 10:1). Figure 19 shows the results.
Fig. 19. Experiment B.1 (transitioning time).
Figure 19 shows that ClusterRT has a stable and short transitioning time, while the transitioning time of SRS and ERS shows an upward trend as \(k\) increases. First of all, SRS always incurs the longest transitioning time in the successive transitioning operations. This is mainly because SRS incurs more data chunk relocation traffic to adjust the layout and needs to read more data chunks for the parity chunk update. Second, we find that ClusterRT requires a shorter transitioning time than ERS even in the first transitioning operation, and a much shorter one in the successive transitioning operations. The main reason is that ClusterRT does not need to relocate a large number of data chunks across clusters to adjust the layout as ERS does. Meanwhile, ClusterRT aggregates the parity chunks in the same cluster, thus reducing the cross-cluster reads of data chunks during parity updates. In addition, ClusterRT further reduces the tail latency by balancing the download traffic of each cluster. Compared to ERS and SRS, ClusterRT reduces the transitioning time by 70.4% and 88.4% on average, respectively.
Experiment B.2 (computation time): We run the experiment in an instance on Alibaba Cloud ECS and study the computation time (i.e., the total time of generating the transitioning solutions) by increasing the number of stripes. Figure 20 shows the results.
Fig. 20. Experiment B.2 (computation time).
We find that the computation time of ClusterRT to generate the transitioning scheme is always marginal and negligible compared with the transmission time of the transitioning operation.
Experiment B.3 (impact of chunk size): We also study the transitioning time under different chunk sizes, which are varied from 16 MB to 64 MB. Here, we set the cross-cluster network bandwidth to 1 Gb/s. Figure 21 shows that the transitioning time increases with the chunk size. We can observe that as the chunk size increases, ClusterRT saves more transitioning time compared to SRS and ERS. This improvement can be attributed to the limited cross-cluster network bandwidth, combined with the extensive cross-cluster chunk transfers introduced by ERS and SRS during redundancy transitioning; larger chunk sizes exacerbate these limitations, hence prolonging the transitioning times. Statistically, ClusterRT reduces the transitioning time by 33.3% to 88.9% and 73.3% to 93.2% compared to ERS and SRS, respectively.
Fig. 21. Experiment B.3 (impact of chunk size).
Experiment B.4 (impact of cross-cluster network bandwidth): We finally investigate the impact of the cross-cluster network bandwidth, which is increased from 1 Gb/s to 10 Gb/s by using tc. Figure 22 shows that the transitioning time decreases as the cross-cluster network bandwidth increases. Besides, ClusterRT outperforms both ERS and SRS under different cross-cluster network bandwidths. When the cross-cluster network bandwidth is more stringent, ClusterRT performs even better, as it reduces the cross-cluster transitioning traffic and balances the transitioning traffic. We observe that as the cross-cluster network bandwidth increases, the reduction in redundancy transitioning time gained by ClusterRT over ERS and SRS becomes less significant in most cases. However, in some specific cases (such as the (12,3,15)-transitioning with a cross-cluster network bandwidth of 10 Gb/s), ClusterRT exhibits a more significant reduction in redundancy transitioning time compared to ERS and SRS. This is because the performance bottleneck of ERS and SRS shifts to the intra-cluster network bandwidth, which limits the benefit they obtain from the higher cross-cluster bandwidth. To summarize, ClusterRT reduces the transitioning times of ERS and SRS by 7.6% to 88.9% and 73.3% to 93.8%, respectively.
Fig. 22. Experiment B.4 (impact of cross-cluster network bandwidth).
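The trends in Experiments B.3 and B.4 follow a simple back-of-envelope model: the cross-cluster phase of a transitioning operation takes roughly (number of cross-cluster chunks) × (chunk size) / (cross-cluster bandwidth). The sketch below only illustrates this intuition; the per-scheme cross-cluster chunk counts are assumed values for demonstration, not measurements from the evaluation above.
```python
# Back-of-envelope estimate of the cross-cluster transfer time.
# The per-scheme cross-cluster chunk counts are assumed for illustration only.
def cross_cluster_time(num_cross_chunks, chunk_size_mb, bandwidth_gbps):
    """Seconds spent on cross-cluster transfers, ignoring intra-cluster I/O and computation."""
    bits_moved = num_cross_chunks * chunk_size_mb * 2**20 * 8
    return bits_moved / (bandwidth_gbps * 1e9)

assumed_cross_chunks = {"SRS": 20, "ERS": 8, "ClusterRT": 1}  # hypothetical per-stripe counts
for scheme, chunks in assumed_cross_chunks.items():
    for size_mb in (16, 64):
        for bw in (1.0, 10.0):
            t = cross_cluster_time(chunks, size_mb, bw)
            print(f"{scheme:10s} chunk={size_mb:2d}MB bw={bw:4.0f}Gb/s  ~{t:6.2f}s")
```
Under this model, doubling the chunk size or halving the cross-cluster bandwidth doubles the transfer term, so schemes that move more chunks across clusters see their transitioning time grow fastest, which matches the widening gaps in Figures 21 and 22.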

6 Related Work

Storage scaling: Storage scaling enlarges the stripes (i.e., more chunks are added to a stripe) when the system scales out (i.e., new nodes are added to the system). SLAS [44], FastScale [47], ALV [45], SDM [37], and H-Scale [34] reduce the data transfers for RAID scaling, yet they are unworkable for systems that must tolerate more than two failures. Being designed for RS codes, Scale-RS [16] minimizes data transfer and achieves a uniform data distribution after scaling. NCScale [46] attains optimal (or near-optimal) scaling traffic based on network coding. ECHash [7] eliminates the parity update by employing data fragmentation and cross-coding, yet it still needs to relocate data chunks onto new nodes. However, these approaches either incur tremendous traffic in successive storage scaling or cannot fully leverage the available bandwidth resources.
Redundancy transitioning: Some studies consider the transitioning among erasure codes in distributed storage systems without changing the system scale, with the main objective of minimizing unnecessary data movements during transitioning. HACFS [41] employs a fast code to optimize data repair and a compact code to reduce storage overhead; it realizes an efficient conversion (transitioning) approach to upcode and downcode data chunks between the fast and compact codes. Wang et al. [35] consider a similar problem but use different erasure codes as the fast and compact codes. These two approaches only realize the conversion between two specific codes, whereas ClusterRT supports successive transitioning with different encoding parameters.
SRS [32] pre-allocates data chunks on the nodes and preserves the data layout after transitioning, thereby eliminating data relocation; however, it still incurs intensive parity update traffic. In view of this, ERS [40] designs a novel encoding matrix construction and data placement to increase the number of overlapping data chunks, so that far fewer data chunks need to be read during the parity update. Both StripeMerge [43] and Convertible codes [23] generate a wide stripe (i.e., with a larger \(k\)) by merging two narrow stripes (i.e., with a smaller \(k\)) and reduce the data relocation and parity update traffic during stripe merging. However, SRS [32], ERS [40], and StripeMerge [43] only reduce the transitioning traffic and do not consider the bandwidth diversity in hierarchical data centers. Wu et al. [38] merge multiple narrow stripes of LRCs into a wide stripe and propose an optimal data placement to minimize the cross-cluster transitioning traffic for stripe merging. Convertible codes [22, 23] facilitate efficient code conversions while minimizing resource usage. In comparison, ClusterRT supports redundancy transitioning operations with more general configurations, provided that \(k^{\prime }\gt k\), and it both minimizes and balances the cross-cluster transitioning traffic in erasure-coded data centers.
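As a simplified illustration of why stripe merging can avoid rereading data chunks, consider the single-parity case (\(m=1\)), where each parity chunk is the XOR of its data chunks; general RS and LRC constructions are more involved, which is precisely what the above works address:
\[
p_A=\bigoplus_{i=1}^{k}d_i,\qquad p_B=\bigoplus_{i=k+1}^{2k}d_i \;\Longrightarrow\; p^{\prime}=\bigoplus_{i=1}^{2k}d_i=p_A\oplus p_B,
\]
so the merged parity chunk can be computed from the two existing parity chunks alone, without touching any data chunk.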

7 Conclusions

We present ClusterRT, an approach that achieves tunable erasure coding in data centers. ClusterRT formulates the data relocation as a maximum flow problem and designs a parity placement that provably attains the theoretical lower bound of the cross-cluster update traffic within a single stripe. It then leverages the encoding dependency to reduce the cross-cluster update traffic across multiple stripes and exploits load balancing to accelerate the transitioning process. Both large-scale simulations and testbed experiments demonstrate the transitioning efficiency of ClusterRT compared to state-of-the-art approaches.

References

[1] Ceph. 2016. Erasure Coding in Ceph. Retrieved from https://docs.ceph.com/en/latest/rados/operations/erasure-code/
[2] OpenStack. 2019. Erasure Code Support. Retrieved from https://docs.openstack.org/swift/latest/overview_erasure_code.html
[4] Alibaba Cloud. 2023. Alibaba Cloud Elastic Compute Service. Retrieved from https://www.alibabacloud.com/product/ecs
[5] Faraz Ahmad, Srimat T. Chakradhar, Anand Raghunathan, and T. N. Vijaykumar. 2014. ShuffleWatcher: Shuffle-aware scheduling in multi-tenant MapReduce clusters. In 2014 USENIX Annual Technical Conference (USENIX ATC ’14). 1–13.
[6] Haibo Chen, Heng Zhang, Mingkai Dong, Zhaoguo Wang, Yubin Xia, Haibing Guan, and Binyu Zang. 2017. Efficient and available in-memory KV-store with hybrid erasure coding and replication. ACM Transactions on Storage 13, 3 (2017), 1–30.
[7] Liangfeng Cheng, Yuchong Hu, and Patrick P. C. Lee. 2019. Coupling decentralized key-value stores with erasure coding. In Proceedings of the ACM Symposium on Cloud Computing (SoCC).
[8] Mosharaf Chowdhury, Srikanth Kandula, and Ion Stoica. 2013. Leveraging endpoint flexibility in data-intensive clusters. ACM SIGCOMM Computer Communication Review 43, 4 (2013), 231–242.
[9] Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. 2010. Benchmarking cloud serving systems with YCSB. In Proceedings of the 1st ACM Symposium on Cloud Computing. 143–154.
[10] Daniel Ford, François Labelle, Florentina I. Popovici, Murray Stokely, Van-Anh Truong, Luiz Barroso, Carrie Grimes, and Sean Quinlan. 2010. Availability in globally distributed storage systems. In 9th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’10).
[11] Guowen Gong, Zhirong Shen, Suzhen Wu, Xiaolu Li, and Patrick P. C. Lee. 2021. Optimal rack-coordinated updates in erasure-coded data centers. In IEEE Conference on Computer Communications (IEEE INFOCOM ’21). IEEE, 1–10.
[12] Albert Greenberg, James R. Hamilton, Navendu Jain, Srikanth Kandula, Changhoon Kim, Parantap Lahiri, David A. Maltz, Parveen Patel, and Sudipta Sengupta. 2009. VL2: A scalable and flexible data center network. In Proceedings of the ACM SIGCOMM 2009 Conference on Data Communication. 51–62.
[13] Yuchong Hu, Liangfeng Cheng, Qiaori Yao, Patrick P. C. Lee, Weichun Wang, and Wei Chen. 2021. Exploiting combined locality for wide-stripe erasure coding in distributed storage. In 19th USENIX Conference on File and Storage Technologies (FAST ’21). 233–248.
[14] Yuchong Hu, Xiaolu Li, Mi Zhang, Patrick P. C. Lee, Xiaoyang Zhang, Pan Zhou, and Dan Feng. 2017. Optimal repair layering for erasure-coded data centers: From theory to practice. ACM Transactions on Storage 13, 4 (2017), 1–24.
[15] Cheng Huang, Huseyin Simitci, Yikang Xu, Aaron Ogus, Brad Calder, Parikshit Gopalan, Jin Li, and Sergey Yekhanin. 2012. Erasure coding in Windows Azure Storage. In 2012 USENIX Annual Technical Conference (USENIX ATC ’12). 15–26.
[16] Jianzhong Huang, Xianhai Liang, Xiao Qin, Ping Xie, and Changsheng Xie. 2014. Scale-RS: An efficient scaling scheme for RS-coded storage clusters. IEEE Transactions on Parallel and Distributed Systems 26, 6 (2014), 1704–1717.
[17] Alon Itai, Yehoshua Perl, and Yossi Shiloach. 1982. The complexity of finding maximum disjoint paths with length constraints. Networks 12, 3 (1982), 277–286.
[18] Weihang Jiang, Chongfeng Hu, Yuanyuan Zhou, and Arkady Kanevsky. 2008. Are disks the dominant contributor for storage failures? A comprehensive study of storage subsystem failure characteristics. ACM Transactions on Storage 4, 3 (2008), 1–25.
[19] Saurabh Kadekodi, Francisco Maturana, Sanjith Athlur, Arif Merchant, K. V. Rashmi, and Gregory R. Ganger. 2022. Tiger: Disk-adaptive redundancy without placement restrictions. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’22). 413–429.
[20] Saurabh Kadekodi, K. V. Rashmi, and Gregory R. Ganger. 2019. Cluster storage systems gotta have HeART: Improving storage efficiency by exploiting disk-reliability heterogeneity. In 17th USENIX Conference on File and Storage Technologies (FAST ’19). 345–358.
[21] Jérome Lacan and Jérome Fimes. 2004. Systematic MDS erasure codes based on Vandermonde matrices. IEEE Communications Letters 8, 9 (2004), 570–572.
[22] Francisco Maturana, V. S. Chaitanya Mukka, and K. V. Rashmi. 2020. Access-optimal linear MDS convertible codes for all parameters. In IEEE International Symposium on Information Theory (ISIT ’20). IEEE, 577–582.
[23] Francisco Maturana and K. V. Rashmi. 2020. Convertible codes: New class of codes for efficient conversion of coded data in distributed storage. In Proceedings of the 11th Innovations in Theoretical Computer Science Conference (ITCS ’20), Vol. 151. 66.
[24] Subramanian Muralidhar, Wyatt Lloyd, Sabyasachi Roy, Cory Hill, Ernest Lin, Weiwen Liu, Satadru Pan, Shiva Shankar, Viswanath Sivakumar, Linpeng Tang, and Sanjeev Kumar. 2014. f4: Facebook’s warm BLOB storage system. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’14). USENIX Association, 383–398.
[25] Michael Ovsiannikov, Silvius Rus, Damian Reeves, Paul Sutter, Sriram Rao, and Jim Kelly. 2013. The Quantcast file system. Proceedings of the VLDB Endowment 6, 11 (2013), 1092–1101.
[26] Dimitris S. Papailiopoulos and Alexandros G. Dimakis. 2014. Locally repairable codes. IEEE Transactions on Information Theory 60, 10 (2014), 5843–5855.
[27] James S. Plank, Scott Simmerman, and Catherine D. Schuman. 2008. Jerasure: A Library in C/C++ Facilitating Erasure Coding for Storage Applications-Version 1.2. Tech. Rep. CS-08-627, University of Tennessee.
[28] Irving S. Reed and Gustave Solomon. 1960. Polynomial codes over certain finite fields. Journal of the Society for Industrial and Applied Mathematics 8, 2 (1960), 300–304.
[29] Zhirong Shen and Patrick P. C. Lee. 2018. Cross-rack-aware updates in erasure-coded data centers. In Proceedings of the 47th International Conference on Parallel Processing. 1–10.
[30] Zhirong Shen, Jiwu Shu, Zhijie Huang, and Yingxun Fu. 2020. ClusterSR: Cluster-aware scattered repair in erasure-coded storage. In 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS ’20). IEEE, 42–51.
[31] Zhirong Shen, Jiwu Shu, and Patrick P. C. Lee. 2016. Reconsidering single failure recovery in clustered file systems. In 2016 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN ’16). IEEE, 323–334.
[32] Konstantin Taranov, Gustavo Alonso, and Torsten Hoefler. 2018. Fast and strongly-consistent per-item resilience in key-value stores. In Proceedings of the Thirteenth European Conference on Computer Systems (EuroSys). 1–14.
[33] Robert Endre Tarjan. 1983. Data Structures and Network Algorithms. SIAM.
[34] Jiguang Wan, Peng Xu, Xubin He, Jibin Wang, Junyao Li, and Changsheng Xie. 2016. H-Scale: A fast approach to scale disk arrays via hybrid stripe deployment. ACM Transactions on Storage 12, 3 (2016), 1–30.
[35] Zizhong Wang, Haixia Wang, Airan Shao, and Dongsheng Wang. 2020. An adaptive erasure-coded storage scheme with an efficient code-switching algorithm. In Proceedings of the 49th International Conference on Parallel Processing. 1–11.
[36] Hakim Weatherspoon and John D. Kubiatowicz. 2002. Erasure coding vs. replication: A quantitative comparison. In International Workshop on Peer-to-Peer Systems. 328–337.
[37] Chentao Wu, Xubin He, Jizhong Han, Huailiang Tan, and Changsheng Xie. 2012. SDM: A stripe-based data migration scheme to improve the scalability of RAID-6. In 2012 IEEE International Conference on Cluster Computing. IEEE, 284–292.
[38] Si Wu, Qingpeng Du, Patrick P. C. Lee, Yongkun Li, and Yinlong Xu. 2022. Optimal data placement for stripe merging in locally repairable codes. In IEEE Conference on Computer Communications (IEEE INFOCOM ’22). IEEE, 1669–1678.
[39] Si Wu, Zhirong Shen, and Patrick P. C. Lee. 2020. Enabling I/O-efficient redundancy transitioning in erasure-coded KV stores via elastic Reed-Solomon codes. In International Symposium on Reliable Distributed Systems (SRDS ’20). IEEE, 246–255.
[40] Si Wu, Zhirong Shen, and Patrick P. C. Lee. 2020. On the optimal repair-scaling trade-off in locally repairable codes. In IEEE INFOCOM 2020-IEEE Conference on Computer Communications. IEEE, 2155–2164.
[41] Mingyuan Xia, Mohit Saxena, Mario Blaum, and David A. Pease. 2015. A tale of two erasure codes in HDFS. In 13th USENIX Conference on File and Storage Technologies (FAST ’15). 213–226.
[42] Erci Xu, Mai Zheng, Feng Qin, Yikang Xu, and Jiesheng Wu. 2019. Lessons and actions: What we learned from 10K SSD-related storage system failures. In 2019 USENIX Annual Technical Conference (USENIX ATC ’19). 961–976.
[43] Qiaori Yao, Yuchong Hu, Liangfeng Cheng, Patrick P. C. Lee, Dan Feng, Weichun Wang, and Wei Chen. 2021. StripeMerge: Efficient wide-stripe generation for large-scale erasure-coded storage. In 2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS ’21). IEEE, 483–493.
[44] Guangyan Zhang, Jiwu Shu, Wei Xue, and Weimin Zheng. 2007. SLAS: An efficient approach to scaling round-robin striped volumes. ACM Transactions on Storage 3, 1 (2007), 3–es.
[45] Guangyan Zhang, Weimin Zheng, and Jiwu Shu. 2009. ALV: A new data redistribution approach to RAID-5 scaling. IEEE Transactions on Computers 59, 3 (2009), 345–357.
[46] Xiaoyang Zhang, Yuchong Hu, Patrick P. C. Lee, and Pan Zhou. 2018. Toward optimal storage scaling via network coding: From theory to practice. In IEEE INFOCOM 2018-IEEE Conference on Computer Communications. IEEE, 1808–1816.
[47] Weimin Zheng and Guangyan Zhang. 2011. FastScale: Accelerate RAID scaling by minimizing data migration. In 9th USENIX Conference on File and Storage Technologies (FAST ’11).
