
Stripe-schedule Aware Repair in Erasure-coded Clusters with Heterogeneous Star Networks

Published: 14 September 2024

Abstract

More and more storage systems use erasure codes to tolerate faults. An erasure code takes a group of data blocks as input and encodes a small number of parity blocks as output, and together these blocks form a stripe. When the recovery problem is reconsidered at the multi-stripe level and in clusters with heterogeneous networks, quickly generating an efficient multi-stripe recovery solution that reduces recovery time remains a challenging and time-consuming task. Previous works either use a greedy algorithm, which may fall into a local optimum and deliver low recovery performance, or a meta-heuristic algorithm, which incurs a long running time and low solution generation efficiency.
In this article, we propose a Stripe-schedule Aware Repair (SARepair) technique for multi-stripe recovery in heterogeneous erasure-coded clusters based on Reed–Solomon code. By carefully examining the metadata of blocks, SARepair intelligently adjusts the recovery solution for each stripe and obtains another multi-stripe solution with less recovery time in a computationally efficient manner. It then tolerates worse solutions to escape local optima and uses a rollback mechanism to adjust search regions and reduce recovery time further. Moreover, instead of reading blocks sequentially from each node, SARepair selectively schedules the reading order of each block to reduce the memory overhead. We extend SARepair to address full-node recovery and to adapt to the LRC code. We prototype SARepair and show via both simulations and Amazon EC2 experiments that recovery performance can be improved by up to 59.97% over a state-of-the-art recovery approach while keeping running time and memory overhead low.

1 Introduction

The explosive growth of data volume has put great pressure on storage systems in the big data era. Node failures become increasingly frequent, and effectively ensuring data reliability has become the primary concern of distributed storage systems. More and more real-world systems use erasure codes to tolerate faults, including Microsoft Azure Storage [15], the Facebook HDFS storage system [2], and Ceph [4]. In principle, an erasure code encodes m parity blocks over a finite field from the k data blocks and repairs lost blocks from the surviving blocks (see Section 2.2). The original data and parity blocks connected by an erasure code form a stripe.
Although attractive in terms of reliability and storage overhead, a major drawback of erasure codes is the expensive recovery traffic: repairing a single lost block often requires retrieving k blocks (called helpers) from the \(k+m-1\) surviving blocks. To reduce the recovery traffic, researchers have proposed a spate of recovery methods, for example, constructing new coding structures [6, 15] or utilizing recovery techniques [16, 28]. However, extending these recovery methods to heterogeneous networks remains an open problem, and reducing recovery traffic does not always mean reducing recovery time. It is natural for an erasure-coded cluster to be composed of multiple nodes with heterogeneous network environments. The reasons are threefold: (i) due to new node additions or system upgrades, different nodes have different bandwidths [37]; (ii) to avoid the failure of an entire cluster, cloud vendors use geo-distributed clusters that span multiple geographic regions, which makes the heterogeneity even more significant [36]; and (iii) each node usually has multiple tasks running simultaneously in hot storage systems, limiting the available bandwidth for recovery tasks [34]. More importantly, the network bandwidth of a node may not be stable over time, which motivates us to design a technique that completes the recovery process before the bandwidth changes.
In addition, existing studies [3, 37] only solve the single-stripe failure problem, yet multi-stripe failures are also very common. The main reasons for multi-stripe failures are as follows. (i) Full-node failure: An observation shows that each node stores around 250K blocks in the Aliyun Pangu System [31]. When one node fails (caused by system upgrades, network disconnections, or disk failures), all the blocks in that node are permanently lost. Because these blocks are distributed across different stripes, the system must manipulate the recovery of multiple stripes simultaneously. (ii) Lazy recovery: To avoid repairing temporary failures, the system only launches the recovery process when the number or duration of failures reaches a predetermined threshold [30]. In practice, most failures are temporary, and the failed blocks may come back within a short period, e.g., 10 minutes [8]. Since stripes fail independently, multiple block failures may accumulate within the time threshold, resulting in multi-stripe failures. These conditions make failure recovery more challenging.
Because files are distributed over different nodes and transferred in parallel, recovery performance is mainly restricted by the poorly performing nodes in star topology networks. It is intuitive to read fewer (or more) blocks from the nodes with lower (or higher) bandwidths by adjusting the per-stripe solution (which specifies the k helper blocks). Thus, we can adjust the per-stripe solution to reduce the recovery time. Recent studies provide some ideas. For example, EG [41] uses a greedy algorithm to iteratively adjust the per-stripe solution and generate the multi-stripe recovery solution, but it may fall into a local optimum and cannot achieve high recovery performance when facing large-scale stripes. SA [10] exploits a simulated annealing algorithm that introduces random factors into the solution adjustment process (e.g., it selectively accepts a poor solution to escape a local optimum) to generate the multi-stripe solution. However, finding an efficient solution incurs a long running time, which may not be acceptable under changing network environments.
Achieving immediate and fast recovery for multi-stripe failures is challenging. Specifically, we seek to achieve the following objectives: (i) high recovery performance (we aim to accelerate data recovery and reduce recovery time to maintain data availability [34, 37]); (ii) fast solution generation (we need to generate the multi-stripe solution quickly, which is essential for the online recovery scenario [28] and changing network environments [34]); and (iii) low memory overhead (repairing a large number of failed stripes simultaneously inevitably incurs competition for limited memory resources, which negates the recovery performance [10]).
To this end, we propose a novel and effective Stripe-schedule Aware Repair (SARepair) recovery technique. SARepair carefully examines the block metadata to find a near-optimal multi-stripe recovery solution while keeping time complexity low. The metadata takes up only a negligible amount of storage overhead and significantly narrows down the solution space. In terms of recovery performance, SARepair tolerates worse solutions to escape local optima and uses a rollback mechanism to avoid getting stuck in roundabout search regions. Therefore, SARepair can quickly generate the recovery solution and achieve high recovery performance simultaneously. Moreover, SARepair employs an optimization algorithm that reduces the memory overhead by selectively dispatching the reading order in each node.
Finally, SARepair comprehensively improves the performance of multi-stripe failure recovery at three levels: solution generation overhead, network transmission overhead, and memory usage. To the best of our knowledge, this is the first work to simultaneously consider the optimization of these three levels for multi-stripe failure recovery in erasure-coded clusters.
Our contributions are summarized as follows.
We reconsider multi-stripe recovery in heterogeneous networks and identify two problems that existing works still need to address: the generation of effective recovery solutions and the reduction of memory overhead.
We propose a novel multi-stripe recovery technique named SARepair. SARepair uses the metadata of blocks to intelligently generate a per-stripe solution. It tolerates worse solutions and uses a rollback mechanism to further improve recovery performance. We also consider recovery in rack clusters and exploit tailored metadata to generate recovery solutions. Moreover, we design an optimization algorithm to reduce memory overhead.
We extend SARepair to address large-scale stripe recovery when a full node fails. It uses a batching algorithm and selectively arranges appropriate stripes from all pending failed stripes to perform recovery based on the current bandwidth environment. We emphasize that SARepair can also adapt to the LRC code.
We implement and evaluate SARepair via both simulations and Amazon EC2 experiments. Evaluations show that SARepair achieves 1.59× the recovery performance of the state-of-the-art and saves up to 62.45% of memory overhead while keeping running time low.
The rest of this article proceeds as follows. Section 2 introduces the background. Section 3 presents and formulates the problem. Section 4 describes the design of SARepair. Section 5 evaluates our technique. In Section 6, we discuss the related work. Finally, Section 7 concludes the article.

2 Background

2.1 Heterogeneous Networks

In practice, storage nodes typically have different network bandwidths in real storage clusters, yet existing studies rarely consider heterogeneous bandwidth networks. The heterogeneity arises for a few reasons. (i) Due to new node additions or regular upgrades of system components, the storage nodes in the same cluster have different computing and network capabilities, giving nodes different speeds for data access [5, 9, 40]; (ii) to avoid the failure of an entire cluster, storage vendors (e.g., Amazon EC2 [1] and Google [11]) often adopt geo-distributed clusters that span multiple geographic regions to organize storage nodes, which makes the difference in link bandwidths even more significant [16, 36]; and (iii) although the bandwidth within a cluster is sufficient and stable, each node usually has multiple tasks running simultaneously in a hot storage system, and the computing and network resources are shared by diverse applications, limiting the available bandwidth (the node bandwidth remaining for a new task) even if the link bandwidth is the same [33, 34]. We analyze three common hot storage workloads (SWIM, TPC-DS, and TPC-H) from Reference [34] and show each node's available bandwidth with boxplots. In Figure 1, we find that the medians of the available bandwidth of SWIM and TPC-H are lower than 8 MB/s, and 50% of the available bandwidths of SWIM range from 0.42 to 30.09 MB/s. In addition, half of the available bandwidths of TPC-DS and TPC-H span a wide range from 0.02 to 81.16 MB/s. Due to the uniform distribution of blocks [32, 33] and the use of parallel technology [19, 36], the slowest node becomes the performance bottleneck and prolongs the whole recovery time.
Fig. 1. The bandwidth analysis of three hot storage workloads.

2.2 Recovery in Erasure-coded Clusters

Many data centers use classic erasure coding for high data reliability and storage efficiency. Among the many erasure codes, Reed–Solomon (RS) codes [25] are the most popular. An RS code encodes k original fixed-size blocks (called data blocks) into m redundant blocks (called parity blocks). For simplicity, we denote the RS code configured by the parameters k and m as RS(\(k,m\)). A collection of \(k+m\) blocks is called a stripe, in which the \(k+m\) blocks are distributed across \(k+m\) storage nodes. Figure 2 illustrates an erasure-coded cluster in star topology networks. It shows three stripes of 12 blocks encoded by RS(2, 2), in which blocks with the same color belong to the same stripe. Since erasure codes are based on linear coding, the block size is unchanged after encoding and repairing. We mainly focus on single-block failures within each stripe, because this is the dominant (\(\gt\)98%) case in practice [13]. As shown in Figure 2, to repair the lost block, the center server reads k blocks (called helpers) into memory in parallel (step 1), calculates the repaired block (step 2), and sends the repaired block to a node (step 3). When an entire node fails, the system repairs the multiple stripes contained in the failed node and selects a new node that serves as a backup without running any tasks (called a hot-standby node) to replace the failed node [27]. The read time dominates the recovery time, since the hot-standby node has sufficient network resources to receive the repaired blocks. Thus, this article focuses on reducing the read time (step 1).
Fig. 2. Example of the erasure-coded clusters deployed with RS(2, 2) in star topology networks.
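To make the three-step flow concrete, below is a minimal sketch of single-block repair. As a simplifying assumption, it uses a single XOR parity (i.e., RS(k, 1)) instead of full Galois-field RS(k, m) arithmetic (which real systems obtain from libraries such as Jerasure), and the toy sizes and names are illustrative.

```python
# Minimal sketch of the three-step repair flow in Figure 2.
# Simplification: a single XOR parity (RS(k, 1)) stands in for full
# Galois-field RS(k, m) arithmetic; sizes are toy values.
import os

def encode(blocks):
    """XOR a list of equal-size blocks into one parity block."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)

def repair(helpers):
    """Step 2: rebuild the lost block from the k surviving helpers."""
    return encode(helpers)  # XOR of the k surviving blocks restores the lost one

k, block_size = 2, 4            # real clusters use, e.g., 4 MB blocks
data = [os.urandom(block_size) for _ in range(k)]
stripe = data + [encode(data)]  # k data blocks + 1 parity block

lost = stripe.pop(0)            # a single block failure in the stripe
helpers = stripe                # step 1: read k helpers in parallel
repaired = repair(helpers)      # step 2: calculate the repaired block
assert repaired == lost         # step 3 would send it to the new node
```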

3 Problem and Formulation

3.1 Open Problems

We have provided an overview of the problems in Section 1, and the following discussions provide more detailed explanations.
Problem 1: How to quickly find an efficient multi-stripe recovery solution?
We first note that the helper blocks selected for each stripe directly determine the recovery time. For example, in Figure 3, suppose a cluster of five nodes encoded by RS(2, 2), in which three stripes have lost blocks. Time Cost represents the time to read one block from a node. To repair each lost block, \(k=2\) helper blocks (the dashed boxes in Figure 3) are selected from each stripe. Figure 3 shows two multi-stripe recovery solutions, where the first solution needs 10 s (the slowest node, \(N_{3}\)) to accomplish the recovery, while the latter needs only 6 s (the slowest node, \(N_{4}\)). This example indicates that a reasonable selection of the recovery solution can reduce the overall recovery time, which is determined by the node with the maximum read time. However, solving the multi-stripe failure recovery problem is non-trivial.
Fig. 3. Reasonably selecting each stripe's recovery solution can reduce recovery time.
If we consider \(s \gt 1\) stripes, where each stripe has at most \(C(k+m-1, k)\) possible solutions, then the total number of possible multi-stripe solutions grows to \(C(k+m-1, k)^{s}\). The enumeration approach can thus involve a prohibitively large number of trials. EG [41] enumerates the feasible recovery solutions for each stripe and greedily replaces the old solution with another one that introduces less read time. Since the greedy algorithm easily falls into a local optimum, its recovery performance is deficient. SA [10] uses a simulated annealing algorithm that accepts worse solutions to escape local optima. However, it has a slow convergence speed and may get stuck in roundabout search regions caused by probabilistic tolerance. How to quickly find an efficient multi-stripe recovery solution is therefore critical.
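To see the blow-up concretely, the following brute-force sketch enumerates the full multi-stripe space for a toy RS(2, 2) layout; the node time costs and block placement are hypothetical, not those of Figure 3.

```python
# Brute-force sketch of the enumeration baseline: each stripe picks k
# helpers from its k+m-1 surviving blocks, and the multi-stripe space
# is C(k+m-1, k)**s. Time costs and placement are hypothetical.
from itertools import combinations, product

k, m = 2, 2
time_cost = {1: 1, 2: 4, 3: 4, 4: 3, 5: 1}              # seconds per block read
surviving = {1: [2, 3, 4], 2: [1, 3, 4], 3: [3, 4, 5]}  # stripe -> nodes

per_stripe = {s: list(combinations(nodes, k)) for s, nodes in surviving.items()}

def read_time(solution):
    """Maximum per-node read time: helpers on one node are read serially."""
    load = {}
    for helpers in solution:
        for node in helpers:
            load[node] = load.get(node, 0) + time_cost[node]
    return max(load.values())

best = min(product(*per_stripe.values()), key=read_time)
print("optimal multi-stripe read time:", read_time(best), "s")
# C(3, 2)**3 = 27 trials here; C(11, 8)**s for RS(8, 4) quickly explodes.
```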
Problem 2: How to reduce the memory overhead during recovery?
In addition to adjusting the recovery solution, the memory overhead during recovery must also be considered. Since the server reads data in parallel, different bandwidths cause the blocks in different nodes to be read into memory at inconsistent times. Once the k helper blocks of a stripe are all in memory, the server can aggregate them into an aggregated block and send it to the new node. As a result, the first-arriving block resides in memory until all k blocks arrive, which drives up memory usage. We argue that the available memory is limited, because the server has many tasks competing for memory resources. When the available memory space is insufficient, recovery is likely to be hindered.
Figure 4 shows a cluster deployed with RS(2, 2) and a triple-stripe recovery solution. For convenience, we only show helper blocks, marked with dashed boxes. Assume that the size of each block is M and that the reading, calculating, and sending processes run in parallel. As shown in Figure 4(a), if the server reads the helper blocks in each node sequentially (e.g., the reading order of \(N_{1}\) is \(\lbrace b_{3}\) \(b_{5} \rbrace\)), then the maximum memory overhead appears in the 4th second, when a total of 5M resides in memory. Since \(b_{3}\) (the first-arriving block) cannot be aggregated with \(b_{4}\) until the 8th second, \(b_{3}\) resides in memory for a long time. As shown in Figure 4(b), if we change the reading order in \(N_{1}\) and \(N_{3}\) (e.g., the reading order of \(N_{1}\) becomes \(\lbrace b_{5}\) \(b_{3} \rbrace\)), then the maximum memory overhead is only 3M. Therefore, scheduling the reading order of the blocks in each node to reduce the maximum memory overhead is also very important for further speeding up recovery.
Fig. 4. Reasonably scheduling the reading order of blocks in each node can reduce memory overhead.
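The effect can be reproduced with a small event simulation. The sketch below computes the peak memory of a reading schedule, where a block occupies memory from its arrival until its stripe's k-th helper arrives; the arrival times are hypothetical and only mimic the flavor of Figure 4.

```python
# Sketch of the peak-memory effect in Figure 4: a block occupies memory
# until all k helpers of its stripe have arrived. Arrivals are hypothetical.
def peak_memory(arrivals, k, block_size=1):
    """arrivals: list of (stripe_id, arrival_time), k entries per stripe."""
    done = {}                       # stripe -> arrival times of its helpers
    for stripe, t in arrivals:
        done.setdefault(stripe, []).append(t)
    events = []
    for times in done.values():
        aggregate = max(times)      # stripe is encoded when the k-th helper lands
        for t in times:
            events.append((t, +block_size))          # block enters memory
        events.append((aggregate, -k * block_size))  # k blocks leave together
    usage = peak = 0
    for _, delta in sorted(events, key=lambda e: (e[0], -e[1])):
        usage += delta
        peak = max(peak, usage)
    return peak

sequential = [(1, 4), (1, 8), (2, 2), (2, 4), (3, 6), (3, 8)]
reordered  = [(1, 8), (1, 8), (2, 2), (2, 4), (3, 4), (3, 6)]
print(peak_memory(sequential, k=2), peak_memory(reordered, k=2))
# 4 vs 3 with these hypothetical arrivals: reordering lowers the peak.
```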

3.2 Formulation

Based on the above problems, we first establish an optimization model for the multi-stripe recovery problem.
Consider an erasure-coded cluster that deploys RS(k, m) over n nodes denoted by \(\lbrace\)\(N_{1}\), \(N_{2}\), ..., \(N_{n}\)\(\rbrace\). The cluster contains \(s \ge 1\) stripes with lost blocks, and each stripe \(S_{i}\) (\(1 \le i \le s\)) spans \(k+m\) different nodes. We use \(b_{i,j}\) to indicate that a block of stripe \(S_{i}\) is located in node \(N_{j}\), and let \(solu_{i}\) be the recovery solution of stripe \(S_{i}\). \(TimeCost_{i}\) represents the time to read one block from node \(N_{i}\), obtained by dividing the block size M by the link bandwidth of \(N_{i}\). \(R(N_{i})\) represents the read time of node \(N_{i}\), which is the number of helpers in \(N_{i}\) multiplied by \(TimeCost_{i}\). By the principle of parallel reading, the maximum read time \(\alpha\) is determined by the slowest node. Thus, the maximum read time \(\alpha\) required to read all blocks of the s stripes is as follows:
\begin{equation} \alpha = max(R(N_{1}), R(N_{2}), \ldots , R(N_{n})). \end{equation}
(1)
As described in Section 3.1, the maximum memory overhead \(\beta\) during recovery also affects the performance. Therefore, we can formulate the following optimization problem:
\begin{equation} Minimize~~\alpha ~~and~~\beta . \end{equation}
(2)
Our optimization goal is to reduce the maximum read time and maximum memory overhead.
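As a sanity check, Equation (1) can be transcribed directly into code. The sketch below computes \(\alpha\) for a hypothetical three-stripe solution; the time costs and helper placement are illustrative assumptions.

```python
# Direct transcription of Equation (1), under hypothetical inputs:
# R(N_i) = (#helper blocks on N_i) * TimeCost_i, and alpha = max_i R(N_i).
def max_read_time(solutions, time_cost):
    """solutions: per-stripe lists of helper node ids; time_cost: per node."""
    helpers_per_node = {}
    for solu in solutions:
        for node in solu:
            helpers_per_node[node] = helpers_per_node.get(node, 0) + 1
    return max(count * time_cost[node] for node, count in helpers_per_node.items())

time_cost = {1: 1, 2: 4, 3: 4, 4: 3, 5: 1}   # TimeCost_i = M / bandwidth_i
solutions = [[2, 3], [1, 3], [4, 5]]          # k = 2 helpers per stripe
print("alpha =", max_read_time(solutions, time_cost), "s")  # N_3: 2 * 4 = 8 s
```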

4 The Design of SARepair

In this section, we present the detailed design of SARepair. SARepair has the following design objectives:
For each failed stripe, SARepair generates an efficient recovery solution to reduce the maximum read time while maintaining a low running time (Section 4.1).
For each helper block in the node, SARepair generates a reasonable reading order to reduce the maximum memory overhead (Section 4.2).
For full-node failure, SARepair selectively arranges appropriate stripes in batches to perform repairs and can be extended to the LRC code (Section 4.3).

4.1 Reducing Recovery Time

4.1.1 Recovery in General Clusters.

Given the optimization model, we argue that the model has a vast solution space. To illustrate this, we give the following definition and example.
Definition 1.
Suppose two valid recovery solutions for stripe \(S_{i}\) are \(solu_{i}\) and \(solu_{i}^{\prime}\), where block \(b_{i,x}\) in node \(N_{x}\) belongs to \(solu_{i}\) and block \(b_{i,y}\) in node \(N_{y}\) belongs to \(solu_{i}^{\prime}\) \((x \ne y)\). When \(solu_{i} - \lbrace b_{i,x} \rbrace = solu_{i}^{\prime} - \lbrace b_{i,y} \rbrace\), we say that the change from solution \(solu_{i}\) to \(solu_{i}^{\prime}\) balances node \(N_{x}\) to \(N_{y}\).
Here, we say a recovery solution is valid for a failed stripe if it can repair the failed block (i.e., it has k helper blocks). Let us take Figure 5(a) as an example. The cluster deploys RS(2, 2), and the stripe \(S_{1} = \lbrace b_{1,1}, b_{1,2}, b_{1,3}, b_{1,4} \rbrace\) has a failed block \(b_{1,1}\). At this time, two valid solutions for \(S_{1}\) are \(solu_{1} = \lbrace b_{1,2}, b_{1,3} \rbrace\) and \(solu^{\prime}_{1} = \lbrace b_{1,2}, b_{1,4} \rbrace\), both of which can repair the failed block \(b_{1,1}\). Note that \(S_{1}\) has a total of \(C(3, 2)=3\) valid solutions. In this case, block \(b_{1,3}\) in node \(N_{3}\) belongs to \(solu_{1}\), and block \(b_{1,4}\) in node \(N_{4}\) belongs to \(solu^{\prime}_{1}\). Since \(solu_{1} - \lbrace b_{1,3} \rbrace = solu^{\prime}_{1} - \lbrace b_{1,4} \rbrace = \lbrace b_{1,2} \rbrace\), we say a change from solution \(solu_{1}\) to \(solu^{\prime}_{1}\) balances node \(N_{3}\) to \(N_{4}\).
Fig. 5. Example of Algorithm 1 for RS(2, 2).
We argue that the essence of Definition 1 is to adjust the recovery solution of \(S_{i}\) from \(solu_{i}\) to \(solu^{^{\prime }}_{i}\) by substituting \(b_{i,x}\) with \(b_{i,y}\). After substituting, the helper block in \(N_{x}\) is decreased by one, and the helper block in \(N_{y}\) is increased by one. Given a node \(N_{x}\) with the maximum read time t before substituting, if the read time of \(N_{y}\) is still less than t after substituting, then we can reduce the maximum read time by adjusting the solution of \(S_{i}\). We simply prove this conclusion. \(R(N_{i})\) denotes the read time of \(N_{i}\). Suppose \(R(N_{x})=max(\ldots , R(N_{x}), \ldots , R(N_{y}),\ldots)=t\) and \(R(N_{y}) \lt t\) before substituting. When substituting \(b_{i,x}\) with \(b_{i,y}\), as the helper block in \(N_{x}\) is decreased by one, then \(R(N_{x}) \lt t\). If \(R(N_{y}) \lt t\) after substituting, then it is straightforward that \(max(\ldots , R(N_{x}), \ldots , R(N_{y}), \ldots) \lt t\). Therefore, we can get a new recovery solution that has less read time by adjusting \(S_{i}\). For example, in Figure 5(a), \(N_{3}\) has the maximum read time \(R(N_{3})=t=8s\) and \(R(N_{4})=3s \lt t\) before substituting. As \(R(N_{4})=6s \lt t\) after substituting \(b_{1,3}\) with \(b_{1,4}\) in \(S_{1}\), we reduce the maximum read time to 6s. Next, we use the metadata of blocks to improve the search efficiency for a new recovery solution (e.g., from \(solu_{i}\) to \(solu^{^{\prime }}_{i}\)).
Existing methods need to traverse all \(C(k+m-1, k) \cdot s\) valid solutions in each optimization round, which remains computationally expensive. To address this problem, we store the metadata of blocks in a two-dimensional array \(A[i][j]\) (\(1 \le i \le s\), \(1 \le j \le n\)) that represents the state of each block. There are four states: 0 (no block), 1 (non-helper block), 2 (helper block), and 3 (lost block). For example, in Figure 5(a), the recovery solution of \(S_{1}\) is \(solu_{1} = \lbrace b_{1,2}, b_{1,3} \rbrace\) before substituting, so \(A[1]=\) \(\lbrack 3, 2, 2, 1, 0 \rbrack\). To balance node \(N_{3}\) to \(N_{4}\), we only need to traverse the third and fourth columns of A. If \(A[i][3]=2\) and \(A[i][4]=1\), then \(b_{i,3}\) is a helper block and \(b_{i,4}\) is a non-helper block in \(S_{i}\), so we can adjust the solution of \(S_{i}\) by substituting \(b_{i,3}\) with \(b_{i,4}\) (e.g., \(i=1\) in Figure 5(a)). In this way, we only locate the node \(N_x\) with the maximum read time, find another node \(N_y\) from the remaining \(n-1\) nodes, and traverse the xth and yth columns of A. We thereby reduce the solution space from \(C(k+m-1, k) \cdot s\) to \(C(n-1, 1) \cdot s\). For example, consider an RS(8, 4) cluster where \(s=100\) stripes are distributed over \(n=15\) nodes. Existing methods search \(C(11, 8) \cdot 100 = 16,500\) solutions, while we traverse only \(C(14, 1) \cdot 100 = 1,400\) times. The metadata significantly reduces the number of valid solutions traversed and consumes only a small amount of storage overhead (for \(s=100\) and \(n=20\), the metadata consumes about 8 KB). Algorithm 1 shows the details of our main idea.
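The following is a minimal sketch of this metadata-driven balancing step (without the tolerance and rollback machinery of the full Algorithm 1); the state array and time costs are hypothetical, loosely modeled on Figure 5(a).

```python
# Minimal sketch of one metadata-driven balancing step behind Algorithm 1.
# States: 0 = no block, 1 = non-helper, 2 = helper, 3 = lost.
# The array A and the time costs below are hypothetical.
NO_BLOCK, NON_HELPER, HELPER, LOST = 0, 1, 2, 3

def balance(A, x, y):
    """Try to move one helper from node x to node y (1-indexed columns).
    Returns the adjusted stripe index, or None if no stripe qualifies."""
    for i, row in enumerate(A):
        if row[x - 1] == HELPER and row[y - 1] == NON_HELPER:
            row[x - 1], row[y - 1] = NON_HELPER, HELPER
            return i + 1              # stripe S_i was adjusted
    return None

def read_time(A, time_cost, node):
    helpers = sum(1 for row in A if row[node - 1] == HELPER)
    return helpers * time_cost[node - 1]

A = [[LOST, HELPER, HELPER, NON_HELPER, NO_BLOCK],   # S_1
     [HELPER, LOST, HELPER, NO_BLOCK, NON_HELPER]]   # S_2
time_cost = [1, 4, 4, 3, 1]

x = max(range(1, 6), key=lambda n: read_time(A, time_cost, n))  # bottleneck node
stripe = balance(A, x, 4)  # only columns x and 4 are scanned, not all solutions
print(f"balanced node N_{x} to N_4 by adjusting S_{stripe}")
```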
Algorithm Details: We first randomly select a valid recovery solution \(solu_{i}\) for the ith stripe (lines 1 and 2). Then, we initialize a multi-stripe recovery solution Solu for the s stripes and construct a two-dimensional array A with s rows and n columns (line 3). Next, we optimize the initial Solu. Each optimization may need to adjust the recovery solutions of multiple stripes in order to escape the local optimum by tolerating worse solutions. Thus, we use one parameter, \(e~(1 \le e \lt n)\), to control the number of adjusted stripes in each optimization. By adjusting the parameter e, we can achieve a flexible tradeoff between search efficiency and recovery performance. Specifically, in each optimization, we use nSolu and nA to record the current solution and metadata, and use \(\mathbb {N}\) to record the balanced nodes so as to avoid nested loops (line 5). The algorithm locates the node \(N_{x}\) with the maximum read time \(\mathit {t}\) and appends \(N_{x}\) to \(\mathbb {N}\) (lines 6 and 7). Algorithm 1 scans the remaining nodes except those in \(\mathbb {N}\) and selects one node \(N_{y}\) (line 8). We scan the xth and yth columns of nA (line 9). If \(nA[i][x]=2\) and \(nA[i][y]=1\), then we can substitute \(b_{i,x}\) with \(b_{i,y}\) in \(nSolu_{i}\) (the recovery solution of \(S_{i}\)) to balance node \(N_{x}\) to \(N_{y}\), reduce the number of helper blocks in \(N_{x}\), and update nA (lines 10–12). We use three parameters \((cx, cy, ci)\) to record the two balanced nodes and the adjusted stripe, then append \(N_{y}\) to \(\mathbb {N}\) (lines 13 and 14). Let \(|\mathbb {N}|\) be the number of nodes in \(\mathbb {N}\). Since \(\mathbb {N}\) appends \(N_{x}\) first (line 7) and \(N_{y}\) in each balancing (line 14), \(|\mathbb {N}|-1\) indicates the number of adjusted stripes.
At this time, we calculate the read time of node \(N_{y}\). If \(R(N_{y}) \lt \mathit {t}\) and \(|\mathbb {N}|-1 \le e\), then we can generate a new multi-stripe recovery solution with less read time and resume another optimization (lines 15–17). Because the number of helpers in \(N_{x}\) decreases by one (\(R(N_{x}) \lt \mathit {t}\) after balancing), and the number of helper blocks in \(N_{y}\) increases by one, the read time of \(N_{y}\) remains less than \(\mathit {t}\) (guaranteed by line 15). If \(R(N_{y}) \ge \mathit {t}\) and \(|\mathbb {N}|-1 \lt e\), then \(N_{y}\) currently has the maximum read time. Algorithm 1 then tolerates the worse solution and rescans nodes to obtain a better multi-stripe solution. It sets \(x=y\) and \(y=1\), jumps to the for-loop in line 8, and rescans the next node to balance (lines 18–20). Algorithm 1 balances \(N_{x}\) to \(N_{y}\) in this pipelined fashion (at most e times) until \(R(N_{y}) \lt \mathit {t}\) in the last balancing.
Algorithm 1 may get stuck in roundabout search regions because it tolerates worse solutions. We use a rollback mechanism to jump out of these roundabout regions and adjust the search regions (lines 21–27). The rollback procedure is performed in the following two cases: (1) Algorithm 1 cannot find a better Solu after adjusting e stripes (line 21), and (2) when performing the zth balancing (\(1 \le z \lt e\)), Algorithm 1 cannot find an \(N_{y}\) to balance \(N_{x}\) to (line 23). Since the most recently adjusted stripe is \(S_{ci}\), the algorithm rolls back to the search state before \(S_{ci}\) was adjusted. When \(S_{ci}\) was adjusted, Algorithm 1 balanced \(N_{cx}\) to \(N_{cy}\) by substituting \(b_{ci,cx}\) with \(b_{ci,cy}\) in \(nSolu_{ci}\) and appended \(N_{cy}\) to \(\mathbb {N}\). The rollback procedure reverses this process, jumps to the for-loop in line 8, and re-searches from \(N_{cy+1}\) (lines 24–27). If there is no substitution in Solu, then the algorithm exits the while-loop and returns Solu (lines 28 and 29). The maximum read time is iteratively reduced as Algorithm 1 proceeds.
An Example: Figure 5(a) depicts a case where \(e = 1\). Algorithm 1 locates the node \(N_{3}\) with the maximum read time \(\mathit {t} = 8s\) and selects \(N_{4}\) to balance, because \(A[1][3]=2\) and \(A[1][4]=1\). We can substitute \(b_{1,3}\) with \(b_{1,4}\) in \(solu_{1}\). Hence, \(R(N_{3})=4s\) and \(R(N_{4})=6s(\lt t)\) after substituting. We can generate a new multi-stripe solution with less read time 6s.
Figure 5(b) shows a case where \(e=2\), indicating that at most two stripes are adjusted in each optimization. We first locate node \(N_{2}\) with maximum read time \(\mathit {t} = 8s\) and append \(N_{2}\) to \(\mathbb {N}\). In the first balancing (step ①), \(x=2\), \(y=3\), \(i=1\), \(A[1][2]=2\), and \(A[1][3]=1\). We balance node \(N_{2}\) to \(N_{3}\) by substituting \(b_{1,2}\) with \(b_{1,3}\) in \(solu_{1}\), set \((cx, cy, ci)=(2, 3, 1)\), and append \(N_{3}\) to \(\mathbb {N}\). Since \(R(N_{3})=9s (\gt \mathit {t})\) and \(|\mathbb {N}|-1=1(\lt e)\) after substituting, we set \(x=3\) and rescan the next node to balance. In the second balancing (step ②), \(x=3\), \(y=4\), \(i=2\), \(A[2][3]=2\), and \(A[2][4]=1\). We balance node \(N_{3}\) to \(N_{4}\) by substituting \(b_{2,3}\) with \(b_{2,4}\) in \(solu_{2}\), set \((cx, cy, ci)=(3, 4, 2)\), and append \(N_{4}\) to \(\mathbb {N}\). Since \(R(N_{4})=9s (\gt \mathit {t})\) and \(|\mathbb {N}|-1=2(=e)\) after substituting, the algorithm performs the rollback procedure. Specifically, we remove \(N_{cy=4}\) from \(\mathbb {N}\) and set \(x=cx=3\) and \(y=cy+1=5\). It rolls back to the search state before \(S_{ci=2}\) was adjusted and re-searches from node \(N_{cy+1=5}\). In the third balancing (step ③), \(x=3\), \(y=5\), \(i=3\), \(A[3][3]=2\), and \(A[3][5]=1\). We balance node \(N_{3}\) to \(N_{5}\) by substituting \(b_{3,3}\) with \(b_{3,5}\) in \(solu_{3}\) and append \(N_{5}\) to \(\mathbb {N}\). Since \(R(N_{5})=2s (\lt \mathit {t})\) and \(|\mathbb {N}|-1=2(=e)\) after substituting, we can generate a new solution with a smaller read time of 6s.
Time Complexity: We analyze the time complexity. Initializing Solu needs \(O(s)\) time. Suppose that o is the round of optimization, and each optimization takes \(O(e \cdot n \cdot s)\) time. Therefore, the time complexity of Algorithm 1 is \(O(s + o \cdot e \cdot n \cdot s)\).

4.1.2 Recovery in Rack Clusters.

In Section 4.1.1, we only considered a general cluster in which all nodes are in the same region. We next extend SARepair to rack clusters, which are more common in real erasure-coded data centers. To better manage a large number of storage nodes and avoid the failure of the entire cluster, cloud storage service providers usually organize these nodes into racks in different geographic regions, so that all storage nodes in the same rack are connected through a top-of-rack switch and all switches are interconnected through the network server [13, 28]. Figure 6 illustrates a cluster composed of five racks with two nodes each, where each stripe is encoded by RS(4, 2). To tolerate rack-level failures, a rack can store at most m blocks of the same stripe (e.g., at most \(m = 2\) in Figure 6) [32].
Fig. 6. Example of recovery in rack clusters for RS(4, 2).
In practice, the inner-rack network bandwidth is about 5× or even 20× the inter-rack network bandwidth [14], and recent studies [36] also show that the inter-rack transmission time dominates the entire recovery time (about 86%). The results obtained from our iperf [20] measurements on Amazon EC2 [1] in the Asia Pacific regions are summarized in Table 1. We notice that the inter-rack bandwidth is significantly lower than the inner-rack bandwidth. Furthermore, we observe a maximum difference of 2.54× in bandwidth across the five regions. For example, in Table 1, the maximum inter-rack bandwidth is 8.17 MB/s (between Tokyo and Seoul), the minimum inter-rack bandwidth is 3.21 MB/s (between Sydney and Seoul), and the maximum difference of inter-rack bandwidth across the five regions is 8.17/3.21 = 2.54. Thus, we next focus on reducing the inter-rack recovery time so as to reduce the total recovery time.
Table 1. Bandwidth Measurements in Asia Pacific Regions (MB/s)

Regions      Hong Kong   Tokyo   Seoul   Singapore   Sydney
Hong Kong    42.37       6.32    5.88    4.29        3.34
Tokyo        6.12        54.96   8.17    4.26        4.51
Seoul        4.67        6.64    38.24   6.55        3.26
Singapore    4.36        3.82    6.34    33.2        7.83
Sydney       3.52        4.22    3.21    6.79        56.83
Design Details: Since the inter-rack bandwidth is the performance bottleneck, the main idea is threefold: first, generate valid recovery solutions that achieve the minimum inter-rack recovery traffic by aggregating within each rack; second, construct the metadata of blocks by adding a new state; and finally, optimize the recovery solutions with Algorithm 1.
To achieve the minimum inter-rack recovery traffic, we perform inner-rack aggregation in each rack and generate intermediate blocks. For example, in Figure 6, if we select the \(k=4\) helper blocks \(\lbrace b_{2,3}, b_{2,4}, b_{2,5}, b_{2,7} \rbrace\) to repair the failed block \(b_{2,1}\) in stripe \(S_{2}\), then the two blocks \(b_{2,3}\) and \(b_{2,4}\) in the same rack \(RK_{2}\) can be combined into an intermediate block within the rack and then aggregated with the other blocks. Thus, the center server only needs to read three blocks across racks to repair the failed block of \(S_{2}\). However, if we select the \(k=4\) helper blocks \(\lbrace b_{2,3}, b_{2,5}, b_{2,7}, b_{2,10} \rbrace\), which come from four different racks \(\lbrace RK_{2}, RK_{3}, RK_{4}, RK_{5} \rbrace\), then the server must read four blocks. We argue that the fewer racks that participate in recovery, the smaller the inter-rack recovery traffic. In addition, each stripe may have multiple valid recovery solutions with the minimum inter-rack recovery traffic. In Figure 6, the stripe \(S_{2}\) has three valid recovery solutions with the minimum traffic: \(\lbrace b_{2,3}, b_{2,4}, b_{2,5}, b_{2,7} \rbrace\), \(\lbrace b_{2,3}, b_{2,4}, b_{2,5}, b_{2,10} \rbrace\), and \(\lbrace b_{2,3}, b_{2,4}, b_{2,7}, b_{2,10} \rbrace\). This gives us the opportunity to further reduce recovery time.
Next, we construct the metadata of blocks. Let r denote the number of racks in the cluster. We store the metadata of blocks in a two-dimensional array \(A[i][j]\) (\(1 \le i \le s\), \(1 \le j \le r\)) with the following five states: 0 (none), 1 (non-helper rack), 2 (helper rack), 3 (lost rack), and 4 (select). If \(A[i][j]=0\), then no block of stripe \(S_{i}\) is stored in rack \(RK_{j}\). If \(A[i][j]=1\), then the blocks of stripe \(S_{i}\) stored in rack \(RK_{j}\) are not used as helper blocks (contrary to \(A[i][j]=2\)). If \(A[i][j]=3\), then there is a failed block in rack \(RK_{j}\). If \(A[i][j]=4\), then rack \(RK_{j}\) serves as a helper in all valid solutions with minimum inter-rack recovery traffic for stripe \(S_{i}\). For example, \(RK_{2}\) serves as a helper in all such valid solutions for \(S_{2}\), thus \(A[2][2]=4\).
Finally, we use Algorithm 1 to adjust the recovery solutions. Compared to general clusters, in rack clusters we optimize the transmission time between racks, and a rack with state 4 is always used as a helper rack.
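As an illustration, the sketch below derives the rack-level states for one stripe. The placement and minimum-traffic solutions follow the \(S_{2}\) example above, but the construction is our own simplified reading: for brevity it marks a rack as state 2 whenever it can serve as a helper in some minimum-traffic solution, whereas the paper distinguishes the currently chosen helper racks (state 2) from unused candidates (state 1).

```python
# Sketch of the rack-level metadata (states 0-4) for one stripe.
# Simplification: state 2 means "can be a helper in some minimum-traffic
# solution"; the paper assigns 1 vs 2 based on the current solution.
NONE, NON_HELPER, HELPER, LOST, SELECT = 0, 1, 2, 3, 4

def rack_state(stripe_racks, lost_rack, min_solutions, rack):
    """stripe_racks: racks holding blocks of S_i; min_solutions: the
    minimum-inter-rack-traffic helper-rack sets for S_i."""
    if rack == lost_rack:
        return LOST
    if rack not in stripe_racks:
        return NONE
    if all(rack in sol for sol in min_solutions):
        return SELECT                  # always a helper, never adjusted
    if any(rack in sol for sol in min_solutions):
        return HELPER
    return NON_HELPER

# S_2 in Figure 6: blocks span racks 1-5, the lost block is in RK_1, and
# the three minimum-traffic solutions (as rack sets) all contain RK_2.
min_solutions = [{2, 3, 4}, {2, 3, 5}, {2, 4, 5}]
row = [rack_state({1, 2, 3, 4, 5}, 1, min_solutions, rk) for rk in range(1, 6)]
print(row)  # [3, 4, 2, 2, 2] -> the metadata row A[2] for stripe S_2
```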
An Example: Figure 6 depicts a case where \(e = 1\). It is obvious that \(RK_{2}\) is always a helper of \(S_{2}\) and \(RK_{3}\) is always a helper of \(S_{3}\), so \(A[2][2]=4\) and \(A[3][3]=4\). Due to the random distribution of each stripe's blocks among the racks, the number of valid recovery solutions available for adjustment also differs per stripe. We can substitute \(b_{1,6}\) with \(b_{1,9}\) in \(S_{1}\), substitute \(b_{2,5}\) with \(b_{2,10}\) in \(S_{2}\), and substitute \(b_{3,7}\) with \(b_{3,9}\) in \(S_{3}\). After Algorithm 1, we can generate a new multi-stripe solution with a smaller read time of 6s.

4.2 Reducing Memory Overhead

As noted in Section 3.1, sequentially and simultaneously reading all the needed helpers into memory takes up more space. To this end, we design Algorithm 2 to reduce memory overhead.
Algorithm Details: Let \(w[i]\) denote the read weight of the solution \(solu_{i}\), and \(Rorder_j\) denote the reading order for node \(N_{j}\). We set \(w[i]=0\) and initialize \(Rorder_{j}\) as the initial reading order for node \(N_{j}\) (lines 1–4). The logical order of the helper blocks in each node determines the initial reading order. For example, in Figure 7, the initial reading order of \(N_{1}\) is \(Rorder_{1} = \lbrace b_{2,1}~b_{3,1} \rbrace\). Then, for each solution \(solu_{i}\), we get the maximum time cost mTimeCost among the nodes where the blocks of \(solu_{i}\) are located and scan each helper block \(b_{i,j}\) in \(solu_{i}\) to compute the read weight. The read weight of solution \(solu_{i}\) is calculated as \(w[i] = \sum (mTimeCost-TimeCost_{j})^2\) (lines 5–8). The weight reflects the dispersion of the read rates of the blocks in the same stripe: the lower the weight, the more consistent the arrival times of the blocks, and vice versa. Each block in \(solu_{i}\) is given the weight \(w[i]\), i.e., the weight of block \(b_{i,j}\) equals the weight of solution \(solu_{i}\). Finally, we reorder the helper blocks in Rorder (lines 9 and 10): within the same node, the lower a block's weight, the higher its read priority.
Fig. 7. Example of Algorithm 2 for RS(2, 2).
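A compact sketch of this weighting-and-reordering step is shown below; the solutions and per-node time costs are hypothetical rather than those of Figure 7.

```python
# Sketch of Algorithm 2: weight each stripe by how spread-out its helpers'
# read times are, then read low-weight stripes first on each node.
# Solutions and time costs are hypothetical.
def reading_orders(solutions, time_cost):
    """solutions: stripe -> helper node ids; returns node -> ordered stripes."""
    weight = {}
    for stripe, nodes in solutions.items():
        m_cost = max(time_cost[n] for n in nodes)            # mTimeCost
        weight[stripe] = sum((m_cost - time_cost[n]) ** 2 for n in nodes)
    orders = {}
    for stripe, nodes in solutions.items():
        for n in nodes:
            orders.setdefault(n, []).append(stripe)
    for n in orders:                      # lower weight -> read earlier
        orders[n].sort(key=lambda s: weight[s])
    return orders

time_cost = {1: 4, 2: 1, 3: 4, 4: 3, 5: 1}
solutions = {1: [2, 3], 2: [1, 3], 3: [1, 4]}
print(reading_orders(solutions, time_cost))
# node 1 reads stripe 2's block first since w[2] = 0 < w[3] = 1
```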
An Example: For convenience, we use the example in Figure 3 (see Problem 2 in Section 3.1) to describe Algorithm 2. Figure 7 shows the optimization process of the reading order, and we only show the helper blocks. Algorithm 2 first constructs the initial reading order Rorder, which reads the blocks in each node sequentially (step ①). Then, it calculates the weight for each stripe (step ②). Finally, we reorder the helper blocks in Rorder (step ③). For example, since the weight of \(solu_{3}\) is less than that of \(solu_{2}~(w[3]\lt w[2])\), \(b_{3,1}\) should be read before \(b_{2,1}\) in node \(N_{1}\). The analysis for \(N_{3}\) is the same.
Discussion: We further analyze the helpers' status in the center server's memory and find that some blocks that arrive first still remain in memory and occupy space. For example, block \(b_{1,2}\) in Figure 7 is read into memory after the 1st second but is not encoded together with the other helper block of \(S_{1}\) until the 8th second. Since the time cost of \(N_{2}\) is 1, we can delay reading \(b_{1,2}\) into memory until the 7th second, which further reduces the memory overhead from 3M to 2M. To achieve this optimization, we first predict the time at which each block is aggregated and encoded in memory. Assuming block \(b_{i,j}\) is encoded and merged at time point t, it can start being read from node \(N_{j}\) at time point \(t-TimeCost_{j}\), effectively avoiding some blocks staying in memory for a long time.
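A small sketch of this deferred-read refinement follows, under the simplifying assumption that a stripe's aggregation time is just its slowest helper read (a real schedule must also account for other blocks queued on the same node); the inputs are hypothetical.

```python
# Sketch of the deferred-read refinement: if stripe S_i is expected to be
# aggregated at time t, block b_{i,j} need not start reading before
# t - TimeCost_j. Simplification: t is the slowest read of the stripe,
# ignoring other blocks queued on the same node. Inputs are hypothetical.
def deferred_starts(solutions, time_cost):
    """Return per-block (stripe, node) -> latest read start time."""
    starts = {}
    for stripe, nodes in solutions.items():
        t = max(time_cost[n] for n in nodes)   # expected aggregation time
        for n in nodes:
            starts[(stripe, n)] = t - time_cost[n]
    return starts

time_cost = {1: 4, 2: 1, 3: 4}
solutions = {1: [2, 3]}                 # helpers of S_1 on N_2 and N_3
print(deferred_starts(solutions, time_cost))
# {(1, 2): 3, (1, 3): 0}: N_2's block starts at t=3 instead of t=0,
# so it no longer sits in memory while waiting for N_3's block.
```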
Time Complexity: The complexity of Algorithm 2 is \(O(s + n + s \cdot k + n) \approx O(s \cdot k + n)\).

4.3 Batching in Full-node Recovery

In modern clusters, each node stores a tremendous number of blocks belonging to different stripes [36]. Therefore, when a full node fails, a tremendous number of stripes must be repaired simultaneously. Repairing all of these stripes at once is usually not preferable, as it easily overwhelms the network and memory of the center server. In addition, the dynamic and ever-changing application workloads on nodes may change the heterogeneity of network bandwidths, so we need to quickly generate an efficient recovery solution and complete the recovery process before the bandwidths change; a currently efficient recovery solution may no longer be effective after a while in a rapidly changing, dynamic storage cluster. Therefore, SARepair processes the stripes in batches and selectively arranges appropriate stripes from all pending failed stripes to perform recovery based on the current bandwidth environment.
Algorithm Details: Let \(\mathbb {S}\) denote all failed stripes, and b be the number of repaired stripes in a batch. Algorithm 3 selects b stripes to be repaired from \(\mathbb {S}\) and adjusts them to minimize the current read time for each batch. It first randomly selects b stripes \(BS = \lbrace S_{1}, \ldots , S_{b} \rbrace\) from \(\mathbb {S}\) and excludes them from \(\mathbb {S}\) (lines 2 and 3). Algorithm 3 next replaces the current stripes to be repaired in BS over a configurable number of iterations (denoted by l) to minimize the read time of the current batch (lines 4–15). In each iteration, Algorithm 3 first obtains the repair solution Solu by calling Algorithm 1 (Section 4.1) on the current BS. It then selects a stripe in \(\mathbb {S}\) to replace a stripe in BS. To quickly find the two stripes, we define the priority of being selected, \(Ps_{i}\), and of being replaced, \(Pr_{i}\), for each stripe \(S_{i}\). Specifically, after obtaining Solu in line 5, we can determine the read time of each node. The priority of being selected \(Ps_{i}\) of \(S_{i}\) in \(\mathbb {S}\) is computed as the sum of the k minimum read times among the nodes where the remaining \(k+m-1\) blocks of \(S_{i}\) are located. For example, in Figure 8(a), the \(k=2\) nodes with the least read time among the remaining \(k+m-1=3\) blocks of \(S_{4}\) are \(N_{2}\) and \(N_{3}\) (or \(N_{2}\) and \(N_{5}\)), and hence \(Ps_{4} = 3+4=7\). The priority of being replaced \(Pr_{i}\) of \(S_{i}\) in BS is computed as the sum of the read times of the nodes where the k helper blocks of \(S_{i}\) are located. In Figure 8(a), the nodes holding the helper blocks of \(S_{3}\) are \(N_{4}\) and \(N_{5}\), and hence \(Pr_{3} = 8+4=12\). After computing the priorities in lines 6 and 7, Algorithm 3 finds the \(S_{x}\) in \(\mathbb {S}\) that has the lowest priority of being selected and the \(S_{y}\) in BS that has the highest priority of being replaced (line 8). It then tries all valid solutions \(solu_{x}\) of \(S_{x}\) to replace the current solution \(solu_{y}\) of \(S_{y}\) in Solu and generates the new batch solution nSolu (lines 9 and 10). If the new solution nSolu has a lower read time than Solu, then Algorithm 3 substitutes Solu with nSolu, updates \(\mathbb {S}\) and BS, and jumps to the for-loop in line 4 to start the next iteration (lines 11–15). After l iterations, it finally obtains Rorder by calling Algorithm 2 (Section 4.2) on Solu and performs the repairs of the current batch BS (lines 16 and 17).
Fig. 8. Example of Algorithm 3 for RS(2, 2) and batch size \(b = 3\).
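The two priorities can be sketched as follows; the per-node read times are the ones quoted in the Figure 8(a) walkthrough above, except the value for \(N_{1}\), which is a filler assumption.

```python
# Sketch of the two batching priorities in Algorithm 3.
# Ps: sum of the k smallest read times among a pending stripe's surviving
# nodes; Pr: sum of read times of a batched stripe's helper nodes.
def select_priority(surviving_nodes, read_time, k):
    return sum(sorted(read_time[n] for n in surviving_nodes)[:k])

def replace_priority(helper_nodes, read_time):
    return sum(read_time[n] for n in helper_nodes)

# Per-node R(N_i) after Algorithm 1; N_1's value is a filler assumption.
read_time = {1: 0, 2: 3, 3: 4, 4: 8, 5: 4}
k = 2
pending = {4: [2, 3, 5]}   # S_4 still waiting in S (surviving nodes)
batched = {3: [4, 5]}      # S_3's helper nodes in the current batch

Ps = {s: select_priority(nodes, read_time, k) for s, nodes in pending.items()}
Pr = {s: replace_priority(nodes, read_time) for s, nodes in batched.items()}
print(Ps, Pr)  # S_4: 3 + 4 = 7 and S_3: 8 + 4 = 12, as in the walkthrough
```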
Extensions to LRC Code: Besides RS code, SARepair can also be extended to another representative erasure code, the LRC code [15], which is used in Microsoft Azure storage systems. An LRC(k, l, g) divides its k data blocks into l groups, each group encodes one local parity block from its \(k/l\) blocks, and the g global parity blocks are generated from all k blocks as in RS code. For a global parity failure, we can repair it by carefully selecting k of the remaining \(k+g-1\) blocks, as with RS code. However, when repairing a data or local parity block in a group, we cannot reduce the recovery time by selecting helper blocks, because each group only tolerates a single failure and needs all the remaining \(k/l\) blocks to repair. Therefore, we can re-organize the recovery of multiple data or local parity blocks via Algorithm 3, such that the resulting recovery time is reduced.
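A minimal sketch of this case analysis for LRC(k, l, g) helper selection is given below; the interface and return shapes are hypothetical, intended only to contrast the fixed group repair with the free choice for global parities.

```python
# Sketch of helper selection under LRC(k, l, g). Repairing a block inside
# a group is fixed (all k/l remaining group members must be read), so only
# stripe scheduling (Algorithm 3) can help; global parities behave like RS.
# The interface is hypothetical.
def lrc_helpers(k, l, g, failed):
    """failed: ('data'|'local', group_id) or ('global', index)."""
    group_size = k // l
    if failed[0] == 'global':
        # like RS: freely choose k of the remaining k + g - 1 blocks
        return {'choose': k, 'out_of': k + g - 1}
    # data or local parity: the k/l survivors of the group are mandatory
    return {'fixed_group': failed[1], 'blocks': group_size}

print(lrc_helpers(6, 2, 2, ('data', 0)))    # {'fixed_group': 0, 'blocks': 3}
print(lrc_helpers(6, 2, 2, ('global', 1)))  # {'choose': 6, 'out_of': 7}
```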
An Example: Figure 8 shows an example of Algorithm 3 for RS(2, 2) and batch size \(b = 3\). In Figure 8(a), the initial BS is \(\lbrace S_{1}, S_{2}, S_{3} \rbrace\), and the maximum read time is 8s after calling Algorithm 1. Algorithm 3 computes the priorities and finds that \(S_{3}\) in BS has the highest priority of being replaced (e.g., \(Pr_{3} = 12\)) and \(S_{4}\) in \(\mathbb {S}\) has the lowest priority of being selected (e.g., \(Ps_{4} = 4\)). Therefore, Algorithm 3 selects \(S_{4}\) to replace \(S_{3}\) and generates a new recovery solution, such that the maximum read time becomes 6s (Figure 8(b)). The stripe \(S_{3}\) is returned to \(\mathbb {S}\) and repaired in a later batch.
Time Complexity: We finally analyze the time complexity. Algorithm 3 needs \(\lceil \frac{s}{b} \rceil\) rounds to repair s stripes. In each iteration, it calls Algorithm 1 (\(O(b + o \cdot e \cdot n \cdot b)\)) to obtain Solu for b stripes, takes at most \(O(s)\) time to compute priorities and find the two stripes, and takes \(O(C(k+m-1, k))\) time to find a new solution. Finally, Algorithm 3 calls Algorithm 2 (\(O(b \cdot k + n)\)) to obtain Rorder for each batch. The overall time complexity of Algorithm 3 is \(O(\lceil \frac{s}{b} \rceil \cdot [l \cdot (O(Alg. 1) + s + C(k+m-1, k)) + O(Alg. 2) ]) = O(\lceil \frac{s}{b} \rceil \cdot [l \cdot ((b + o \cdot e \cdot n \cdot b) + s + \frac{(k+m-1)!}{k!\,(m-1)!}) + (b \cdot k + n)])\). Note that Algorithms 1 and 2 process b stripes per batch. In real situations, s is much larger than \(C(k+m-1, k)\), \(b \cdot k\), n, and b. Therefore, the time complexity of SARepair is approximately \(O(\frac{s \cdot l \cdot (o \cdot e \cdot n \cdot b + s)}{b}) = O(s \cdot l \cdot o \cdot e \cdot n + \frac{s^{2} \cdot l}{b})\), and the time complexity of each batch is \(O(l \cdot (o \cdot e \cdot n \cdot b + s))\).

5 Evaluation

We conduct a series of intensive tests to evaluate SARepair. We choose the greedy algorithm EG [41] and the meta-heuristic algorithm SA [10] as references. For SARepair, the parameter e is set to 3, the batch size b is set to 200, and the number of iterations l is set to 50 for each batch. The parameters of SA are determined by grid search. To evaluate SARepair under real heterogeneous network environments, we select three common hot storage workloads (SWIM, TPC-DS, and TPC-H) from Reference [34] and randomly select a set of bandwidth situations to set the bandwidth of each node. This method has also been used in recent studies [39] to mimic heterogeneous networks. We erase one node to mimic the failure and measure the running time, recovery time, recovery throughput, and maximum memory overhead. The running time refers to the generation time of a multi-stripe recovery solution. All results are averaged over five runs, and the figures also show the maximum and minimum values (some may be invisible because the values are small). Time is measured in seconds, and memory overhead is measured in MB. We expect to answer the following five questions:
What is the impact of parameters e and b on SARepair? (Experiments 2 and 3)
Can SARepair quickly generate a solution and effectively reduce recovery time when any cluster configuration varies? (Experiments 1 and 4–6)
How much recovery performance can SARepair improve in real-world heterogeneous network environments and rack clusters? (Experiments 7–9 and 11)
How do the three designed algorithms of SARepair affect the recovery performance? (Experiment 10)
How much memory usage can be reduced in three recovery techniques by Algorithm 2? (Experiment 12)

5.1 Simulations

We implement a simulator for SARepair in Python with about 2,300 SLoC and deploy it on a server. By default, we randomly distribute the blocks of \(s=1,000\) stripes with lost blocks across \(n=12\) nodes and use RS(6, 3) (used in QFS [24] and HDFS [12]), unless mentioned otherwise. We set the block size to 4 MB, which is a moderate size for real erasure-coded clusters [26]. The simulator does not implement data transmission and coding operations, which are left to Section 5.2. We measure the running time of the algorithms and obtain the total recovery time of all failed stripes. All results are averaged over five runs, and time is measured in seconds.
Experiment 1 (Efficiency of SARepair): We first evaluate the efficiency of SARepair against the enumeration method. We select RS(6, 3) and vary the number of stripes from 4 to 8. The results are shown in Table 2.
Table 2. Experiment 1 (the Efficiency of SARepair)

s   Criteria        Enumeration           SARepair
4   Running time    7.34 s                0.065 s
    Recovery time   0.89 s                0.89 s
6   Running time    2 min 9 s             0.087 s
    Recovery time   1.02 s                1.02 s
8   Running time    16 hour 12 min 29 s   0.095 s
    Recovery time   1.33 s                1.39 s
Regarding recovery time, SARepair obtains the optimal recovery solution with the same minimum read time as enumeration for \(s = 4\) and 6. When \(s = 8\), the recovery time of the optimal recovery solution is less than that of SARepair by at most 4.5%. Regarding running time, the running time of SARepair is negligible (less than 0.1 s). On the contrary, the running time of enumeration grows exponentially as the number of stripes increases.
Experiment 2 (Impact of parameter \(\boldsymbol {e}\)): We next evaluate SARepair with the parameter e varied from 1 to 5. We vary the erasure codes (RS(2, 2), RS(4, 2) (a typical RAID-6 [35] setup), and RS(6, 3) (used in QFS and HDFS)), the number of nodes (from 9 to 15), and the number of stripes (from 500 to 1,500) for testing, and further analyze the total time. Due to the large values of the experimental results, some values are not shown in Figures 9(b) and (c).
Fig. 9. Experiment 2 (the impact of parameter e).
In Figure 9(a), when e is small (\(e \lt 3\)), the running time of SARepair does not change much with e. When \(e \gt 3\), the running time of SARepair increases exponentially. We observe that the running times of different erasure codes are very similar, because the time complexity of Algorithm 1 is independent of the encoding parameters (k and m). The overall recovery time of SARepair decreases with e, dropping the most when e goes from 1 to 2. When \(e=1\), SARepair behaves like a greedy algorithm; when \(e \gt 1\), SARepair tolerates worse solutions to jump out of the local optimum.
In Figure 9(b), the more nodes there are, the longer the running time and the shorter the recovery time. The main reason is that SARepair generates the recovery solution by constructing an \(s \cdot n\) metadata matrix, and the larger n is, the more traversals it takes. Each node stores and transmits fewer blocks of failed stripes when the cluster has more nodes. The recovery time of SARepair decreases dramatically with e when \(1 \lt e \lt 3\), while it decreases linearly when \(e \gt 3\).
In Figure 9(c), when e is greater than 3, the running times increase significantly for \(s = 1,000\) and \(s = 1,500\). Moreover, increasing e does not significantly reduce the recovery time when handling small-scale stripes. The reason is that the fewer stripes there are, the smaller the optimization space for SARepair, which also explains why the running time for \(s=500\) is significantly smaller than for \(s=1,000\) and \(s=1,500\).
In Figure 9(d), we use the default configurations, measure the total time (running time + recovery time), and provide a breakdown of the total time into running time and recovery time. We find that the total time decreases first and then increases sharply with e, with the turning point at \(e=3\). In addition, although the recovery time dominates the total time, the ratio of running time increases with e; for example, the ratio of running time is 45% when \(e=5\). Therefore, a large value of e can reduce recovery time at the cost of running time, and vice versa.
Our experiment shows a fundamental tradeoff between recovery efficiency and performance, in which the parameter e plays a key role. Therefore, we set \(e=3\) in the experiments to generate the recovery solution quickly and effectively. We suggest that system designers select the parameter e based on the current system configuration and data status. SARepair can be deployed with a smaller e in an online or hot storage system that requires failed blocks to be repaired quickly, and with a larger e in an offline or cold storage system, where the center server can spend more time computing recovery solutions with shorter recovery times and thus reduce the occupation of overall cluster resources during the repair process.
Experiment 3 (Impact of batch size \(\boldsymbol {b}\)): We further evaluate the impact of different batch sizes from 50 to 1,000. We choose factors of the total number of stripes (e.g., \(s=1,000\)) as the batch sizes, namely b = 50, 100, 200, 250, 500, and 1,000. Note that when b is 1,000 (equal to \(s=1,000\)), it is equivalent to repairing all failed stripes concurrently. We again use the default parameters and vary the coding parameters for testing.
As shown in Figure 10(a), the running time of SARepair gradually decreases from \(b=50\) to \(b=250\) while increasing from \(b=250\) to \(b=1,000\). The difference in running time under different encoding parameters is not significant. The reason is that the overall time complexity of SARepair is \(O(s \cdot l \cdot o \cdot e \cdot n + \frac{s^{2} \cdot l}{b})\) (see Section 4.3), which is mainly dominated by the other parameters (\(s, l, o, e, n, b\)). To further analyze the trend of the curves, we measure the optimization round o of each batch under different batch sizes b and show the results in Figure 10(b). We observe that o increases slowly with b from \(b=50\) to \(b=250\), while it increases dramatically for \(b=500\) and \(b=1,000\). When b is small, o is small and the running time is dominated by \(\frac{s^{2} \cdot l}{b}\) (the downward trend of \(\frac{s^{2} \cdot l}{b}\) outweighs the upward trend of \(s \cdot l \cdot o \cdot e \cdot n\)), so the running time decreases; when b is large, o is large and the running time is dominated by \(s \cdot l \cdot o \cdot e \cdot n\) (the downward trend of \(\frac{s^{2} \cdot l}{b}\) is smaller than the upward trend of \(s \cdot l \cdot o \cdot e \cdot n\)), so the running time increases.
Fig. 10. Experiment 3 (the impact of batch size b).
As shown in Figure 10(c), increasing the batch size within a specific range can reduce the recovery time. For example, from 50 to 250, the reduction ratio of recovery time is 18.67%, 27.01%, and 27.78% for RS(2, 2), RS(4, 2), and RS(6, 3), respectively. We also find that when b is greater than 250, a larger batch size increases the recovery time. Batch processing can effectively balance the transmission load among nodes in each batch, thereby further reducing recovery time. To investigate this phenomenon, we calculate the standard deviation of the transmission time among nodes for each batch to measure the load balance among them; the lower this metric, the more balanced the load. Figure 10(d) shows the results, and the standard deviation also first decreases and then increases: it is highest when b is 50 and lowest when b is 250. The smaller the value of b, the more batches there are, which is more likely to cause an imbalance between nodes in each batch and thus increase the total recovery time; the larger the value of b, the smaller the optimization space available for stripe replacement within each batch, which may also cause an uneven load within the batch and increase recovery time.
Therefore, system designers can measure the optimization round and standard deviation for different batch sizes based on parameter configuration to find turning points and determine the optimal batch size.
Experiment 4 (Impact of coding parameters): We measure the running and recovery time for different coding parameters. We consider three representative configurations of RS(k, m): RS(2, 2), RS(4, 2) (a typical RAID-6 [35] setup), and RS(6, 3) (used in QFS and HDFS). We make three key observations from the results in Figure 11.
Fig. 11. Experiment 4 (impact of coding parameters).
First, as shown in Figure 11(a), SARepair significantly reduces the running time of solution generation. For example, the reduction reaches up to 81.71% and 94.83% compared to EG and SA for parameters (6, 3). This is because SARepair uses the metadata of blocks to generate a recovery solution, which significantly narrows down the solution space and makes the problem practically tractable. Second, as shown in Figure 11(b), SARepair has the lowest recovery time among the three techniques. SARepair reduces the recovery time by 21.74–39.39% and 10.21–18.70% on average compared to EG and SA, respectively. The rationale is that SARepair tolerates worse solutions and uses a rollback mechanism to reduce recovery time. Finally, the running and recovery times of all three techniques increase when handling larger parameters.
Experiment 5 (Impact of different erasure codes): We finally evaluate SARepair in a non-RS-code cluster and validate its efficiency using the LRC code [15]. For LRC(\(k, l, g\)), we select LRC(4, 2, 2) and LRC(6, 2, 2), which are also used in previous studies [36, 37]. When a data block or local parity block fails, we repair it within its group, so the set of helper blocks is fixed. When a global parity block fails, we repair it by selecting any k surviving blocks.
We find that SARepair remains effective in reducing the running and recovery time for different erasure codes: it reduces the running time by 28.33–93.65% and the recovery time by 8.45–24.26% compared to EG and SA in Figure 12. Although the LRC code needs all remaining blocks of a group to repair a data block or local parity block, SARepair can still reduce the recovery time by carefully arranging appropriate stripes for each repair batch. Therefore, SARepair adapts to different erasure codes and improves the recovery of the LRC code; the helper-selection rule is sketched below.
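The repair rule just described can be summarized by the following helper-selection sketch; the block representation and ids are our own simplification of LRC(\(k, l, g\)), not the paper's implementation.

```cpp
#include <cstddef>
#include <vector>

enum class BlockType { Data, LocalParity, GlobalParity };

// Simplified helper selection for LRC(k, l, g). Data and local parity
// blocks are repaired within their group, so the helper set is fixed to
// the group's surviving members; a global parity block may pick any k
// surviving blocks (a scheduler such as SARepair would pick the k blocks
// that minimize the batch's transmission time).
std::vector<int> select_helpers(BlockType failed,
                                const std::vector<int>& group_survivors,
                                const std::vector<int>& all_survivors,
                                std::size_t k) {
    if (failed != BlockType::GlobalParity)
        return group_survivors;  // fixed helper set within the local group
    // Assumes k <= all_survivors.size(); here we simply take the first k.
    return std::vector<int>(all_survivors.begin(), all_survivors.begin() + k);
}

int main() {
    std::vector<int> group = {1, 2, 3};              // hypothetical block ids
    std::vector<int> all   = {1, 2, 3, 4, 5, 6, 7};
    return static_cast<int>(select_helpers(BlockType::Data, group, all, 4).size());
}
```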
Fig. 12. Experiment 5 (impact of different erasure codes).

5.2 Amazon EC2 Experiments

We further implement SARepair by extending our simulator (Section 5.1) in C++ with about 2,400 SLoC. We deploy SARepair on Amazon EC2 [1] with 13 instances of type t4g.xlarge within a data center in the Hong Kong region. Each instance is equipped with 4 vCPUs and 16 GB RAM and runs Ubuntu 16.04 LTS. The network bandwidth across nodes is around 5 Gb/s (measured by iperf [20]). Among the 13 instances, one acts as the center server, and the remaining 12 act as storage nodes. We use the Linux command tc [21] to limit the bandwidth of each link, with bandwidth values based on three common hot storage workloads in Reference [34]. We also implement RS codes based on Jerasure 1.2 [7] and use TCP sockets to realize network transmission. We adopt RS(6, 3) (used in QFS [24] and HDFS [12]) and set the block size to 4 MB by default. We measure the average recovery throughput (the amount of repaired data per unit time) and the memory overhead of each recovery batch, and report the average results over five runs.
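For reference, per-link bandwidth capping of this kind can be scripted from the testbed code; the sketch below shells out to tc's token-bucket filter (tbf), where the device name, rate, burst, and latency values are placeholders, and the exact commands used in our testbed may differ.

```cpp
#include <cstdlib>
#include <string>

// Cap a device's outgoing bandwidth with tc's token-bucket filter (tbf).
// Must run as root; "eth0" and "500mbit" are placeholders, and the burst
// and latency values are arbitrary illustrative choices.
void limit_bandwidth(const std::string& dev, const std::string& rate) {
    std::string cmd = "tc qdisc replace dev " + dev +
                      " root tbf rate " + rate +
                      " burst 32kbit latency 50ms";
    std::system(cmd.c_str());
}

int main() {
    limit_bandwidth("eth0", "500mbit");
    return 0;
}
```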
Experiment 6 (Ratios of running and recovery time): Figure 13 shows the ratio of time taken by solution generation and data recovery.
Fig. 13. Experiment 6 (ratios of running and recovery time).
We first observe that as the coding parameters increase, the ratio of running time in EG increases while that in SARepair decreases. The rationale is that each optimization round of EG requires traversing \(s \cdot C(k+m-1, k)\) solutions, whereas the time complexity of SARepair is independent of the coding parameters. Although SA can reduce the recovery time by tolerating worse solutions, it needs more time to generate the recovery solution: the running time of SA accounts for 39.34% of the total time on average, which implies that reducing the running time is also crucial. The snippet below puts the per-stripe solution count in concrete numbers.
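This is an illustration, not EG's code; it simply evaluates \(C(k+m-1, k)\) for the tested RS configurations.

```cpp
#include <cstdio>

// Per-stripe solution count C(k+m-1, k): the number of ways to choose k
// helper blocks from the k+m-1 surviving blocks of a stripe.
unsigned long long binom(unsigned n, unsigned r) {
    unsigned long long result = 1;
    for (unsigned i = 1; i <= r; ++i)
        result = result * (n - r + i) / i;  // exact at every step
    return result;
}

int main() {
    const unsigned cfg[][2] = {{2, 2}, {4, 2}, {6, 3}};  // RS(k, m)
    for (const auto& km : cfg)
        std::printf("RS(%u,%u): C(%u,%u) = %llu helper sets per stripe\n",
                    km[0], km[1], km[0] + km[1] - 1, km[0],
                    binom(km[0] + km[1] - 1, km[0]));
    return 0;
}
```

For RS(6, 3), this yields \(C(8, 6) = 28\) candidate helper sets per stripe, and EG traverses s times this count per optimization round.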
Experiment 7 (Impact of block size): To investigate how the block size affects the benefit of SARepair, we compare SARepair with EG and SA in terms of recovery throughput under different block sizes. By default, we fix the number of stripes to 200 and use RS(6, 3). Figure 14 shows the recovery throughput for the block sizes from 1 to 16 MB.
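For reference, recovery throughput is simply the amount of repaired data per unit time (as defined above). As a purely illustrative computation, repairing 200 failed blocks of 4 MB in 10 s would give
\[ \text{throughput} = \frac{200 \times 4\,\text{MB}}{10\,\text{s}} = 80\,\text{MB/s}. \]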
Fig. 14. Experiment 7 (impact of block size).
We find that the recovery throughput gradually increases with the block size for all techniques. This is because the larger the block size, the greater the ratio of transmission time in the entire recovery, so generating effective recovery solutions to reduce transmission time becomes more important. Accordingly, the larger the block size, the more significant the performance improvement of SARepair in heterogeneous network environments: compared to EG, SARepair improves the performance by 59.97% when the block size is 1 MB and by 81.52% when it is 16 MB.
Experiment 8 (Impact of the number of nodes): We study the impact of varying the number of nodes from 10 to 14. Figure 15 shows the experimental results. The recovery throughput increases with the number of nodes, because more nodes lead to a more scattered distribution of blocks, so fewer blocks are read from each node. Overall, SARepair improves the recovery throughput by 37.50–62.07% and 14.04–42.83% compared to EG and SA, respectively. Therefore, increasing the number of nodes does not diminish SARepair's gains.
Fig. 15. Experiment 8 (impact of the number of nodes).
Experiment 9 (Impact of changing bandwidth): In practical erasure-coded clusters, the network bandwidths of storage nodes change over time, and the bandwidth values may not be measured accurately. Therefore, it is necessary to ensure that an efficient recovery solution found under inaccurate bandwidths remains efficient under the actual bandwidths. To simulate changing network environments, we define the bandwidth change ratio (BCR) as
\begin{equation} BCR = \frac{\text{inaccurate bandwidth}}{\text{accurate bandwidth}}. \tag{3} \end{equation}
In the experiment, we set five BCR ranges, including [1.0, 1.0] (the bandwidth is accurately measured), [0.9, 1.1], [0.8, 1.2], [0.7, 1.3], and [0.6, 1.4], to evaluate SARepair. After the center server obtains the current accurate bandwidths, we multiply each node's bandwidth by a random value in the BCR range and limit the node's bandwidth accordingly (a sampling sketch follows). Figure 16 reports the results under different coding parameters.
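The sketch below illustrates how an "inaccurate" bandwidth can be sampled from a BCR range around the accurate value; the base bandwidth value is hypothetical.

```cpp
#include <cstdio>
#include <random>

// Sample an "inaccurate" bandwidth from a BCR range around the accurate
// value, as in Equation (3); the 800 Mb/s base value is hypothetical.
int main() {
    std::mt19937 rng(std::random_device{}());
    std::uniform_real_distribution<double> bcr(0.8, 1.2);  // BCR range [0.8, 1.2]
    double accurate_mbps   = 800.0;
    double inaccurate_mbps = accurate_mbps * bcr(rng);
    std::printf("accurate = %.0f Mb/s, inaccurate = %.0f Mb/s\n",
                accurate_mbps, inaccurate_mbps);
    return 0;
}
```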
Fig. 16. Experiment 9 (impact of changing bandwidth).
Obviously, a changing network environment with a larger BCR range has more influence on the effectiveness of SARepair's recovery solution. We observe that the changing bandwidth environments impair the recovery throughput by at most 10.8%. The results show that SARepair can adapt to changing bandwidth in dynamic hot storage clusters.
Experiment 10 (Breakdown analysis of SARepair): In Section 4, we design three algorithms to improve the recovery performance. To analyze the effect of each designed algorithm on performance, we evaluate them individually. Specifically, the three variants are abbreviated as follows: (i) RRT: only use Algorithm 1 to reduce the recovery time, sequentially read the helper blocks on each node, and randomly select stripes for each batch; (ii) RMO: use SA to generate the recovery solution and adopt Algorithm 2 to reduce the memory overhead; and (iii) BAT: use SA to generate the recovery solution and adopt Algorithm 3 to selectively arrange appropriate stripes for each batch. SARepair is thus the synthesis of RRT+RMO+BAT. In addition, we set the maximum memory usage of the center server to 1 GB; when the available memory is insufficient, the center server delays the recovery of some stripes to release memory (a simplified sketch follows). We again use the default settings, and Figure 17 shows the results.
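The sketch below shows the 1 GB cap as a simplified admission-control loop (our own simplification: fixed-size helper buffers and no modeled buffer release); stripes whose helper buffers would exceed the budget are deferred.

```cpp
#include <cstdio>
#include <queue>

// Simplified admission control on the center server: each stripe needs
// helpers * block_size bytes of buffer; stripes that would push memory
// use past the 1 GB cap are deferred until earlier repairs free buffers
// (the release step is omitted here for brevity).
int main() {
    const long long kBudget = 1LL << 30;   // 1 GB cap
    const long long kBlock  = 4LL << 20;   // 4 MB blocks
    const int kHelpers      = 6;           // e.g., k = 6 for RS(6, 3)
    long long in_use = 0;
    std::queue<int> deferred;
    for (int stripe = 0; stripe < 300; ++stripe) {
        long long need = static_cast<long long>(kHelpers) * kBlock;
        if (in_use + need > kBudget) { deferred.push(stripe); continue; }
        in_use += need;                    // admitted; freed after its repair
    }
    std::printf("deferred %zu of 300 stripes\n", deferred.size());
    return 0;
}
```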
Fig. 17. Experiment 10 (breakdown analysis of SARepair).
We observe that RRT has the highest recovery throughput of the three designed algorithms, showing that optimizing the recovery solution is more effective than reducing memory overhead or arranging appropriate stripes. Compared with SA, RRT improves the recovery performance by 25.91%, while RMO improves it by only 8.02% and BAT by only 17.09%. Moreover, SARepair has the highest recovery throughput overall, improving the recovery performance by 33.61%. The results also show that the three designed algorithms of SARepair are mutually complementary and do not compromise each other's effectiveness.
Experiment 11 (Recovery in rack clusters): We further conduct experiments on Amazon EC2 in real rack clusters. As shown in Table 1 in Section 4.1.2, we create instances in five Asia Pacific regions, where every three instances reside in the same rack. The network bandwidth between instances is also shown in Table 1. We adopt RS(4, 2) and RS(6, 3) and randomly distribute blocks across the five regions while maintaining rack-level fault tolerance. Figure 18 shows that SARepair reduces the recovery time by 39.62% and 25.14% on average compared to EG and SA, respectively. Therefore, SARepair maintains its performance gain in rack clusters.
Fig. 18. Experiment 11 (recovery in rack clusters).
Experiment 12 (Analysis of memory consumption): We finally evaluate the maximum memory overhead under different coding parameters and numbers of stripes. Figure 19 shows the experimental results.
Fig. 19. Experiment 12 (analysis of memory consumption).
As shown in Figure 19(a), the maximum memory usage of all techniques increases significantly with the coding parameters, because the center server needs to read more helper blocks to repair a failed block under larger coding parameters. Compared to the variants without Algorithm 2, the three techniques using Algorithm 2 significantly reduce the maximum memory overhead: Algorithm 2 reduces it by 24.09–62.45% for EG, 21.42–39.73% for SA, and 19.39–54.78% for SARepair. As shown in Figure 19(b), and consistent with the previous analysis, the optimization of Algorithm 2 significantly reduces the memory overhead during recovery, which is significant for reducing the server's workload. Moreover, the maximum memory overhead of SARepair is less than 1 GB, which is acceptable for general servers.
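For intuition, a back-of-the-envelope upper bound (our own illustration, assuming a batch of 250 stripes under RS(6, 3) with 4 MB blocks and all helper blocks of a batch buffered simultaneously):
\[ 250 \times 6 \times 4\,\text{MB} \approx 5.9\,\text{GB}, \]
which illustrates why scheduling the reading order of helper blocks (Algorithm 2) matters for staying under a 1 GB cap.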

6 Related Work

There have been extensive studies on improving the recovery performance of erasure-coded clusters via (i) reducing recovery traffic and (ii) accelerating the recovery process without reducing traffic.
Reducing Recovery Traffic. Many studies design efficient coding constructions to reduce recovery traffic. For example, Regenerating Codes [6] divide a block into multiple sub-blocks and encode them using complex linear matrices to minimize the recovery traffic. The LRC code [15] divides data blocks into multiple groups and repairs a failed block within one group. Compared to these new coding constructions, SARepair mainly targets the RS code [25] and can potentially support the LRC code (Section 4.3). Another group of works focuses on rack-architecture clusters and proposes corresponding techniques to reduce cross-rack recovery traffic. Specifically, CAR [29] and ClusterSR [28] carefully select the helper blocks based on the block layout and perform intra-rack data aggregation in rack clusters to minimize the cross-rack recovery traffic. \(D^{3}\) [17] and PDL [32] design a uniform data layout via combinatorial theory to deterministically distribute blocks and balance the recovery traffic across nodes. In contrast, SARepair does not reduce the recovery traffic and instead focuses on reducing recovery time. We would like to extend SARepair to the rack-architecture cluster in future work (Section 7).
Accelerating Recovery Process. Many fast recovery techniques are proposed to accelerate the recovery process and reduce recovery time. PPR [22] divides the recovery process into several independent parts and executes them in parallel. RP [16] further makes the recovery process fine-grained and pipelines the transmission, so that the complexity of the recovery time approaches \(O(1)\). SMFRepair [37] uses idle nodes to bypass the low-bandwidth links, and PPT [3] and PivotRepair [34] adopt tree-structure strategies to parallelize repair processes in heterogeneous bandwidth networks. However, the above recovery techniques are only suitable for single-stripe recovery, while the multi-stripe failures targeted by SARepair are also common because of full-node failures [19] and lazy repair [30]. RepairBoost [19] proposes a DAG abstraction to achieve traffic balancing and scheduling for full-node failure. SelectiveEC [33] uses the bipartite graph model and perfect matching theories to improve the recovery performance for single-node failure and heterogeneous network environments. RepairBoost and SelectiveEC design recovery solutions for mesh topology networks [23], while SARepair is aimed at star topology networks [18]. The most closely related works are EG [41] and SA [10]. In comparison, SARepair achieves high search efficiency and recovery performance simultaneously.

7 Conclusion

In this article, we propose SARepair, an effective approach that reduces the recovery time and memory overhead. SARepair reasonably chooses the per-stripe recovery solutions across multiple stripes to reduce the recovery time and uses a memory optimization algorithm to reduce the memory overhead. Experiments demonstrate the multi-stripe recovery efficiency of SARepair over the state of the art.

Footnote

1. This journal version adopts different experiments and network environments. The experimental results in the journal version are different from those in the conference version [38].

References

[1] Amazon. 2022. Amazon EC2. Retrieved from https://aws.amazon.com/
[2] Apache. 2020. Apache Hadoop 3.1.4. Retrieved from https://hadoop.apache.org/docs/r3.1.4/
[3] Yunren Bai, Zihan Xu, Haixia Wang, and Dongsheng Wang. 2019. Fast recovery techniques for erasure-coded clusters in non-uniform traffic network. In Proceedings of the 48th International Conference on Parallel Processing. 1–10.
[4] Ceph. 2014. Erasure coding in Ceph. Retrieved from https://ceph.com/planet/erasure-coding-in-ceph/
[5] Mosharaf Chowdhury, Srikanth Kandula, and Ion Stoica. 2013. Leveraging endpoint flexibility in data-intensive clusters. ACM SIGCOMM Comput. Commun. Rev. 43, 4 (2013), 231–242.
[6] Alexandros G. Dimakis, P. Brighten Godfrey, Yunnan Wu, Martin J. Wainwright, and Kannan Ramchandran. 2010. Network coding for distributed storage systems. IEEE Trans. Inf. Theory 56, 9 (2010), 4539–4551.
[7] EECS. 2014. Jerasure. Retrieved from http://web.eecs.utk.edu
[8] Daniel Ford, François Labelle, Florentina I. Popovici, Murray Stokely, Van-Anh Truong, Luiz Barroso, Carrie Grimes, and Sean Quinlan. 2010. Availability in globally distributed storage systems. In Proceedings of the 9th USENIX Symposium on Operating Systems Design and Implementation (OSDI'10).
[9] Yingxun Fu, Xun Liu, Jiwu Shu, Zhirong Shen, Shiye Zhang, Jun Wu, Jianyong Duan, and Li Ma. 2020. Device and placement aware framework to optimize single failure recoveries and reads for erasure coded storage system with heterogeneous storage devices. In Proceedings of the International Symposium on Reliable Distributed Systems (SRDS'20). IEEE, 225–235.
[10] Yingxun Fu, Jiwu Shu, and Xianghong Luo. 2014. A stack-based single disk failure recovery scheme for erasure coded storage systems. In Proceedings of the IEEE 33rd International Symposium on Reliable Distributed Systems. IEEE, 136–145.
[11] Google. 2021. Google Datacenters. Retrieved from http://www.google.com/about/datacenters/
[12] HDFS. 2021. HDFS Erasure Coding. Retrieved from https://hadoop.apache.org
[13] Yuchong Hu, Liangfeng Cheng, Qiaori Yao, Patrick P. C. Lee, Weichun Wang, and Wei Chen. 2021. Exploiting combined locality for wide-stripe erasure coding in distributed storage. In Proceedings of the USENIX Conference on File and Storage Technologies (FAST'21). 233–248.
[14] Yuchong Hu, Xiaolu Li, Mi Zhang, Patrick P. C. Lee, Xiaoyang Zhang, Pan Zhou, and Dan Feng. 2017. Optimal repair layering for erasure-coded data centers: From theory to practice. ACM Trans. Stor. 13, 4 (2017).
[15] Cheng Huang, Huseyin Simitci, Yikang Xu, Aaron Ogus, Brad Calder, Parikshit Gopalan, Jin Li, and Sergey Yekhanin. 2012. Erasure coding in Windows Azure Storage. In Proceedings of the USENIX Annual Technical Conference (USENIX ATC'12). 15–26.
[16] Runhui Li, Xiaolu Li, Patrick P. C. Lee, and Qun Huang. 2017. Repair pipelining for erasure-coded storage. In Proceedings of the USENIX Annual Technical Conference (USENIX ATC'17). 567–579.
[17] Zhipeng Li, Min Lv, Yinlong Xu, Yongkun Li, and Liangliang Xu. 2019. D3: Deterministic data distribution for efficient data reconstruction in erasure-coded distributed storage systems. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS'19). IEEE, 545–556.
[18] Cong Lin, Lirong Cui, David W. Coit, and Min Lv. 2017. Performance analysis for a wireless sensor network of star topology with random nodes deployment. Wireless Pers. Commun. 97, 3 (2017), 3993–4013.
[19] Shiyao Lin, Guowen Gong, Zhirong Shen, Patrick P. C. Lee, and Jiwu Shu. 2021. Boosting full-node repair in erasure-coded storage. In Proceedings of the USENIX Annual Technical Conference (USENIX ATC'21). 641–655.
[20] Linux. 2022. iperf. Retrieved from https://github.com/esnet/iperf
[21] Linux man. 2023. Linux TC. Retrieved from https://linux.die.net/man/8/tc
[22] Subrata Mitra, Rajesh Panta, Moo-Ryong Ra, and Saurabh Bagchi. 2016. Partial-parallel-repair (PPR): A distributed technique for repairing erasure coded storage. In Proceedings of the 11th European Conference on Computer Systems. 1–16.
[23] Saad Mubeen and Shashi Kumar. 2010. Designing efficient source routing for mesh topology network on chip platforms. In Proceedings of the 13th Euromicro Conference on Digital System Design: Architectures, Methods and Tools. IEEE, 181–188.
[24] Michael Ovsiannikov, Silvius Rus, Damian Reeves, Paul Sutter, Sriram Rao, and Jim Kelly. 2013. The Quantcast file system. Proc. VLDB Endow. 6, 11 (2013), 1092–1101.
[25] Irving S. Reed and Gustave Solomon. 1960. Polynomial codes over certain finite fields. J. Soc. Industr. Appl. Math. 8, 2 (1960), 300–304.
[26] Yingdi Shan, Kang Chen, Tuoyu Gong, Lidong Zhou, Tai Zhou, and Yongwei Wu. 2021. Geometric partitioning: Explore the boundary of optimal erasure code repair. In Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles. 457–471.
[27] Zhirong Shen, Xiaolu Li, and Patrick P. C. Lee. 2019. Fast predictive repair in erasure-coded storage. In Proceedings of the 49th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'19). IEEE, 556–567.
[28] Zhirong Shen, Jiwu Shu, Zhijie Huang, and Yingxun Fu. 2020. ClusterSR: Cluster-aware scattered repair in erasure-coded storage. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS'20). IEEE, 42–51.
[29] Zhirong Shen, Jiwu Shu, and Patrick P. C. Lee. 2016. Reconsidering single failure recovery in clustered file systems. In Proceedings of the 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'16). IEEE, 323–334.
[30] Mark Silberstein, Lakshmi Ganesh, Yang Wang, Lorenzo Alvisi, and Mike Dahlin. 2014. Lazy means smart: Reducing repair bandwidth costs in erasure-coded distributed storage. In Proceedings of the International Conference on Systems and Storage. 1–7.
[31] Zhufan Wang, Guangyan Zhang, Yang Wang, Qinglin Yang, and Jiaji Zhu. 2019. Dayu: Fast and low-interference data recovery in very-large storage systems. In Proceedings of the USENIX Annual Technical Conference (USENIX ATC'19). 993–1008.
[32] Liangliang Xu, Min Lv, Zhipeng Li, Cheng Li, and Yinlong Xu. 2020. PDL: A data layout towards fast failure recovery for erasure-coded distributed storage systems. In Proceedings of the IEEE Conference on Computer Communications (INFOCOM'20). IEEE, 736–745.
[33] Liangliang Xu, Min Lyu, Qiliang Li, Lingjiang Xie, Cheng Li, and Yinlong Xu. 2021. SelectiveEC: Towards balanced recovery load on erasure-coded storage systems. IEEE Trans. Parallel Distrib. Syst. 33, 10 (2021), 2386–2400.
[34] Qiaori Yao, Yuchong Hu, Xinyuan Tu, Patrick P. C. Lee, Dan Feng, Xia Zhu, Xiaoyang Zhang, Zhen Yao, and Wenjia Wei. 2022. PivotRepair: Fast pipelined repair for erasure-coded hot storage. In Proceedings of the IEEE 42nd International Conference on Distributed Computing Systems (ICDCS'22). IEEE, 614–624.
[35] Guangyan Zhang, Keqin Li, Jingzhe Wang, and Weimin Zheng. 2013. Accelerate RDP RAID-6 scaling by reducing disk I/Os and XOR operations. IEEE Trans. Comput. 64, 1 (2013), 32–44.
[36] Hai Zhou and Dan Feng. 2023. Boosting erasure-coded multi-stripe repair in rack architecture and heterogeneous clusters: Design and analysis. IEEE Trans. Parallel Distrib. Syst. (2023).
[37] Hai Zhou, Dan Feng, and Yuchong Hu. 2022. Bandwidth-aware scheduling repair techniques in erasure-coded clusters: Design and analysis. IEEE Trans. Parallel Distrib. Syst. 33, 12 (2022), 3333–3348.
[38] Hai Zhou, Dan Feng, and Yuchong Hu. 2022. A stripe-schedule aware repair technique in the heterogeneous network for erasure-coded clusters. In Proceedings of the IEEE 40th International Conference on Computer Design (ICCD'22). IEEE, 664–671.
[39] Hai Zhou, Dan Feng, and Yuchong Hu. 2023. MDTUpdate: A multi-block double tree update technique in heterogeneous erasure-coded clusters. IEEE Trans. Comput. 72, 10 (2023), 2808–2821.
[40] Yunfeng Zhu, Patrick P. C. Lee, Liping Xiang, Yinlong Xu, and Lingling Gao. 2012. A cost-based heterogeneous recovery scheme for distributed storage systems with RAID-6 codes. In Proceedings of the IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'12). IEEE, 1–12.
[41] Yunfeng Zhu, Jian Lin, Patrick P. C. Lee, and Yinlong Xu. 2014. Boosting degraded reads in heterogeneous erasure-coded storage systems. IEEE Trans. Comput. 64, 8 (2014), 2145–2157.
