We conduct a series of intensive tests to evaluate SARepair. We choose the greedy algorithm EG [41] and the meta-heuristic algorithm SA [10] as references. For SARepair, the parameter e is set to 3, the batch size b is set to 200, and the number of iterations l is set to 50 for each batch. The parameters of SA are determined by grid search. To evaluate SARepair under real heterogeneous network environments, we select three common hot storage workloads (SWIM, TPC-DS, and TPC-H) from Reference [34] and randomly select a set of bandwidth settings to set the bandwidth of each node. This method has also been used in recent studies [39] to mimic heterogeneous networks. We erase one node to mimic a failure and measure the running time, recovery time, recovery throughput, and maximum memory overhead. The running time refers to the generation time of a multi-stripe recovery solution. All results are averaged over five runs, and the figures also show the maximum and minimum values (some may be invisible because the values are small). Time is measured in seconds, and memory overhead is measured in MB. We expect to answer the following five questions:
— What is the impact of parameters e and b on SARepair? (Experiments 2 and 3)
— Can SARepair quickly generate a solution and effectively reduce recovery time when any cluster configuration varies? (Experiments 1 and 4–6)
— How much recovery performance can SARepair improve in real-world heterogeneous network environments and rack clusters? (Experiments 7–9 and 11)
— How do the three designed algorithms of SARepair affect the recovery performance? (Experiment 10)
— How much memory usage can be reduced in the three recovery techniques by Algorithm 2? (Experiment 12)
5.1 Simulations
We implement a simulator for SARepair in Python with about 2,300 SLoC and deploy it on a server. By default, we randomly distribute the blocks of \(s=1,000\) stripes containing the lost block across \(n=12\) nodes and use RS(6, 3) (used in QFS [24] and HDFS [12]), unless mentioned otherwise. We set the block size to 4 MB, which is a moderate size for real erasure-coded clusters [26]. The simulator does not implement data transmission or coding operations, which are left to Section 5.2. We measure the running time of the algorithms and obtain the total recovery time of all failed stripes. All results are averaged over five runs, and time is measured in seconds.
Experiment 1 (Efficiency of SARepair): We first evaluate the efficiency of SARepair compared with the enumeration method. We select RS(6, 3) and vary the number of stripes from 4 to 8. The results are shown in Table 2.
Regarding recovery time, SARepair obtains the optimal recovery solution with the same minimum read time as enumeration for \(s = 4, 6\). When \(s = 8\), the recovery time of the optimal solution is less than that of SARepair by at most 4.5%. Regarding running time, we observe that the running time of SARepair is negligible at less than 0.1 s. In contrast, the running time of enumeration grows exponentially as the number of stripes increases.
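This exponential growth can be illustrated with a back-of-envelope count. The sketch below assumes each of the s stripes independently selects k helper blocks from its k + m - 1 surviving blocks (consistent with the per-stripe count \(C(k+m-1, k)\) used elsewhere in this section); the joint enumeration space is then the product over stripes.

```python
from math import comb

def enumeration_space(k, m, s):
    """Candidate multi-stripe solutions if each of the s stripes independently
    selects k helper blocks from its k + m - 1 surviving blocks."""
    return comb(k + m - 1, k) ** s

# RS(6, 3): each stripe has C(8, 6) = 28 candidate repair sets, so the joint
# space grows as 28^s -- already hundreds of billions of solutions at s = 8
space = {s: enumeration_space(6, 3, s) for s in (4, 6, 8)}
```

Even at the small stripe counts of Table 2, the space grows by a factor of 784 for every two additional stripes, which is why enumeration quickly becomes intractable.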
Experiment 2 (Impact of parameter \(\boldsymbol {e}\)): We next evaluate SARepair with the parameter e varied from 1 to 5. We vary the erasure codes (RS(2, 2), RS(4, 2) (a typical RAID-6 [35] setup), and RS(6, 3) (used in QFS and HDFS)), the number of nodes (from 9 to 15), and the number of stripes (from 500 to 1,500) for testing and further analyze the total time. Because some experimental results have very large values, they are not shown in Figure 9(b) and (c).
In Figure 9(a), when e is small (\(e \lt 3\)), the running time of SARepair does not change much with e. When \(e \gt 3\), the running time of SARepair increases exponentially. We observe that the running times of different erasure codes are very similar, because the time complexity of Algorithm 1 is independent of the encoding parameters (k and m). The overall recovery time of SARepair decreases with e, dropping the most when e goes from 1 to 2. When \(e=1\), SARepair behaves like a greedy algorithm; when \(e \gt 1\), SARepair tolerates worse solutions to escape local optima.
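As an illustration of this accept-worse behavior, the following sketch shows one possible acceptance rule. The cost function, the tolerance threshold, and the way e enters the acceptance probability are all hypothetical, not SARepair's actual rule; the point is only that \(e = 1\) degenerates to pure greedy, while \(e \gt 1\) occasionally accepts a slightly worse solution.

```python
import random

def accept(current_cost, candidate_cost, e, tolerance=0.05):
    """Hypothetical accept-worse rule: always take improvements; when e > 1,
    also accept a slightly worse candidate with some probability so the
    search can escape a local optimum. Threshold and probability shape are
    illustrative, not SARepair's actual parameters."""
    if candidate_cost <= current_cost:
        return True                      # improvement: always accept
    if e <= 1:                           # e = 1 degenerates to pure greedy
        return False
    worse_ratio = (candidate_cost - current_cost) / current_cost
    # tolerate only mildly worse candidates, more often for larger e
    return worse_ratio <= tolerance and random.random() < 1 - 1 / e
```

With this shape, a larger e both widens the search (more worse candidates accepted) and lengthens it, matching the running-time/recovery-time tradeoff observed above.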
In Figure 9(b), the more nodes there are, the longer the running time and the shorter the recovery time. The main reason is that SARepair generates the recovery solution by constructing an \(s \cdot n\) metadata matrix, and the larger n is, the more traversals it takes. Meanwhile, each node stores and transmits fewer blocks of failed stripes when the cluster has more nodes. The recovery time of SARepair decreases dramatically with e when \(1 \lt e \lt 3\), while it decreases linearly when \(e \gt 3\).
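The \(s \cdot n\) metadata matrix can be pictured with a minimal sketch. Here we assume each stripe's \(k+m\) blocks are placed on distinct, randomly chosen nodes; the 0/1 matrix layout is our own illustration, not SARepair's internal representation.

```python
import random

def build_metadata_matrix(s, n, k, m):
    """Illustrative s-by-n placement matrix: entry [i][j] = 1 if node j
    stores a block of stripe i. Assumes each stripe's k + m blocks land on
    distinct, randomly chosen nodes (a simplification)."""
    matrix = [[0] * n for _ in range(s)]
    for i in range(s):
        for j in random.sample(range(n), k + m):   # k + m distinct nodes
            matrix[i][j] = 1
    return matrix

random.seed(42)
M = build_metadata_matrix(s=1000, n=12, k=6, m=3)
# each stripe occupies exactly k + m = 9 of the 12 nodes
assert all(sum(row) == 9 for row in M)
```

Any per-stripe traversal of this matrix costs O(n), which is why the running time in Figure 9(b) grows with the number of nodes.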
In Figure 9(c), when e is greater than 3, the running times significantly increase for \(s = 1,000\) and \(s = 1,500\). Moreover, increasing e does not significantly reduce the recovery time when handling small-scale stripes. The reason is that the fewer stripes there are, the smaller the optimization space for SARepair, which also explains why the corresponding running time for \(s=500\) is significantly smaller than for \(s=1,000\) and \(s=1,500\).
In Figure 9(d), we use the default configurations, measure the total time (running time + recovery time), and break the total time down into running time and recovery time. We find that the total time first decreases and then increases sharply with e, with the turning point at \(e=3\). In addition, although the recovery time dominates the total time, the ratio of running time increases with e; for example, it reaches 45% when \(e=5\). Therefore, a large value of e can reduce recovery time at the cost of running time, and vice versa.
Our experiments show a fundamental tradeoff between recovery efficiency and performance, in which the parameter e plays a key role. We therefore set \(e=3\) in the experiments to generate the recovery solution quickly and effectively. We suggest that system designers select the parameter e based on the current system configuration and data status. SARepair can be deployed with a smaller e in online or hot storage systems that require failed blocks to be repaired quickly. It can also be deployed with a larger e in offline or cold storage systems, where the center server can spend more time computing recovery solutions with shorter recovery times, reducing the occupation of overall cluster resources during the repair process.
Experiment 3 (Impact of batch size \(\boldsymbol {b}\)): We further evaluate the impact of batch sizes from 50 to 1,000. We choose factors of the total number of stripes (\(s=1,000\)) as the batch sizes, namely b = 50, 100, 200, 250, 500, and 1,000. Note that when b is 1,000 (equal to \(s=1,000\)), it is equivalent to repairing all failed stripes concurrently. We again use the default parameters and vary the coding parameters for testing.
As shown in Figure 10(a), the running time of SARepair gradually decreases as b grows from 50 to 250 and then increases from 250 to 1,000. The difference in running time under different encoding parameters is not significant. The reason is that the overall time complexity of SARepair is \(O(s \cdot l \cdot o \cdot e \cdot n + \frac{s^{2} \cdot l}{b})\) (see Section 4.3), which is dominated by the other parameters (\(s, l, o, e, n, b\)) rather than the coding parameters. To further analyze the trend of the curves, we measure the number of optimization rounds o for each batch under different batch sizes b and show the results in Figure 10(b). We observe that o increases slowly with b from \(b=50\) to \(b=250\), while it increases dramatically when \(b=500\) and \(b=1,000\). This indicates that when b is small, o is small and the running time is dominated by \(\frac{s^{2} \cdot l}{b}\) (its downward trend outweighs the upward trend of \(s \cdot l \cdot o \cdot e \cdot n\)), so the running time decreases; when b is large, o is large and the running time is dominated by \(s \cdot l \cdot o \cdot e \cdot n\) (its upward trend outweighs the downward trend of \(\frac{s^{2} \cdot l}{b}\)), so the running time increases.
As shown in Figure 10(c), increasing the batch size within a specific range can reduce the recovery time. For example, from 50 to 250, the reduction ratio of recovery time is 18.67%, 27.01%, and 27.78% for RS(2, 2), RS(4, 2), and RS(6, 3), respectively. We also find that when b is greater than 250, a larger batch size increases the recovery time. Batch processing can effectively balance the transmission load among nodes for each batch, thereby further reducing recovery time. To investigate this phenomenon, we calculate the standard deviation of the transmission times among nodes for each batch to measure their load balance; the lower this metric, the more balanced the load. Figure 10(d) shows the results: the standard deviation also first decreases and then increases. When b is 50, the standard deviation is highest, and when b is 250, it is lowest. The smaller the value of b, the more batches there are, which is more likely to cause an imbalance among nodes within each batch, increasing the total recovery time; the larger the value of b, the smaller the optimization space for stripe replacement within each batch, which may also cause an uneven load within the batch and increase the recovery time.
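The load-balance metric used here is simply the standard deviation of the per-node transmission times within one batch; a minimal sketch (the timing values are illustrative):

```python
from statistics import pstdev

def batch_balance(transfer_times):
    """Load-balance metric from the text: the (population) standard
    deviation of per-node transmission times within one batch. Lower
    means the batch's transfer load is spread more evenly."""
    return pstdev(transfer_times)

balanced = [10.1, 9.9, 10.0, 10.2]   # illustrative per-node times (s)
skewed   = [4.0, 6.0, 18.0, 12.0]    # same total work, poorly balanced
```

Because a batch finishes only when its slowest node does, the skewed assignment above would finish much later than the balanced one despite transferring the same total volume.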
Therefore, system designers can measure the optimization round and standard deviation for different batch sizes based on parameter configuration to find turning points and determine the optimal batch size.
Experiment 4 (Impact of coding parameters): We measure the running and recovery time for different coding parameters. We consider three representative configurations of RS(k, m): RS(2, 2), RS(4, 2) (a typical RAID-6 [35] setup), and RS(6, 3) (used in QFS and HDFS). We make three key observations from the results in Figure 11.
First, as shown in Figure 11(a), SARepair significantly reduces the running time of solution generation; for example, the reduction reaches up to 81.71% and 94.83% compared to EG and SA, respectively, for parameters (6, 3). This is because SARepair uses the metadata of blocks to generate a recovery solution, which significantly narrows the solution space and makes the problem practically tractable. Second, as shown in Figure 11(b), SARepair has the lowest recovery time among the three techniques, reducing the recovery time by 21.74–39.39% and 10.21–18.70% on average compared to EG and SA, respectively. The rationale is that SARepair tolerates worse solutions and uses a rollback mechanism to reduce recovery time. Finally, the running and recovery times of all three techniques increase for larger coding parameters.
Experiment 5 (Impact of different erasure codes): We finally evaluate SARepair on a non-RS-coded cluster and validate its efficiency using the LRC code [15]. For LRC(\(k, l, g\)), we select LRC(4, 2, 2) and LRC(6, 2, 2) in this evaluation, which are also used in previous studies [36, 37]. When a data block or local parity block fails, we repair it within its group, and the helper blocks are fixed. When a global parity block fails, we repair it by selecting k blocks.
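The repair rule above can be sketched as follows; the block names and the group representation are hypothetical, chosen only to make the two cases (local repair vs. global-parity repair) concrete.

```python
def lrc_helpers(failed, groups, data_blocks):
    """Sketch of the LRC repair rule from the text: a data or local parity
    block is repaired from the other blocks of its local group (helpers
    fixed); a global parity block is repaired from k data blocks.
    Representation is illustrative."""
    if failed in groups:                  # data or local parity block
        return groups[failed] - {failed}  # repair within its local group
    return set(data_blocks)               # global parity: read k data blocks

# LRC(4, 2, 2): data d0..d3 in two local groups, local parities p0 and p1
g0 = {"d0", "d1", "p0"}
g1 = {"d2", "d3", "p1"}
groups = {**{b: g0 for b in g0}, **{b: g1 for b in g1}}
data = ["d0", "d1", "d2", "d3"]
```

Because the helper set is fixed for local repairs, SARepair's remaining freedom under LRC lies in how stripes are assigned to batches, which is exactly what the next paragraph measures.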
We find that SARepair remains effective in reducing running and recovery time for different erasure codes: it reduces the running time by 28.33–93.65% and the recovery time by 8.45–24.26% compared to EG and SA in Figure 12. Although the LRC code needs all remaining blocks of a group to repair a data block or local parity block, SARepair can reduce the recovery time by carefully arranging appropriate stripes for each repair batch. Therefore, SARepair can adapt to different erasure codes and improve the recovery of the LRC code.
5.2 Amazon EC2 Experiments
We further implement SARepair by extending our simulator (Section 5.1) in C++ with about 2,400 SLoC. We deploy SARepair on Amazon EC2 [1] with 13 instances of type t4g.xlarge within a data center in the Hong Kong region. Each instance is equipped with 4 vCPUs and 16 GB RAM and runs Ubuntu 16.04 LTS. The network bandwidth across nodes is around 5 Gb/s (measured by iperf [20]). Among the 13 instances, one acts as the center server, and the remaining 12 act as storage nodes. We use the Linux command tc [21] to limit the bandwidth of each link, with bandwidth values based on three common hot storage workloads from Reference [34]. We also implement RS codes based on Jerasure 1.2 [7] and use TCP sockets for network transmission. We adopt RS(6, 3) (used in QFS [24] and HDFS [12]) and set the block size to 4 MB by default. We measure the average recovery throughput (the amount of repaired data per unit time) and the memory overhead of each recovery batch, and report the average results over five runs.
Experiment 6 (Ratios of running and recovery time): Figure 13 shows the ratio of time taken by solution generation and data recovery.
We first observe that as the coding parameters increase, the ratio of running time in EG increases, while the running time ratio of SARepair decreases. The rationale is that each optimization of EG requires traversing \(s \cdot C(k+m-1, k)\) solutions, whereas the time complexity of SARepair is independent of the coding parameters. Although SA can reduce the recovery time by tolerating worse solutions, it needs more time to generate the recovery solution. The running time of SA accounts for an average of 39.34% of the total time, which implies that reducing the running time is also crucial.
Experiment 7 (Impact of block size): To investigate how the block size affects the benefit of SARepair, we compare SARepair with EG and SA in terms of recovery throughput under different block sizes. By default, we fix the number of stripes to 200 and use RS(6, 3). Figure 14 shows the recovery throughput for block sizes from 1 to 16 MB.
We find that the recovery throughput of all techniques gradually increases with the block size. This is because the larger the block size, the greater the ratio of transmission time in the entire recovery, so it is crucial to generate effective recovery solutions that reduce transmission time. The larger the block size, the more significant the performance improvement of SARepair in heterogeneous network environments; for example, compared to EG, SARepair improves the performance by 59.97% when the block size is 1 MB and by 81.52% when the block size is 16 MB.
Experiment 8 (Impact of the number of nodes): We study the impact of varying the number of nodes from 10 to 14. Figure 15 shows the experimental results. The recovery throughput increases with the number of nodes, because more nodes result in a more scattered distribution of blocks, so fewer blocks are read from each node. Overall, SARepair improves the recovery throughput by 37.50–62.07% and 14.04–42.83% compared to EG and SA, respectively. Therefore, increasing the number of nodes does not diminish SARepair's gains.
Experiment 9 (Impact of changing bandwidth): In practical erasure-coded clusters, the network bandwidths of storage nodes change over time, and the bandwidth values may not be measured accurately. Therefore, it is necessary to ensure that an efficient recovery solution found under inaccurate bandwidths remains efficient under the actual bandwidths. To simulate the changing network environments, we define the bandwidth change ratio (BCR) as the ratio of the actual bandwidth to the measured bandwidth. In the experiment, we set five BCR ranges, namely [1.0, 1.0] (the bandwidth is accurately measured), [0.9, 1.1], [0.8, 1.2], [0.7, 1.3], and [0.6, 1.4], to evaluate SARepair. After the center server obtains the current accurate bandwidths, we multiply each current bandwidth value by a random value in the BCR range to limit the bandwidth of each node. We use different coding parameters in Figure 16.
Naturally, a changing network environment with a larger BCR range has a greater influence on the effectiveness of SARepair's recovery solution. However, we observe that the changing bandwidth environments impair the recovery throughput by at most 10.8%. The results show that SARepair can adapt to the changing bandwidths of dynamic hot storage clusters.
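The bandwidth perturbation described above can be sketched as follows; the node count and bandwidth values are illustrative.

```python
import random

def perturb_bandwidths(bandwidths, bcr_range, seed=None):
    """Mimic the experiment's perturbation: scale each node's measured
    bandwidth by a random factor drawn uniformly from the BCR range, so
    the recovery solution is computed on bandwidths that differ from the
    actual ones."""
    rng = random.Random(seed)
    low, high = bcr_range
    return [bw * rng.uniform(low, high) for bw in bandwidths]

measured = [5.0, 2.5, 1.0, 4.0]                 # Gb/s, illustrative
actual = perturb_bandwidths(measured, (0.8, 1.2), seed=7)
```

With the range [1.0, 1.0] the perturbation is a no-op (accurate measurement), while wider ranges such as [0.6, 1.4] model increasingly stale or noisy bandwidth estimates.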
Experiment 10 (Breakdown analysis of SARepair): In Section 4, we design three algorithms to improve the recovery performance. To analyze the effect of each algorithm on performance, we evaluate them individually. Specifically, the three algorithms of SARepair are abbreviated as follows: (i) RRT: only use Algorithm 1 to reduce the recovery time, sequentially read the helper blocks from each node, and randomly select stripes for each batch; (ii) RMO: use SA to generate the recovery solution and adopt Algorithm 2 to reduce the memory overhead; and (iii) BAT: use SA to generate the recovery solution and adopt Algorithm 3 to selectively arrange appropriate stripes for each batch. SARepair is thus the synthesis of RRT+RMO+BAT. In addition, we set the maximum memory usage of the center server to 1 GB; when the available memory is insufficient, the center server delays the recovery of some stripes to release memory. We again use the default settings, and Figure 17 shows the results.
We observe that RRT achieves the highest recovery throughput among the three designed algorithms, showing that optimizing the recovery solution is more effective than reducing memory overhead or arranging appropriate stripes. Compared with SA, RRT improves the recovery performance by 25.91%, while RMO improves it by only 8.02% and BAT by only 17.09%. Moreover, SARepair has the highest recovery throughput overall, improving the recovery performance by 33.61%. The results thus also show that the three designed algorithms of SARepair are mutually complementary and do not compromise each other's effectiveness.
Experiment 11 (Recovery in rack clusters): We further conduct experiments on Amazon EC2 in real rack clusters. As shown in Table 1 in Section 4.1.2, we create a set of instances in five Asia Pacific regions, with every three instances residing in the same rack. The network bandwidth between the instances is also shown in Table 1. We adopt RS(4, 2) and RS(6, 3) and randomly distribute blocks across the five regions while maintaining rack-level fault tolerance. Figure 18 shows that SARepair reduces the recovery time by 39.62% and 25.14% on average compared to EG and SA, respectively. Therefore, SARepair maintains its performance gain in rack clusters.
Experiment 12 (Analysis of memory consumption): We finally evaluate the maximum memory overhead under different coding parameters and numbers of stripes. Figure 19 shows a series of experimental results.
As shown in Figure 19(a), the maximum memory usage of all techniques significantly increases with the coding parameters. This is because the center server needs to read more helper blocks to repair a failed block when handling larger coding parameters. Compared to not using Algorithm 2, the three techniques using Algorithm 2 significantly reduce the maximum memory overhead; for example, Algorithm 2 reduces the maximum memory overhead by 24.09–62.45% for EG, 21.42–39.73% for SA, and 19.39–54.78% for SARepair. As shown in Figure 19(b), consistent with the previous analysis, the optimization of Algorithm 2 significantly reduces the memory overhead during recovery, which is significant for reducing the server's workload. Moreover, the maximum memory overhead of SARepair is less than 1 GB, which is acceptable for general servers.