Abstract
During task execution in a distributed computing system, the master must assign new workers to replace stragglers, which commonly arise from network problems, hardware failures, software bugs, or high cluster load. If the portion of a task already completed by a lagging worker is not cached in time, the replacement worker must recompute it, causing further delay. On the Spark platform, checkpointing preserves the intermediate results held in resilient distributed datasets (RDDs) by writing them to the Hadoop distributed file system (HDFS); however, storing data in and retrieving it from HDFS remains time-consuming. Traditional distributed cache strategies neglect both coding techniques and the differing lag probabilities of workers as means of reducing caching overhead. This paper introduces a distributed cache strategy based on Luby transform (LT) codes to improve the system's effectiveness and robustness. An optimization method is designed to guide the selection of RDDs for caching within a prescribed cache limit. In addition, a novel algorithm determines which partitions participate in coded caching, taking into account worker failure probabilities and the overhead of encoding and decoding. When a worker fails, its intermediate data can be swiftly recovered by decoding, and standby workers can rapidly take over the straggler's tasks, markedly reducing overall task latency. Theoretical analysis and experimental results demonstrate that the LT code-based caching strategy achieves higher efficiency and lower latency than recomputation, HDFS-based, and weight-based caching methods, and consumes far less memory than Redis.
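To make the coded-caching idea concrete, the following is a minimal, illustrative LT-code sketch in Python (not the authors' implementation): cached partitions are XOR-combined into coded symbols, and a peeling decoder reconstructs lost partitions from whichever symbols survive a worker failure. The function names, the uniform degree distribution, and the integer stand-ins for partition data are assumptions made for brevity; practical LT codes typically draw degrees from the robust soliton distribution and operate on real partition bytes.

```python
import random


def lt_encode(blocks, n_symbols, seed=0):
    """Combine source blocks into coded symbols.

    Each symbol XORs a random subset of the blocks. For brevity the
    degree is drawn uniformly; a practical LT code would use the
    robust soliton distribution instead.
    """
    rng = random.Random(seed)
    k = len(blocks)
    symbols = []
    for _ in range(n_symbols):
        degree = rng.randint(1, k)
        neighbors = frozenset(rng.sample(range(k), degree))
        payload = 0
        for i in neighbors:
            payload ^= blocks[i]           # XOR-combine the chosen blocks
        symbols.append((neighbors, payload))
    return symbols


def lt_decode(symbols, k):
    """Peeling decoder: repeatedly resolve symbols with exactly one
    unknown block and substitute the result into the others."""
    recovered = {}
    progress = True
    while progress and len(recovered) < k:
        progress = False
        for neighbors, payload in symbols:
            unknown = set(neighbors) - recovered.keys()
            if len(unknown) == 1:
                i = unknown.pop()
                value = payload
                for j in neighbors - {i}:  # XOR out already-known blocks
                    value ^= recovered[j]
                recovered[i] = value
                progress = True
    if len(recovered) < k:
        return None                        # too few symbols survived
    return [recovered[i] for i in range(k)]


# Toy usage: four "partitions" (integers standing in for partition data)
# are encoded into eight coded symbols cached across workers; decoding
# succeeds when enough symbols survive, otherwise it returns None.
blocks = [0x11, 0x22, 0x33, 0x44]
coded = lt_encode(blocks, n_symbols=8, seed=7)
surviving = coded[:6]                      # pretend two symbols were lost
print(lt_decode(surviving, k=len(blocks)))
```

In this toy setting, recovery only requires XOR operations over the surviving symbols, which is the intuition behind the claim that decoding is faster than recomputing or re-reading the lost intermediate data.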
Funding
This research was supported by the National Key R&D Program of China under Grant No. 2023YFB4503100, the National Natural Science Foundation of China under Grant No. U23B2027, and the China Mobile Strategic R&D Project under Grant No. R23100LX.
Author information
Authors and Affiliations
Contributions
JS was involved in conceptualization, acquisition of data, methodology, validation, writing—original draft, and funding acquisition. YZ took part in methodology, data analysis, software, writing—original draft, and writing—review and editing. JW participated in conceptualization, data curation, validation, and supervision. ZW was responsible for acquisition of data, data analysis, and software. ZX contributed to formal analysis, investigation, and resources.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest as defined by Springer, or any other interests that might be perceived to influence the results and/or discussion reported in this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Shang, J., Zhang, Y., Wang, J. et al. Distributed cache strategy based on LT codes under spark platform. J Supercomput 80, 16519–16545 (2024). https://doi.org/10.1007/s11227-024-06095-9