Abstract
During task execution in a distributed computing system, the master must assign new workers to replace stragglers, which commonly arise from network problems, hardware failures, software bugs, or high cluster load. If the portion of a task already completed by a lagging worker is not cached in time, the replacement worker must recompute it, causing further delay. On the Spark platform, checkpointing preserves the intermediate results held in resilient distributed datasets (RDDs) by writing them to the Hadoop distributed file system (HDFS); however, storing data in and retrieving it from HDFS remains time-consuming. Traditional distributed cache strategies neglect both coding techniques and the differing lag probabilities of workers as means of reducing caching overhead. This paper introduces a distributed cache strategy based on Luby transform (LT) codes to improve the system's effectiveness and robustness. An optimization method is designed to guide the selection of RDDs for caching within a prescribed cache limit. In addition, a novel algorithm determines which partitions participate in coded caching, taking into account worker failure probabilities and the overhead of encoding and decoding. When a worker fails, its intermediate data can be swiftly recovered by decoding, and standby workers can rapidly take over the straggler's tasks, markedly reducing overall task latency. Theoretical analysis and experimental results demonstrate that the LT code-based caching strategy achieves higher efficiency and lower latency than recomputation, HDFS-based, and weight-based caching methods, and consumes far less memory than Redis.
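To make the coded-caching idea concrete, the following is a minimal, illustrative LT-code sketch in Python (not the authors' implementation): cached partitions are XOR-combined into coded symbols, and a peeling decoder reconstructs lost partitions from whichever symbols survive a worker failure. The function names, the uniform degree distribution, and the integer stand-ins for partition data are assumptions made for brevity; practical LT codes typically draw degrees from the robust soliton distribution and operate on real partition bytes.

```python
import random


def lt_encode(blocks, n_symbols, seed=0):
    """Combine source blocks into coded symbols.

    Each symbol XORs a random subset of the blocks. For brevity the
    degree is drawn uniformly; a practical LT code would use the
    robust soliton distribution instead.
    """
    rng = random.Random(seed)
    k = len(blocks)
    symbols = []
    for _ in range(n_symbols):
        degree = rng.randint(1, k)
        neighbors = frozenset(rng.sample(range(k), degree))
        payload = 0
        for i in neighbors:
            payload ^= blocks[i]           # XOR-combine the chosen blocks
        symbols.append((neighbors, payload))
    return symbols


def lt_decode(symbols, k):
    """Peeling decoder: repeatedly resolve symbols with exactly one
    unknown block and substitute the result into the others."""
    recovered = {}
    progress = True
    while progress and len(recovered) < k:
        progress = False
        for neighbors, payload in symbols:
            unknown = set(neighbors) - recovered.keys()
            if len(unknown) == 1:
                i = unknown.pop()
                value = payload
                for j in neighbors - {i}:  # XOR out already-known blocks
                    value ^= recovered[j]
                recovered[i] = value
                progress = True
    if len(recovered) < k:
        return None                        # too few symbols survived
    return [recovered[i] for i in range(k)]


# Toy usage: four "partitions" (integers standing in for partition data)
# are encoded into eight coded symbols cached across workers; decoding
# succeeds when enough symbols survive, otherwise it returns None.
blocks = [0x11, 0x22, 0x33, 0x44]
coded = lt_encode(blocks, n_symbols=8, seed=7)
surviving = coded[:6]                      # pretend two symbols were lost
print(lt_decode(surviving, k=len(blocks)))
```

In this toy setting, recovery only requires XOR operations over the surviving symbols, which is the intuition behind the claim that decoding is faster than recomputing or re-reading the lost intermediate data.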
Funding
This research was supported by the National Key R&D Program of China under Grant No. 2023YFB4503100, the National Natural Science Foundation of China under Grant No. U23B2027, and the China Mobile Strategic R&D Project under Grant No. R23100LX.
Author information
Authors and Affiliations
Contributions
JS was involved in conceptualization, acquisition of data, methodology, validation, writing—original draft, and funding acquisition. YZ took part in methodology, data analysis, software, writing—original draft, and writing—review and editing. JW participated in conceptualization, data curation, validation, and supervision. ZW was responsible for acquisition of data, data analysis, and software. ZX contributed to formal analysis, investigation, and resources.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest as defined by Springer, or any other interests that might be perceived to influence the results and/or discussion reported in this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Shang, J., Zhang, Y., Wang, J. et al. Distributed cache strategy based on LT codes under spark platform. J Supercomput 80, 16519–16545 (2024). https://doi.org/10.1007/s11227-024-06095-9