[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ Skip to main content
Log in

Distributed cache strategy based on LT codes under spark platform

  • Published:
The Journal of Supercomputing Aims and scope Submit manuscript

Abstract

During the execution of tasks in a distributed computing system, the master needs to assign new workers to replace lagging workers that usually occur due to various issues such as network problems, hardware failures, software bugs, or high cluster load. If the completed part of tasks by lagging workers is not cached timely, new workers would need to recalculate it, leading to further delays. In the Spark platform, checkpointing allows for the preservation of intermediate results computed by workers in resilient distributed datasets (RDDs) by storing them in Hadoop distributed file system (HDFS). However, storing and retrieving data from HDFS still remains time-consuming. Traditional distributed cache strategies neglect to utilize both coding techniques and different lag probabilities of workers to reduce overhead of caching data. This paper introduces a distributed cache strategy based on Luby transform (LT) codes to enhance system’s effectiveness and robustness. An optimization method is designed within prescribed cache limits to guide the selection of RDDs for caching. Additionally, a novel algorithm is proposed for determining the participation of specific partitions in encoding caching under consideration of worker failure probabilities and overhead of encoding and decoding. In the event of worker failure, intermediate data can be swiftly recovered through decoding, and then the standbys can rapidly take over stragglers’ tasks, hence markedly mitigating overall task latency. Theoretical analysis and experimental results demonstrate that the LT code-based caching strategy exhibits more efficiency and lower latency compared to recomputation, HDFS-based and weight-based caching methods, and it saves much more memory overhead compared to Redis.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic
£29.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Price includes VAT (United Kingdom)

Instant access to the full article PDF.

Fig. 1
Fig. 2
Algorithm 1
Fig. 3
Algorithm 2
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10

Similar content being viewed by others

References

  1. Zaharia M, Chowdhury M, Franklin MJ, Shenker S, Stoica I (2010) Spark: cluster computing with working sets. In: 2nd USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 10)

  2. Saha B, Shah H, Seth S, Vijayaraghavan G, Murthy A, Curino C (2015) Apache tez: a unifying framework for modeling and building data processing applications. In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pp 1357–1369

  3. Färber F, Cha SK, Primsch J, Bornhövd C, Sigg S, Lehner W (2012) Sap hana database: data management for modern business applications. ACM Sigmod Record 40(4):45–51

    Article  Google Scholar 

  4. Evans R (2015) Apache storm, a hands on tutorial. In: 2015 IEEE International Conference on Cloud Engineering. IEEE, pp 2–2

  5. Ananthanarayanan G, Ghodsi A, Warfield A, Borthakur D, Kandula S, Shenker S, Stoica I (2012) Pacman: coordinated memory caching for parallel jobs. In: 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12), pp 267–280

  6. Yu M, Li R, Chen Y (2020) A cache replacement policy based on multi-factors for named data networking. Comput Mater Continua 65(1):321–336

    Article  Google Scholar 

  7. Yu Y, Wang W, Zhang J, Letaief KB (2017) Lrc: dependency-aware cache management for data analytics clusters. In: IEEE INFOCOM 2017-IEEE Conference on Computer Communications. IEEE, pp 1–9

  8. Wang B, Tang J, Zhang R, Ding W, Qi D (2018) Lcrc: a dependency-aware cache management policy for spark. In: 2018 IEEE International Conference on Parallel and Distributed Processing with Applications, Ubiquitous Computing and Communications, Big Data and Cloud Computing, Social Computing and Networking, Sustainable Computing and Communications (ISPA/IUCC/BDCloud/SocialCom/SustainCom). IEEE, pp 956–963

  9. Perez TB, Zhou X, Cheng D (2018) Reference-distance eviction and prefetching for cache management in spark. In: Proceedings of the 47th International Conference on Parallel Processing, pp 1–10

  10. Mattson RL, Gecsei J, Slutz DR, Traiger IL (1970) Evaluation techniques for storage hierarchies. IBM Syst J 9(2):78–117

    Article  Google Scholar 

  11. Li C, Cox AL (2015) Gd-wheel: a cost-aware replacement policy for key-value stores. In: Proceedings of the Tenth European Conference on Computer Systems, pp 1–15

  12. Zhang C (2022) Design and implementation of distributed cache for heterogeneous multilevel storage. PhD thesis, University of Electronic Science and Technology, Chengdu, China

  13. Xia M, Saxena M, Blaum M, Pease DA (2015) A tale of two erasure codes in \(\{\)HDFS\(\}\). In: 13th USENIX Conference on File and Storage Technologies (FAST 15), pp 213–226

  14. Weil S, Brandt SA, Miller EL, Long DD, Maltzahn C (2006) Ceph: a scalable, high-performance distributed file system. In: Proceedings of the 7th Conference on Operating Systems Design and Implementation (OSDI’06), pp 307–320

  15. Reis GA, Chang J, Vachharajani N, Rangan R, August DI (2005) Swift: software implemented fault tolerance. In: International Symposium on Code Generation and Optimization. IEEE, pp 243–254

  16. Zhang X, Cai Y, Liu Y, Xu Z, Dong X (2020) Nade: nodes performance awareness and accurate distance evaluation for degraded read in heterogeneous distributed erasure code-based storage. J Supercomput 76:4946–4975

    Article  Google Scholar 

  17. Reed IS, Solomon G (1960) Polynomial codes over certain finite fields. J Soc Ind Appl Math 8(2):300–304

    Article  MathSciNet  Google Scholar 

  18. Song Y, Yu J, Li B, Li H, He X, Wang J, Zhai R (2022) Rcm: a remote cache management framework for spark. Appl Sci 12(22):11491

    Article  Google Scholar 

  19. Fahim M, Cadambe VR (2021) Numerically stable polynomially coded computing. IEEE Trans Inf Theory 67(5):2758–2785

    Article  MathSciNet  Google Scholar 

  20. Wang S, Liu J, Shroff N (2018) Coded sparse matrix multiplication. In: International Conference on Machine Learning. PMLR, pp 5152–5160

  21. Ramamoorthy A, Tang L (2021) Numerically stable coded matrix computations via circulant and rotation matrix embeddings. IEEE Trans Inf Theory 68(4):2684–2703

    Article  MathSciNet  Google Scholar 

  22. Das AB, Ramamoorthy A, Vaswani N (2021) Efficient and robust distributed matrix computations via convolutional coding. IEEE Trans Inf Theory 67(9):6266–6282

    Article  MathSciNet  Google Scholar 

  23. Subramaniam AM, Heidarzadeh A, Narayanan KR (2019) Random khatri-rao-product codes for numerically-stable distributed matrix multiplication. In: 2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE, pp 253–259

  24. Li C, Cai Q, Luo Y (2022) Data balancing-based intermediate data partitioning and check point-based cache recovery in spark environment. J Supercomput 78(3):3561–3604

    Article  Google Scholar 

  25. Luby M (2002) Lt codes. In: The 43rd Annual IEEE Symposium on Foundations of Computer Science, 2002. Proceedings. IEEE Computer Society, pp 271–271

  26. Liu J, Wang J, Ge Y, Li S, Cui X (2022) A data distribution scheme for vanet based on fountain code. J Supercomput 78(15):16794–16819

    Article  Google Scholar 

  27. Dai Y, Fang Y, Yang L, Jeon G (2016) Graphics processing unit-accelerated joint-bitplane belief propagation algorithm in dsc. J Supercomput 72(6):2351–2375

    Article  Google Scholar 

  28. Adiga S, Xiao X, Tandon R, Vasić B, Bose T (2024) Generalization bounds for neural belief propagation decoders. IEEE Trans Inf Theory. https://doi.org/10.1109/TIT.2024.3361388

    Article  MathSciNet  Google Scholar 

  29. Chen GT, Cao L, Zhao F, Zheng H-f, Pan M (2012) Analysis of robust soliton distribution for lt code. In: 2012 IEEE 11th International Conference on Signal Processing, vol 2. IEEE, pp 1546–1549

  30. Yao W, Yi B, Huang T, Li W (2016) Poisson robust soliton distribution for lt codes. IEEE Commun Lett 20(8):1499–1502

    Article  Google Scholar 

  31. Nakka N, Agrawal A, Choudhary A (2011) Predicting node failure in high performance computing systems from failure and usage logs. In: 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum. IEEE, pp 1557–1566

Download references

Funding

This research was supported by the National Key R&D Program of China under Grant No. 2023YFB4503100, the National Natural Science Foundation of China under Grant No. U23B2027 and the China Mobile Strategic R&D Project under Grant No. R23100LX.

Author information

Authors and Affiliations

Authors

Contributions

JS was involved in conceptualization, acquisition of data, methodology, validation, writing—original draft and funding acquisition. YZ took part in methodology, data analysis, software and writing—original draft and reviewing and editing. JW participated in conceptualization, data curation, validation and supervision. ZW was responsible for acquisition of data, data analysis and software. ZX contributed to formal analysis, investigation and resources.

Corresponding author

Correspondence to Yifei Zhang.

Ethics declarations

Conflict of interest

We declare that the authors have no conflict of interest as defined by Springer, or other interests that might be perceived to influence the results and/or discussion reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Shang, J., Zhang, Y., Wang, J. et al. Distributed cache strategy based on LT codes under spark platform. J Supercomput 80, 16519–16545 (2024). https://doi.org/10.1007/s11227-024-06095-9

Download citation

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11227-024-06095-9

Keywords

Navigation