Abstract
Distributed in-memory data processing engines accelerate iterative applications by caching datasets in memory rather than recomputing them in each iteration. Selecting a suitable cluster size for caching these datasets plays an essential role in achieving optimal performance. We present Blink, an autonomous sampling-based framework that predicts the sizes of cached datasets and selects the optimal cluster size without relying on historical runs. We evaluate Blink on iterative, real-world machine learning applications. With sample runs that cost on average 4.6% of the cost of optimal runs, Blink selects the optimal cluster size, saving up to 47.4% of execution cost compared to the average cost.
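The sampling idea summarized above can be illustrated with a small sketch. It is not Blink's actual model, which the abstract does not specify. It assumes that cached dataset size grows linearly with the input fraction, and that the smallest cluster whose aggregate cache memory holds the predicted size is a reasonable proxy for the optimal one. All function names, sample fractions, and memory figures are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a * x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def predict_cached_size(sample_fractions, cached_sizes_mb):
    """Extrapolate the cached size at full input (fraction 1.0).

    Assumption: size is linear in the sampled fraction of the input.
    """
    a, b = fit_line(sample_fractions, cached_sizes_mb)
    return a * 1.0 + b


def smallest_fitting_cluster(cached_size_mb, cache_mb_per_machine, max_machines):
    """Return the fewest machines whose combined cache memory fits the data.

    Returns None if even max_machines machines are not enough.
    """
    needed = max(1, math.ceil(cached_size_mb / cache_mb_per_machine))
    return needed if needed <= max_machines else None


# Hypothetical measurements from three lightweight sample runs.
fractions = [0.001, 0.002, 0.003]   # share of the input data used per run
sizes_mb = [12.0, 22.0, 32.0]       # observed size of the cached dataset

predicted_mb = predict_cached_size(fractions, sizes_mb)
machines = smallest_fitting_cluster(predicted_mb, cache_mb_per_machine=4096,
                                    max_machines=12)
print(f"predicted cached size: {predicted_mb:.0f} MB, machines: {machines}")
```

The point of the sketch is that a cluster too small to hold the cached data forces recomputation in every iteration, while a larger one pays for memory that is never used, so a size prediction obtained from cheap sample runs is enough to narrow the choice.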
Acknowledgement
This research was partially funded by the Thuringian Ministry for Economy, Science and Digital Society under the project thurAI and by the Carl-Zeiss-Stiftung under the project MemWerk.
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Al-Sayeh, H., Jibril, M.A., Memishi, B., Sattler, K.U. (2022). Blink: Lightweight Sample Runs for Cost Optimization of Big Data Applications. In: Chiusano, S., et al. New Trends in Database and Information Systems. ADBIS 2022. Communications in Computer and Information Science, vol 1652. Springer, Cham. https://doi.org/10.1007/978-3-031-15743-1_14
DOI: https://doi.org/10.1007/978-3-031-15743-1_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-15742-4
Online ISBN: 978-3-031-15743-1
eBook Packages: Computer Science (R0)