Blink: Lightweight Sample Runs for Cost Optimization of Big Data Applications

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1652)

Included in the following conference series:

European Conference on Advances in Databases and Information Systems

Abstract

Distributed in-memory data processing engines accelerate iterative applications by caching datasets in memory rather than recomputing them in each iteration. Selecting a suitable cluster size for caching these datasets plays an essential role in achieving optimal performance. We present Blink, an autonomous sampling-based framework that predicts the sizes of cached datasets and selects the optimal cluster size without relying on historical runs. We evaluate Blink on iterative, real-world machine learning applications. With sample runs that cost, on average, 4.6% of the cost of the optimal runs, Blink selects the optimal cluster size, saving up to 47.4% of the execution cost compared to the average cost.
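
The idea summarized in the abstract can be illustrated with a minimal sketch. It is not Blink's actual algorithm, which the abstract does not describe. It assumes that cached dataset size grows roughly linearly with the input fraction used in the sample runs, and that a suitable cluster is the smallest one whose aggregate cache memory holds the predicted cached data. All names (`fit_line`, `predict_cached_size`, `smallest_fitting_cluster`) and all numbers are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit y = a*x + b over the sample points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def predict_cached_size(sample_fractions, cached_sizes_gb, full_fraction=1.0):
    """Extrapolate the cached size observed in sample runs to the full input.

    Assumption (not from the paper): a linear size model is adequate.
    """
    a, b = fit_line(sample_fractions, cached_sizes_gb)
    return a * full_fraction + b


def smallest_fitting_cluster(cached_size_gb, cache_memory_per_machine_gb, max_machines):
    """Return the fewest machines whose combined cache memory holds the data,
    or None if even max_machines is not enough."""
    needed = max(1, math.ceil(cached_size_gb / cache_memory_per_machine_gb))
    return needed if needed <= max_machines else None


# Hypothetical measurements from three tiny sample runs:
# fraction of the input used, and the cached dataset size observed (GB).
sample_fractions = [0.001, 0.002, 0.003]
cached_sizes_gb = [0.05, 0.10, 0.15]

predicted = predict_cached_size(sample_fractions, cached_sizes_gb)
machines = smallest_fitting_cluster(
    predicted, cache_memory_per_machine_gb=8.0, max_machines=12
)
print(f"predicted cached size: {predicted:.1f} GB, machines: {machines}")
```

In this sketch the sample runs touch at most 0.3% of the input, which is the sense in which sample runs are cheap compared to a full run. A real system would also have to account for execution memory, serialization overhead, and non-linear growth.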

Acknowledgement

This research was partially funded by the Thuringian Ministry for Economy, Science and Digital Society under the project thurAI and by the Carl-Zeiss-Stiftung under the project MemWerk.

Author information

Authors and Affiliations

TU Ilmenau, Ilmenau, Germany
Hani Al-Sayeh, Muhammad Attahir Jibril & Kai-Uwe Sattler
Riinvest College, Pristina, Kosovo
Bunjamin Memishi

Corresponding author

Correspondence to Hani Al-Sayeh.

Editor information

Editors and Affiliations

Politecnico di Torino, Turin, Italy
Silvia Chiusano
Politecnico di Torino, Turin, Italy
Tania Cerquitelli
Poznań University of Technology, Poznań, Poland
Robert Wrembel
Norwegian University of Science and Technology, Trondheim, Norway
Kjetil Nørvåg
University of Genoa, Genoa, Italy
Barbara Catania
CNRS, Villeurbanne Cedex, France
Genoveva Vargas-Solar
University of Calabria, Rende, Italy
Ester Zumpano

About this paper

Cite this paper

Al-Sayeh, H., Jibril, M.A., Memishi, B., Sattler, KU. (2022). Blink: Lightweight Sample Runs for Cost Optimization of Big Data Applications. In: Chiusano, S., et al. New Trends in Database and Information Systems. ADBIS 2022. Communications in Computer and Information Science, vol 1652. Springer, Cham. https://doi.org/10.1007/978-3-031-15743-1_14

DOI: https://doi.org/10.1007/978-3-031-15743-1_14
Published: 29 August 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-15742-4
Online ISBN: 978-3-031-15743-1
eBook Packages: Computer Science, Computer Science (R0)
