Statistical Learning-Based Prediction of Execution Time of Data-Intensive Program Under Hadoop2.0

Haoran Zhang, Jianzhong Li & Hongzhi Wang

Part of the book series: Communications in Computer and Information Science (CCIS, volume 901)

Included in the following conference series:

International Conference of Pioneering Computer Scientists, Engineers and Educators

Abstract

This paper addresses the prediction of the running time of data-intensive MapReduce programs in the Hadoop 2.0 environment. Although MapReduce programs are diverse, they can be divided into data-intensive and computation-intensive programs according to their time complexity and nature. Predicting the running time of computation-intensive programs has long been difficult, whereas Hadoop exhibits certain database-like attributes and its workloads are basically data-intensive. Moreover, the execution time of a data-intensive program is closely related to the amount of data it processes and shows certain statistical characteristics, so statistical learning is applied to predict the execution time. The paper first generates training and test data according to requirements and then selects appropriate features by analyzing the logs. Prediction is first performed with the kernel canonical correlation analysis (KCCA) algorithm, which is found to have deficiencies. A prediction method based on deep learning is then proposed, motivated by the characteristics of the kernel function, and its results were significant.
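
To make the KCCA step mentioned in the abstract more concrete, the following is a minimal sketch of one common way KCCA is used for runtime prediction: project job features and observed execution times into correlated subspaces, then predict a new job's time from its nearest training neighbors in the projected feature space. This is not the authors' implementation. The feature set (input size, reducer count), the synthetic timing formula, the class name `KccaRuntimePredictor`, and all hyperparameters (`gamma`, `reg`, `dims`, `k`) are assumptions chosen only for illustration.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def center_train(K):
    """Center a training kernel matrix in feature space."""
    n = K.shape[0]
    H = np.eye(n) - np.full((n, n), 1.0 / n)
    return H @ K @ H

def center_new(K_new, K_train):
    """Center a new-vs-training kernel matrix using training statistics."""
    n = K_train.shape[0]
    ones_n = np.full((n, n), 1.0 / n)
    ones_m = np.full((K_new.shape[0], n), 1.0 / n)
    return K_new - ones_m @ K_train - K_new @ ones_n + ones_m @ K_train @ ones_n

def fit_kcca(Kx, Ky, reg=0.1, dims=2):
    """Regularized KCCA: dual weights for the x view and the leading correlations."""
    n = Kx.shape[0]
    Rx = Kx + reg * np.eye(n)
    Ry = Ky + reg * np.eye(n)
    # Singular values of Rx^-1 Kx Ky Ry^-1 are the regularized canonical correlations.
    M = np.linalg.solve(Rx, Kx @ Ky) @ np.linalg.inv(Ry)
    U, s, _ = np.linalg.svd(M)
    alpha = np.linalg.solve(Rx, U[:, :dims])
    return alpha, s[:dims]

class KccaRuntimePredictor:
    """Hypothetical helper: KCCA projection followed by k-nearest-neighbor averaging."""

    def __init__(self, gamma=1.0, reg=0.1, dims=2, k=3):
        self.gamma, self.reg, self.dims, self.k = gamma, reg, dims, k

    def fit(self, X, t):
        X = np.asarray(X, dtype=float)
        self.t = np.asarray(t, dtype=float)
        # Standardize features so no single one dominates the kernel distances.
        self.mu, self.sd = X.mean(axis=0), X.std(axis=0) + 1e-12
        self.Xs = (X - self.mu) / self.sd
        ts = ((self.t - self.t.mean()) / (self.t.std() + 1e-12))[:, None]
        self.Kx_raw = rbf_kernel(self.Xs, self.Xs, self.gamma)
        Kx = center_train(self.Kx_raw)
        Ky = center_train(rbf_kernel(ts, ts, self.gamma))
        self.alpha, self.corr = fit_kcca(Kx, Ky, self.reg, self.dims)
        self.train_proj = Kx @ self.alpha
        return self

    def predict(self, X_new):
        Xs = (np.atleast_2d(np.asarray(X_new, dtype=float)) - self.mu) / self.sd
        K_new = center_new(rbf_kernel(Xs, self.Xs, self.gamma), self.Kx_raw)
        proj = K_new @ self.alpha
        # Squared distances from each new job to every training job in the projected space.
        d = ((proj[:, None, :] - self.train_proj[None, :, :]) ** 2).sum(axis=2)
        nearest = np.argsort(d, axis=1)[:, :self.k]
        return self.t[nearest].mean(axis=1)

# Synthetic, made-up training jobs: input size in GB and number of reducers.
sizes_gb = np.arange(1, 21, dtype=float)
reducers = np.tile([2.0, 4.0, 8.0, 16.0], 5)
X_demo = np.column_stack([sizes_gb, reducers])
t_demo = 10.0 + 9.0 * sizes_gb + 48.0 / reducers  # assumed timing model, in seconds

model = KccaRuntimePredictor().fit(X_demo, t_demo)
predicted = model.predict([[10.5, 4.0]])
```

Because the prediction is an average of observed training times, it can never extrapolate beyond the range of runtimes seen during training. That is one plausible kind of limitation for a neighbor-based KCCA predictor, although the abstract does not say which deficiencies the authors observed.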

Author information

Authors and Affiliations

Massive Data Computing Research Center, Harbin Institute of Technology, Xidazhijie. 92, Harbin, China
Haoran Zhang, Jianzhong Li & Hongzhi Wang

Corresponding author

Correspondence to Hongzhi Wang.

Editor information

Editors and Affiliations

Zhengzhou University, Zhengzhou, Henan, China
Qinglei Zhou
Zhengzhou University of Light Industry, Zhengzhou, Henan, China
Yong Gan
Northeast Forestry University, Harbin, China
Weipeng Jing
Harbin University of Science and Technology, Harbin, China
Xianhua Song
Zhengzhou Institute of Technology, Zhengzhou, China
Yan Wang
National Academy of Guo Ding Institute of Data Science, Beijing, China
Zeguang Lu

About this paper

Cite this paper

Zhang, H., Li, J., Wang, H. (2018). Statistical Learning-Based Prediction of Execution Time of Data-Intensive Program Under Hadoop2.0. In: Zhou, Q., Gan, Y., Jing, W., Song, X., Wang, Y., Lu, Z. (eds) Data Science. ICPCSEE 2018. Communications in Computer and Information Science, vol 901. Springer, Singapore. https://doi.org/10.1007/978-981-13-2203-7_31

DOI: https://doi.org/10.1007/978-981-13-2203-7_31
Published: 09 September 2018
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-2202-0
Online ISBN: 978-981-13-2203-7
eBook Packages: Computer Science, Computer Science (R0)
