Abstract
This paper predicts the running time of data-intensive MapReduce programs in a Hadoop 2.0 environment. Although MapReduce programs are diverse, they can be divided into data-intensive and computation-intensive programs according to their time complexity and nature. Predicting the running time of computation-intensive programs has always been difficult, whereas Hadoop exhibits certain database-like attributes and its workloads are mostly data-intensive. Moreover, the running time of a data-intensive program is closely related to the amount of data it processes and shows certain statistical characteristics, so statistical learning is applied to predict the execution time. The paper first generates training and test data as required and then selects appropriate features by analyzing the logs. Prediction is first performed with the KCCA (kernel canonical correlation analysis) algorithm, which is found to have deficiencies. A prediction method based on deep learning, motivated by the characteristics of the kernel function, is then proposed, and it produced significant results.
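To make the statistical-learning idea concrete, the following minimal sketch fits a kernel-based regressor that maps job features to execution time. It is an illustration under stated assumptions, not the method of this paper: it uses Gaussian-kernel ridge regression as a simple stand-in for both KCCA and the deep-learning model, the two features (`input_gb`, `reducers`) are hypothetical, and the training data are synthetic rather than taken from Hadoop logs.

```python
import numpy as np

def gaussian_kernel(A, B, gamma):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * (A @ B.T)
    return np.exp(-gamma * np.maximum(sq, 0.0))

# Synthetic stand-in for log-derived data (assumption: runtime grows
# roughly linearly with input size, as for a data-intensive job).
rng = np.random.default_rng(0)
n = 200
input_gb = rng.uniform(1.0, 50.0, size=n)            # hypothetical feature
reducers = rng.integers(1, 9, size=n).astype(float)  # hypothetical feature
runtime_s = (15.0 + 4.0 * input_gb + 0.5 * input_gb / reducers
             + rng.normal(0.0, 2.0, size=n))

X = np.column_stack([input_gb, reducers])
X_train, X_test = X[:160], X[160:]
y_train, y_test = runtime_s[:160], runtime_s[160:]

# Standardize features with training statistics; center the target.
mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
Z_train, Z_test = (X_train - mu) / sigma, (X_test - mu) / sigma
y_mean = y_train.mean()

# Kernel ridge regression: solve (K + lam * I) alpha = y - mean(y).
gamma, lam = 0.5, 1e-2
K = gaussian_kernel(Z_train, Z_train, gamma)
alpha = np.linalg.solve(K + lam * np.eye(len(Z_train)), y_train - y_mean)

# Predict held-out runtimes and report the mean relative error.
y_pred = gaussian_kernel(Z_test, Z_train, gamma) @ alpha + y_mean
mean_rel_err = float(np.mean(np.abs(y_pred - y_test) / y_test))
print(f"mean relative error on held-out jobs: {mean_rel_err:.3f}")
```

In a real setting the feature matrix would come from the job logs and configuration, and the kernel width `gamma` and regularization `lam` would be chosen by cross-validation rather than fixed as above.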
Copyright information
© 2018 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Zhang, H., Li, J., Wang, H. (2018). Statistical Learning-Based Prediction of Execution Time of Data-Intensive Program Under Hadoop2.0. In: Zhou, Q., Gan, Y., Jing, W., Song, X., Wang, Y., Lu, Z. (eds) Data Science. ICPCSEE 2018. Communications in Computer and Information Science, vol 901. Springer, Singapore. https://doi.org/10.1007/978-981-13-2203-7_31
DOI: https://doi.org/10.1007/978-981-13-2203-7_31
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-2202-0
Online ISBN: 978-981-13-2203-7
eBook Packages: Computer Science, Computer Science (R0)