Abstract
Apache Hadoop is a distributed system widely used in large-scale production environments. As data volumes and cluster sizes grow, its performance is limited by inappropriate resource utilization. This paper introduces a resources utilization predictor (RUPredHadoop) that predicts the utilization of CPU and memory and the read/write rates of disk and network, particularly for large-scale Hadoop clusters. Based on the similarity of data and workflow in Hadoop, a pattern of resource utilization for a single task is proposed and then formalized as a single task model. In addition, the distribution of fine-grained runtime is studied, so that a parallel-batch-tasks-based model can regenerate the whole MapReduce job by migrating the single task model from a minimum cluster to a large-scale production cluster. With RUPredHadoop, we can locate the resource bottleneck of a Hadoop cluster and quickly configure clusters for applications with massive data. The performance of RUPredHadoop is validated on a test cluster with 35 nodes and a production cluster with 80 nodes. Results show that the normalization error is below 10% for benchmark applications with up to 100 TB of data.
This paper is partially supported by the National Key Research and Development Program of China (No. 2017YFB1400300), the National Natural Science Foundation of China (No. 61573292), and the State Key Laboratory of Rail Transit Engineering Informatization (FSDI) (No. SKLK16-04).
Copyright information
© 2018 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Ning, S., Teng, F., Li, Y., Cui, Z., Yu, L., Du, S. (2018). RUPredHadoop: Resources Utilization Predictor for Hadoop with Large-Scale Clusters. In: Xu, Z., Gao, X., Miao, Q., Zhang, Y., Bu, J. (eds) Big Data. Big Data 2018. Communications in Computer and Information Science, vol 945. Springer, Singapore. https://doi.org/10.1007/978-981-13-2922-7_32
DOI: https://doi.org/10.1007/978-981-13-2922-7_32
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-2921-0
Online ISBN: 978-981-13-2922-7
eBook Packages: Computer Science, Computer Science (R0)