Abstract
Apache Spark, today's most popular open-source framework for processing large-scale data, uses an in-memory-oriented abstraction, the Resilient Distributed Dataset (RDD). Research on performance prediction and optimization for the Spark platform is growing rapidly. However, in most of this work the selection of important configuration parameters still depends on the experience of domain experts. Selecting configuration parameters with machine learning algorithms is therefore a non-trivial research problem. This paper proposes ISIP, a machine-learning-based method to Identify Spark Important Parameters. By providing a relatively important subset of configuration parameters, ISIP reduces the parameter space for performance tuning on Spark, saving users and researchers time and effort. ISIP uses the Mean-shift algorithm to cluster applications from Spark MLlib according to their workload characteristics. The relationship between performance and the configuration parameters is then modeled with a regression algorithm. For each type of application, ISIP produces a list of parameters ranked by importance; the parameters at the front of the list form the subset of most important configuration parameters. The experimental results show that tuning only the subset of relatively important parameters provided by ISIP has almost the same effect as tuning the complete parameter set.
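The pipeline summarized in the abstract (cluster applications by workload, fit a regression per cluster, rank parameters by importance) can be illustrated with a minimal sketch. It is not the authors' implementation. Several things are assumptions made for illustration: Lasso is used as the regression (the abstract only says "regression algorithm"), importance is taken to be the absolute standardized coefficient, the workload features are hypothetical two-dimensional vectors, and all runtimes are synthetic. The Spark property names are real, but they serve only as column labels. The sketch uses only the Python standard library; in practice scikit-learn's `MeanShift` and `Lasso` would be the natural choice.

```python
import math
import random

# Real Spark property names, used here only as column labels for synthetic data.
PARAMS = ["spark.executor.memory", "spark.executor.cores",
          "spark.default.parallelism", "spark.shuffle.compress",
          "spark.memory.fraction"]

def mean_shift(points, bandwidth, iters=30):
    """Flat-kernel mean shift; returns one cluster label per point."""
    modes = [list(p) for p in points]
    for _ in range(iters):
        shifted = []
        for m in modes:
            near = [p for p in points if math.dist(m, p) <= bandwidth] or [m]
            shifted.append([sum(c) / len(near) for c in zip(*near)])
        modes = shifted
    centers, labels = [], []
    for m in modes:  # merge modes that converged to (almost) the same place
        for k, c in enumerate(centers):
            if math.dist(m, c) <= bandwidth / 2:
                labels.append(k)
                break
        else:
            centers.append(m)
            labels.append(len(centers) - 1)
    return labels

def standardize(X):
    """Scale every column to zero mean and unit variance."""
    cols = list(zip(*X))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in X]

def soft_threshold(z, t):
    return math.copysign(max(abs(z) - t, 0.0), z)

def lasso(X, y, alpha, iters=100):
    """Coordinate descent for (1/2n)*||y - Xw||^2 + alpha*||w||_1."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(iters):
        for j in range(d):
            rho = norm = 0.0
            for row, target in zip(X, y):
                # residual with feature j left out of the prediction
                partial = target - sum(row[k] * w[k] for k in range(d) if k != j)
                rho += row[j] * partial
                norm += row[j] ** 2
            w[j] = soft_threshold(rho / n, alpha) / (norm / n) if norm else 0.0
    return w

def rank_parameters(X, y, alpha=0.5):
    """Return (parameter, coefficient) pairs sorted by |coefficient|, largest first."""
    y_mean = sum(y) / len(y)
    w = lasso(standardize(X), [v - y_mean for v in y], alpha)
    return sorted(zip(PARAMS, w), key=lambda pair: -abs(pair[1]))

def demo(seed=0):
    rng = random.Random(seed)
    # Hypothetical 2-D workload features for six applications in two groups.
    workloads = [[c + rng.uniform(-0.05, 0.05) for _ in range(2)]
                 for c in (0.2, 0.2, 0.2, 0.8, 0.8, 0.8)]
    labels = mean_shift(workloads, bandwidth=0.3)
    # Made-up ground truth: which normalized parameters drive runtime per group.
    true_w = {0: [-40, -10, 0, 0, 0], 1: [0, 0, -30, -10, 0]}
    rankings = {}
    for cluster in sorted(set(labels)):
        X, y = [], []
        for app_label in labels:
            if app_label != cluster:
                continue
            for _ in range(40):  # 40 synthetic runs per application
                x = [rng.random() for _ in PARAMS]
                runtime = 100 + sum(w * v for w, v in zip(true_w[cluster], x))
                X.append(x)
                y.append(runtime + rng.gauss(0, 1))
        rankings[cluster] = rank_parameters(X, y)
    return labels, rankings

labels, rankings = demo()
for cluster, ranked in rankings.items():
    print(cluster, [name for name, _ in ranked[:2]])
```

Taking the first few entries of each ranked list gives the reduced parameter subset for that application type. The cutoff of two used in the final loop is arbitrary and chosen only for the demonstration.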
Acknowledgement
This work is supported by the National Key Research and Development Program under No. 2016YFB1000703.
Copyright information
© 2018 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Li, T., Shi, S., Luo, J., Wang, H. (2018). A Method to Identify Spark Important Parameters Based on Machine Learning. In: Zhou, Q., Gan, Y., Jing, W., Song, X., Wang, Y., Lu, Z. (eds) Data Science. ICPCSEE 2018. Communications in Computer and Information Science, vol 901. Springer, Singapore. https://doi.org/10.1007/978-981-13-2203-7_42
DOI: https://doi.org/10.1007/978-981-13-2203-7_42
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-2202-0
Online ISBN: 978-981-13-2203-7
eBook Packages: Computer Science, Computer Science (R0)