Abstract
In recent years, there has been a significant increase in the amount of available data and computational resources. This has led the scientific and industrial communities to pursue more accurate and efficient Machine Learning (ML) models. Random Forest is a well-known ML algorithm due to the good results it obtains on a wide range of problems. Our objective is to create a parallel version of the algorithm that can generate a model from data distributed across different processors and that scales with the available computational resources. This paper presents two novel data-parallel proposals for this algorithm. The first version is implemented using the PyCOMPSs framework and its failure-management mechanism, while the second variant uses the new PyCOMPSs nesting paradigm, in which parallel tasks can generate further tasks within them. Both approaches are compared with each other and against the MLlib Apache Spark Random Forest through strong- and weak-scaling tests. Our findings indicate that while the MLlib implementation is faster when executed on a small number of nodes, the scalability of both new variants is superior. We conclude that the proposed data-parallel approaches to the Random Forest algorithm can effectively generate accurate and efficient models in a distributed computing environment and offer improved scalability over existing methods.
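The data-parallel scheme described in the abstract — each task trains one tree on its own data partition, and the forest aggregates their votes — can be sketched in plain Python. This is not the paper's PyCOMPSs implementation: the sketch below substitutes `concurrent.futures` threads for PyCOMPSs tasks and a toy depth-1 stump for a full decision tree, and the names `train_stump`, `fit_forest`, and `predict` are illustrative, not the authors' API.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def train_stump(partition):
    """Toy stand-in for tree training: a depth-1 stump on feature 0.

    A partition is a (samples, labels) pair; the stump splits at the
    partition mean and predicts the majority label on each side.
    """
    X, y = partition
    threshold = sum(x[0] for x in X) / len(X)
    default = Counter(y).most_common(1)[0][0]
    left = [lab for x, lab in zip(X, y) if x[0] <= threshold]
    right = [lab for x, lab in zip(X, y) if x[0] > threshold]
    left_lab = Counter(left).most_common(1)[0][0] if left else default
    right_lab = Counter(right).most_common(1)[0][0] if right else default
    return threshold, left_lab, right_lab

def fit_forest(partitions):
    """Train one 'tree' per partition as independent parallel tasks.

    Mirrors the data-parallel idea: no tree ever touches data outside
    its own partition, so tasks need no communication.
    """
    with ThreadPoolExecutor() as pool:
        return list(pool.map(train_stump, partitions))

def predict(forest, x):
    """Aggregate the per-tree predictions by majority vote."""
    votes = [left if x[0] <= t else right for t, left, right in forest]
    return Counter(votes).most_common(1)[0][0]
```

With PyCOMPSs, the training function would instead carry a `@task` decorator and results would be collected with `compss_wait_on`, but the structure — one independent training task per data partition, aggregation only at prediction time — is the same.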
References
Azizah, N., Riza, L.S., Wihardi, Y.: Implementation of random forest algorithm with parallel computing in R. J. Phys.: Conf. Ser. 1280(2), 022028 (2019). https://doi.org/10.1088/1742-6596/1280/2/022028
Baldi, P., Sadowski, P., Whiteson, D.: Searching for exotic particles in high-energy physics with deep learning. Nature Commun. 5(1), 4308 (2014)
Ben-Haim, Y., Tom-Tov, E.: A streaming parallel decision tree algorithm. J. Mach. Learn. Res. 11(2) (2010)
Breiman, L., Friedman, J., Olshen, R., Stone, C.: Classification and Regression Trees. Wadsworth, Belmont, CA (1984)
Chen, J., et al.: A parallel random forest algorithm for big data in a spark cloud computing environment. IEEE Trans. Parallel Distrib. Syst. 28(4), 919–933 (2016)
Cid-Fuentes, J.Á., Solà, S., Álvarez, P., Castro-Ginard, A., Badia, R.M.: dislib: large scale high performance machine learning in Python. In: 2019 15th International Conference on eScience (eScience), pp. 96–105. IEEE (2019)
Ejarque, J., Bertran, M., Cid-Fuentes, J.Á., Conejero, J., Badia, R.M.: Managing failures in task-based parallel workflows in distributed computing environments. In: Malawski, M., Rzadca, K. (eds.) Euro-Par 2020. LNCS, vol. 12247, pp. 411–425. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-57675-2_26
Ho, T.K.: Random decision forests. In: Proceedings of 3rd International Conference on Document Analysis and Recognition, vol. 1, pp. 278–282. IEEE (1995)
Lordan, F., et al.: ServiceSs: an interoperable programming framework for the cloud. J. Grid Comput. 12(1), 67–91 (2013). https://doi.org/10.1007/s10723-013-9272-5
Lordan, F., Lezzi, D., Badia, R.M.: Colony: parallel functions as a service on the cloud-edge continuum. In: Sousa, L., Roma, N., Tomás, P. (eds.) Euro-Par 2021. LNCS, vol. 12820, pp. 269–284. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-85665-6_17
Meng, X., et al.: MLlib: machine learning in Apache Spark. J. Mach. Learn. Res. 17(1), 1235–1241 (2016)
Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
Quinlan, J.R.: Induction of decision trees. Mach. Learn. 1(1), 81–106 (1986)
Rocklin, M.: Dask: parallel computation with blocked algorithms and task scheduling. In: Proceedings of the 14th Python in Science Conference, pp. 130–136. Citeseer (2015)
Salzberg, S.L.: C4.5: Programs for Machine Learning by J. Ross Quinlan. Morgan Kaufmann Publishers, Inc., 1993 (1994)
Tejedor, E., et al.: PyCOMPSs: parallel computational workflows in Python. Int. J. High Perform. Comput. Appl. 31(1), 66–82 (2017)
Van Rossum, G., Drake, F.L.: Python 3 Reference Manual. CreateSpace, Scotts Valley, CA (2009)
Zaharia, M., et al.: Apache spark: a unified engine for big data processing. Commun. ACM 59(11), 56–65 (2016). https://doi.org/10.1145/2934664
Acknowledgements
This work has been supported by the Spanish Government (PID2019-107255GB) and by the MCIN/AEI/10.13039/501100011033 (CEX2021-001148-S), by the Departament de Recerca i Universitats de la Generalitat de Catalunya to the Research Group MPiEDist (2021 SGR 00412), and by the European Commission's Horizon 2020 Framework program and the European High-Performance Computing Joint Undertaking (JU) under grant agreement No 955558 and by MCIN/AEI/10.13039/501100011033 and the European Union NextGenerationEU/PRTR (PCI2021-121957, project eFlows4HPC), and by the European Commission through the Horizon Europe Research and Innovation program under Grant Agreement No. 101016577 (AI-Sprint project).
We thank Núria Masclans and Lluís Jofre from the Department of Fluid Mechanics of the Universitat Politècnica de Catalunya for providing the High Pressure Turbulence dataset.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Vázquez-Novoa, F., Conejero, J., Tatu, C., Badia, R.M. (2023). Scalable Random Forest with Data-Parallel Computing. In: Cano, J., Dikaiakos, M.D., Papadopoulos, G.A., Pericàs, M., Sakellariou, R. (eds) Euro-Par 2023: Parallel Processing. Euro-Par 2023. Lecture Notes in Computer Science, vol 14100. Springer, Cham. https://doi.org/10.1007/978-3-031-39698-4_27
DOI: https://doi.org/10.1007/978-3-031-39698-4_27
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-39697-7
Online ISBN: 978-3-031-39698-4