DOI: 10.1145/3219819.3220124

Training Big Random Forests with Little Resources

Published: 19 July 2018

Abstract

Without access to large compute clusters, building random forests on large datasets is still a challenging problem. This is particularly the case if fully-grown trees are desired. We propose a simple yet effective framework that makes it possible to efficiently construct ensembles of huge trees for hundreds of millions or even billions of training instances using a cheap desktop computer with commodity hardware. The basic idea is a multi-level construction scheme that builds top trees for small random subsets of the available data and subsequently distributes all training instances to the top trees' leaves for further processing. While the scheme is conceptually simple, its overall efficiency crucially depends on the particular implementation of the different phases. The practical merits of our approach are demonstrated using dense datasets with hundreds of millions of training instances.
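
To make the multi-level idea concrete, here is a minimal sketch in Python, assuming in-memory NumPy arrays and scikit-learn decision trees. It is not the authors' implementation, and the names (train_two_level_tree, predict_two_level, subset_size, top_depth) are illustrative only: a shallow top tree is fitted on a small random subset, every instance is then routed to one of its leaves, and a fully-grown bottom tree is trained on each leaf partition.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def train_two_level_tree(X, y, subset_size=100_000, top_depth=8, seed=0):
        """Sketch of one two-level tree: a shallow top tree fitted on a random
        subset routes every instance to a leaf; a fully-grown bottom tree is
        then trained on each (much smaller) leaf partition."""
        rng = np.random.default_rng(seed)
        # Phase 1: fit a shallow top tree on a small random subset of the data.
        idx = rng.choice(len(X), size=min(subset_size, len(X)), replace=False)
        top = DecisionTreeClassifier(max_depth=top_depth, max_features="sqrt")
        top.fit(X[idx], y[idx])
        # Phase 2: distribute *all* training instances to the top tree's leaves.
        leaf_ids = top.apply(X)
        # Phase 3: grow an unpruned bottom tree per leaf partition.
        bottom = {}
        for leaf in np.unique(leaf_ids):
            mask = leaf_ids == leaf
            if np.unique(y[mask]).size > 1:  # skip leaves that are already pure
                tree = DecisionTreeClassifier(max_features="sqrt")  # max_depth=None: fully grown
                bottom[leaf] = tree.fit(X[mask], y[mask])
        return top, bottom

    def predict_two_level(top, bottom, X):
        """Route instances through the top tree, then query the matching bottom tree."""
        leaf_ids = top.apply(X)
        pred = top.predict(X)  # fallback for pure leaves without a bottom tree
        for leaf, tree in bottom.items():
            mask = leaf_ids == leaf
            if mask.any():
                pred[mask] = tree.predict(X[mask])
        return pred

In the paper's setting, one such two-level tree would correspond to a single ensemble member; the sketch deliberately omits the out-of-core data handling and the carefully engineered per-phase implementation that the abstract emphasizes as crucial for overall efficiency.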

Supplementary Material

MP4 File (gieseke_training_forests.mp4)




Information

Published In

KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
July 2018
2925 pages
ISBN: 9781450355520
DOI: 10.1145/3219819
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. classification and regression trees
  2. ensemble methods
  3. large-scale data analytics
  4. machine learning
  5. random forests
  6. supervised learning

Qualifiers

  • Research-article

Conference

KDD '18

Acceptance Rates

KDD '18 Paper Acceptance Rate: 107 of 983 submissions, 11%
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%

Article Metrics

  • Downloads (last 12 months): 23
  • Downloads (last 6 weeks): 1
Reflects downloads up to 27 Dec 2024

Cited By

  • (2024) STRATA: Random Forests going Serverless. Proceedings of the 25th International Middleware Conference, 22-35. DOI: 10.1145/3652892.3654791. Online publication date: 2-Dec-2024.
  • (2024) An Artificial Lateral Line-Based Active Obstacle Recognition Strategy and Performance Evaluation for Bionic Underwater Robots. IEEE Sensors Journal 24(16), 26266-26277. DOI: 10.1109/JSEN.2024.3417809. Online publication date: 15-Aug-2024.
  • (2024) A novel approach for classification of Tor and non-Tor traffic using efficient feature selection methods. Expert Systems with Applications, 123544. DOI: 10.1016/j.eswa.2024.123544. Online publication date: Mar-2024.
  • (2024) Impact of Media Information on Social Response in Disasters: A Case Study of the Freezing-Rain and Snowstorm Disasters in Southern China in 2008. International Journal of Disaster Risk Science 15(1), 73-87. DOI: 10.1007/s13753-024-00539-9. Online publication date: 22-Feb-2024.
  • (2023) Inverse design of dual-band photonic topological insulator beam splitters for efficient light transmission. Journal of Physics D: Applied Physics 57(13), 135301. DOI: 10.1088/1361-6463/ad14b8. Online publication date: 29-Dec-2023.
  • (2023) Artificial Neural Network-Based Activities Classification, Gait Phase Estimation, and Prediction. Annals of Biomedical Engineering 51(7), 1471-1484. DOI: 10.1007/s10439-023-03151-y. Online publication date: 21-Jan-2023.
  • (2022) Learning that Grid-Convenience Does Not Hurt Resilience in the Presence of Uncertainty. Formal Modeling and Analysis of Timed Systems, 298-306. DOI: 10.1007/978-3-031-15839-1_17. Online publication date: 29-Aug-2022.
  • (2021) BLOCKSET (Block-Aligned Serialized Trees). Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 1170-1179. DOI: 10.1145/3447548.3467368. Online publication date: 14-Aug-2021.
  • (2021) Machine learning for financial transaction classification across companies using character-level word embeddings of text fields. International Journal of Intelligent Systems in Accounting and Finance Management 28(3), 159-172. DOI: 10.1002/isaf.1500. Online publication date: 28-Oct-2021.
  • (2019) On PAC-Bayesian bounds for random forests. Machine Learning. DOI: 10.1007/s10994-019-05803-4. Online publication date: 13-May-2019.
