DOI: 10.1145/3219819.3220124

Training Big Random Forests with Little Resources

Published: 19 July 2018

Abstract

Without access to large compute clusters, building random forests on large datasets is still a challenging problem. This is particularly the case if fully-grown trees are desired. We propose a simple yet effective framework that makes it possible to efficiently construct ensembles of huge trees for hundreds of millions or even billions of training instances using a cheap desktop computer with commodity hardware. The basic idea is a multi-level construction scheme that builds top trees for small random subsets of the available data and subsequently distributes all training instances to the top trees' leaves for further processing. While the scheme is conceptually simple, its overall efficiency crucially depends on the particular implementation of the different phases. The practical merits of our approach are demonstrated using dense datasets with hundreds of millions of training instances.
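
To make the multi-level idea concrete, here is a minimal sketch in Python, assuming in-memory NumPy arrays and scikit-learn decision trees. It is not the authors' implementation, and the names (train_two_level_tree, predict_two_level, subset_size, top_depth) are illustrative only: a shallow top tree is fitted on a small random subset, every instance is then routed to one of its leaves, and a fully-grown bottom tree is trained on each leaf partition.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def train_two_level_tree(X, y, subset_size=100_000, top_depth=8, seed=0):
        """Sketch of one two-level tree: a shallow top tree fitted on a random
        subset routes every instance to a leaf; a fully-grown bottom tree is
        then trained on each (much smaller) leaf partition."""
        rng = np.random.default_rng(seed)
        # Phase 1: fit a shallow top tree on a small random subset of the data.
        idx = rng.choice(len(X), size=min(subset_size, len(X)), replace=False)
        top = DecisionTreeClassifier(max_depth=top_depth, max_features="sqrt")
        top.fit(X[idx], y[idx])
        # Phase 2: distribute *all* training instances to the top tree's leaves.
        leaf_ids = top.apply(X)
        # Phase 3: grow an unpruned bottom tree per leaf partition.
        bottom = {}
        for leaf in np.unique(leaf_ids):
            mask = leaf_ids == leaf
            if np.unique(y[mask]).size > 1:  # skip leaves that are already pure
                tree = DecisionTreeClassifier(max_features="sqrt")  # max_depth=None: fully grown
                bottom[leaf] = tree.fit(X[mask], y[mask])
        return top, bottom

    def predict_two_level(top, bottom, X):
        """Route instances through the top tree, then query the matching bottom tree."""
        leaf_ids = top.apply(X)
        pred = top.predict(X)  # fallback for pure leaves without a bottom tree
        for leaf, tree in bottom.items():
            mask = leaf_ids == leaf
            if mask.any():
                pred[mask] = tree.predict(X[mask])
        return pred

In the paper's setting, one such two-level tree would correspond to a single ensemble member; the sketch deliberately omits the out-of-core data handling and the carefully engineered per-phase implementation that the abstract emphasizes as crucial for overall efficiency.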

Supplementary Material

MP4 File (gieseke_training_forests.mp4)




Information

Published In

KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
July 2018
2925 pages
ISBN: 9781450355520
DOI: 10.1145/3219819
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. classification and regression trees
  2. ensemble methods
  3. large-scale data analytics
  4. machine learning
  5. random forests
  6. supervised learning

Qualifiers

  • Research-article

Conference

KDD '18

Acceptance Rates

KDD '18 Paper Acceptance Rate: 107 of 983 submissions, 11%
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%

Article Metrics

  • Downloads (last 12 months): 23
  • Downloads (last 6 weeks): 1
Reflects downloads up to 27 Dec 2024

Cited By

  • (2024) STRATA: Random Forests going Serverless. Proceedings of the 25th International Middleware Conference, 22-35. DOI: 10.1145/3652892.3654791. Online publication date: 2-Dec-2024.
  • (2024) An Artificial Lateral Line-Based Active Obstacle Recognition Strategy and Performance Evaluation for Bionic Underwater Robots. IEEE Sensors Journal 24(16), 26266-26277. DOI: 10.1109/JSEN.2024.3417809. Online publication date: 15-Aug-2024.
  • (2024) A novel approach for classification of Tor and non-Tor traffic using efficient feature selection methods. Expert Systems with Applications, 123544. DOI: 10.1016/j.eswa.2024.123544. Online publication date: Mar-2024.
  • (2024) Impact of Media Information on Social Response in Disasters: A Case Study of the Freezing-Rain and Snowstorm Disasters in Southern China in 2008. International Journal of Disaster Risk Science 15(1), 73-87. DOI: 10.1007/s13753-024-00539-9. Online publication date: 22-Feb-2024.
  • (2023) Inverse design of dual-band photonic topological insulator beam splitters for efficient light transmission. Journal of Physics D: Applied Physics 57(13), 135301. DOI: 10.1088/1361-6463/ad14b8. Online publication date: 29-Dec-2023.
  • (2023) Artificial Neural Network-Based Activities Classification, Gait Phase Estimation, and Prediction. Annals of Biomedical Engineering 51(7), 1471-1484. DOI: 10.1007/s10439-023-03151-y. Online publication date: 21-Jan-2023.
  • (2022) Learning that Grid-Convenience Does Not Hurt Resilience in the Presence of Uncertainty. Formal Modeling and Analysis of Timed Systems, 298-306. DOI: 10.1007/978-3-031-15839-1_17. Online publication date: 29-Aug-2022.
  • (2021) BLOCKSET (Block-Aligned Serialized Trees). Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 1170-1179. DOI: 10.1145/3447548.3467368. Online publication date: 14-Aug-2021.
  • (2021) Machine learning for financial transaction classification across companies using character-level word embeddings of text fields. International Journal of Intelligent Systems in Accounting and Finance Management 28(3), 159-172. DOI: 10.1002/isaf.1500. Online publication date: 28-Oct-2021.
  • (2019) On PAC-Bayesian bounds for random forests. Machine Learning. DOI: 10.1007/s10994-019-05803-4. Online publication date: 13-May-2019.
