Abstract
Random Forest is a computationally efficient technique that can operate quickly over large datasets. It has been used in many recent research projects and real-world applications in diverse domains. However, the associated literature provides little information about what happens inside the trees of a Random Forest. The research reported here analyzes the frequency with which each attribute appears at the root node of the trees in a Random Forest, in order to determine whether all attributes are used with equal frequency or whether some are used considerably more often than others. We have also analyzed the estimated out-of-bag error of the trees, aiming to check whether the most frequently used attributes yield good performance. Furthermore, we have analyzed whether the use of pre-pruning influences the performance of the Random Forest, again measured by out-of-bag error. Our main conclusions are that the frequency of attributes at the root node exhibits an exponential behavior, and that the estimated out-of-bag error can help to find relevant attributes within the forest. Concerning pre-pruning, we observed that execution time can be reduced without significant loss of performance.
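The kind of analysis described above can be illustrated with a short sketch. This is not the authors' code: it uses scikit-learn (not mentioned in the paper) and a toy dataset to show how one might count which attribute appears at the root node of each tree in a Random Forest and report the forest's estimated out-of-bag error.

```python
# Illustrative sketch only: count root-node attributes across a
# Random Forest and report out-of-bag error, using scikit-learn.
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(
    n_estimators=500, oob_score=True, random_state=0
).fit(X, y)

# In each fitted tree, tree_.feature[0] is the index of the attribute
# tested at the root node.
root_counts = Counter(t.tree_.feature[0] for t in forest.estimators_)
for feature_idx, count in root_counts.most_common():
    print(f"attribute {feature_idx}: root of {count} trees")

print(f"estimated out-of-bag error: {1 - forest.oob_score_:.3f}")
```

On data with a few dominant attributes, the resulting counts are typically far from uniform, which is the kind of skewed (roughly exponential) root-attribute distribution the paper investigates.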
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Oshiro, T.M., Baranauskas, J.A. (2012). Root Attribute Behavior within a Random Forest. In: Yin, H., Costa, J.A.F., Barreto, G. (eds) Intelligent Data Engineering and Automated Learning - IDEAL 2012. IDEAL 2012. Lecture Notes in Computer Science, vol 7435. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-32639-4_87
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-32638-7
Online ISBN: 978-3-642-32639-4
eBook Packages: Computer Science (R0)