Research article. DOI: 10.1145/2998476.2998478

Distributed Decision Tree

Published: 21 October 2016

Abstract

A Decision Tree is a tree-structured plan of attribute tests used to predict an output. MapReduce and Spark are programming models for processing data on a distributed file system. In this paper, the MapReduce and Spark implementations of Decision Tree are named Distributed Decision Tree (DDT) and Spark Tree (ST), respectively. Decision Tree (DT), Ensemble of Trees (BT), DDT, and ST are compared on accuracy, size of tree, and number of leaves of the tree(s) generated. DDT and ST are empirically evaluated on 10 selected datasets. With DDT, the size of the tree is reduced by 71% and 82% compared to DT and BT, respectively; with ST, it is reduced by 48% and 67%. The number of leaves is reduced by 70% and 81% relative to DT and BT, respectively, with DDT, and by 45% and 65% with ST. We also evaluated DDT and ST on the Yahoo! Webscope dataset. Our evaluation shows an improvement in accuracy as well as a reduction in the size of the tree and the number of leaves. Hence, DDT and ST outperform DT and BT with respect to size of tree and number of leaves while maintaining adequate classification accuracy.
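The abstract does not say how DDT or ST build their trees, so the following is only a generic sketch of how one step of decision tree induction, choosing the attribute to split on, can be phrased as map and reduce operations. The toy records, the function names, and the use of information gain as the split criterion are all assumptions made for illustration; this is not the authors' implementation.

```python
import math
from collections import Counter
from functools import reduce

# Hypothetical toy data: each record is (attribute dict, class label).
# Assumption: every record carries every attribute.
RECORDS = [
    ({"outlook": "sunny", "windy": "no"}, "play"),
    ({"outlook": "sunny", "windy": "yes"}, "stay"),
    ({"outlook": "rainy", "windy": "no"}, "play"),
    ({"outlook": "rainy", "windy": "yes"}, "stay"),
]


def map_counts(partition):
    """Map step: count (attribute, value, label) triples in one partition."""
    counts = Counter()
    for attrs, label in partition:
        for name, value in attrs.items():
            counts[(name, value, label)] += 1
    return counts


def reduce_counts(left, right):
    """Reduce step: merge the partial counts of two partitions."""
    return left + right


def entropy(label_counts):
    """Shannon entropy (in bits) of a collection of class counts."""
    total = sum(label_counts)
    return -sum(c / total * math.log2(c / total) for c in label_counts if c)


def best_attribute(counts):
    """Return the attribute with the highest information gain, plus all gains."""
    attributes = sorted({name for name, _, _ in counts})
    # Class totals can be read off any single attribute, since each record
    # contributes exactly one count per attribute (see assumption above).
    class_totals = Counter()
    for (name, _, label), c in counts.items():
        if name == attributes[0]:
            class_totals[label] += c
    n = sum(class_totals.values())
    base = entropy(class_totals.values())

    gains = {}
    for attr in attributes:
        by_value = {}
        for (name, value, _), c in counts.items():
            if name == attr:
                by_value.setdefault(value, []).append(c)
        remainder = sum(sum(cs) / n * entropy(cs) for cs in by_value.values())
        gains[attr] = base - remainder
    return max(gains, key=gains.get), gains


# Two partitions stand in for blocks of a distributed file system.
partitions = [RECORDS[:2], RECORDS[2:]]
merged = reduce(reduce_counts, map(map_counts, partitions))
chosen, gains = best_attribute(merged)
print(chosen, gains)
```

Only the small count tables cross partition boundaries here, not the records themselves, which is the usual reason this step suits a MapReduce-style model.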



Published In

COMPUTE '16: Proceedings of the 9th Annual ACM India Conference
October 2016
178 pages
ISBN: 9781450348089
DOI: 10.1145/2998476
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Classification
  2. Decision Tree
  3. Distributed Decision Tree
  4. Hadoop Map-Reduce
  5. Spark

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ACM COMPUTE '16
ACM COMPUTE '16: Ninth Annual ACM India Conference
October 21 - 23, 2016
Gandhinagar, India

Acceptance Rates

COMPUTE '16 Paper Acceptance Rate 22 of 117 submissions, 19%;
Overall Acceptance Rate 114 of 622 submissions, 18%

Article Metrics

  • Downloads (last 12 months): 24
  • Downloads (last 6 weeks): 2

Reflects downloads up to 01 Jan 2025.

Cited By

  • (2023) Feasibility of Using Machine Learning Algorithms for Yield Prediction of Corn and Sunflower Crops Based on Seeding Date. Studia Universitatis Babeș-Bolyai Informatica 67:2 (21-36). DOI: 10.24193/subbi.2022.2.02. Online publication date: 2-Jun-2023
  • (2023) Big data decision tree for continuous-valued attributes based on unbalanced cut points. Journal of Big Data 10:1. DOI: 10.1186/s40537-023-00816-2. Online publication date: 31-Aug-2023
  • (2023) Collaborative Machine Learning: Schemes, Robustness, and Privacy. IEEE Transactions on Neural Networks and Learning Systems 34:12 (9625-9642). DOI: 10.1109/TNNLS.2022.3169347. Online publication date: Dec-2023
  • (2022) A Super Ensembled and Traditional Models for the Prediction of Rainfall: An Experimental Evaluation of DT Versus DDT Versus RF. Communication and Intelligent Systems (619-635). DOI: 10.1007/978-981-19-2130-8_48. Online publication date: 19-Aug-2022
  • (2021) Time-Efficient Ensemble Learning with Sample Exchange for Edge Computing. ACM Transactions on Internet Technology 21:3 (1-17). DOI: 10.1145/3409265. Online publication date: 16-Jun-2021
  • (2021) To Ameliorate Classification Accuracy using Ensemble Distributed Decision Tree (DDT) Vote Approach: An Empirical discourse of Geographical Data Mining. Procedia Computer Science 184 (935-940). DOI: 10.1016/j.procs.2021.03.116. Online publication date: 2021
  • (2019) Accuracy Prediction for Distributed Decision Tree using Machine Learning approach. 2019 3rd International Conference on Trends in Electronics and Informatics (ICOEI) (1365-1371). DOI: 10.1109/ICOEI.2019.8862580. Online publication date: Apr-2019
  • (2017) Distributed decision tree v.2.0. 2017 IEEE International Conference on Big Data (Big Data) (929-934). DOI: 10.1109/BigData.2017.8258011. Online publication date: Dec-2017
