DOI: 10.1145/502512.502522

Data mining criteria for tree-based regression and classification

Published: 26 August 2001

Abstract

This paper is concerned with the construction of regression and classification trees that are more adapted to data mining applications than conventional trees. To this end, we propose new splitting criteria for growing trees. Conventional splitting criteria attempt to perform well on both sides of a split by attempting a compromise in the quality of fit between the left and the right side. By contrast, we adopt a data mining point of view by proposing criteria that search for interesting subsets of the data, as opposed to modeling all of the data equally well. The new criteria do not split based on a compromise between the left and the right bucket; they effectively pick the more interesting bucket and ignore the other.

As expected, the result is often a simpler characterization of interesting subsets of the data. Less expected is that the new criteria often yield whole trees that provide more interpretable data descriptions. Surprisingly, it is a "flaw" that works to their advantage: The new criteria have an increased tendency to accept splits near the boundaries of the predictor ranges. This so-called "end-cut problem" leads to the repeated peeling of small layers of data and results in very unbalanced but highly expressive and interpretable trees.
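
The contrast drawn in the abstract can be illustrated with a small sketch. The abstract does not give the formulas for the new criteria, so everything below is an assumption made for illustration: variance stands in for quality of fit, the "compromise" score is the size-weighted average of the two bucket variances, and the "one-sided" score is simply the smaller of the two bucket variances. The function names and the toy data are hypothetical and are not taken from the paper.

```python
from statistics import pvariance


def candidate_splits(xs, ys, min_bucket=2):
    """Yield (threshold, left_ys, right_ys) for every cut between sorted x values."""
    pairs = sorted(zip(xs, ys))
    for i in range(min_bucket, len(pairs) - min_bucket + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cut is possible between identical x values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        yield threshold, left, right


def compromise_score(left, right):
    """Assumed conventional criterion: size-weighted average of both bucket variances."""
    n = len(left) + len(right)
    return (len(left) * pvariance(left) + len(right) * pvariance(right)) / n


def one_sided_score(left, right):
    """Assumed one-sided criterion: variance of the purer bucket; the other is ignored."""
    return min(pvariance(left), pvariance(right))


def best_split(xs, ys, score):
    """Return the threshold whose split has the lowest score."""
    return min(candidate_splits(xs, ys), key=lambda s: score(s[1], s[2]))[0]


# Toy data: a tiny, perfectly homogeneous bucket at the low end of x,
# plus a level shift in the response between x = 6 and x = 7.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [5.0, 5.0, 1.0, 3.0, 2.0, 4.0, 11.0, 13.0, 12.0, 14.0]

print(best_split(xs, ys, compromise_score))  # 6.5: balances fit on both sides
print(best_split(xs, ys, one_sided_score))   # 2.5: peels off the small pure bucket
```

On this toy data the compromise score cuts at the level shift, while the one-sided score cuts near the boundary of the predictor range, which is the end-cut behaviour the abstract describes. Applying the one-sided split again to the remaining bucket would peel off further layers and give an unbalanced tree.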

References

[1] Belsley, D. A., Kuh, E., and Welsch, R. E. (1980), Regression Diagnostics, New York, NY: John Wiley & Sons, Inc.
[2] Breiman, L. (1996), "Technical Note: Some Properties of Splitting Criteria," Machine Learning, 24, 41-47.
[3] Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984), Classification and Regression Trees (CART), Pacific Grove, CA: Wadsworth.
[4] Cohen, W. W., and Singer, Y. (1999), "Simple, Fast, and Effective Rule Learner," in: AAAI-99.
[5] Harrison, R. J., and Rubinfeld, D. L. (1978), "Hedonic Prices and the Demand for Clean Air," Journal of Environmental Economics and Management, 5, 81-102.
[6] Merz, C. J., and Murphy, P. M. (1998), UCI repository of machine learning data bases (http://www.ics.uci.edu/~mlearn/MLRepository.html).
[7] Quinlan, J. R. (1993), C4.5: Programs for Machine Learning, San Mateo, CA: Morgan Kaufmann.
[8] Ripley, B. D. (1996), Pattern Recognition and Neural Networks, Cambridge: Cambridge University Press.
[9] StatSci (1995), S-PLUS Guide to Statistical and Mathematical Analysis, Version 3.3, Seattle: MathSoft, Inc.
[10] Venables, W. N., and Ripley, B. D. (1997), Modern Applied Statistics with S-Plus, New York, NY: Springer-Verlag.

Information & Contributors

Information

Published In

KDD '01: Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining
August 2001
493 pages
ISBN: 158113391X
DOI: 10.1145/502512
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Boston Housing data
  2. CART
  3. Pima Indians Diabetes data
  4. splitting criteria

Qualifiers

  • Article

Conference

KDD01
Acceptance Rates

KDD '01 Paper Acceptance Rate 31 of 237 submissions, 13%;
Overall Acceptance Rate 1,089 of 8,328 submissions, 13%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (last 12 months): 23
  • Downloads (last 6 weeks): 2

Reflects downloads up to 13 Dec 2024.

Citations

Cited By

  • (2024) Concise rule induction algorithm based on one-sided maximum decision tree approach. Expert Systems with Applications, 237 (121365). DOI: 10.1016/j.eswa.2023.121365. Online publication date: Mar-2024
  • (2023) FAQT-2: A Customer-Oriented Method for MCDM with Statistical Verification Applied to Industrial Robot Selection. Expert Systems with Applications (120106). DOI: 10.1016/j.eswa.2023.120106. Online publication date: Apr-2023
  • (2022) Fuel Loads and Plant Traits as Community‐Level Predictors of Emergent Properties of Vulnerability and Resilience to a Changing Fire Regime in Black Spruce Forests of Boreal Alaska. Journal of Geophysical Research: Biogeosciences, 127:3. DOI: 10.1029/2021JG006696. Online publication date: 16-Mar-2022
  • (2022) Supervised Machine Learning Approach for Modeling Hot Deformation Behavior of Medium Carbon Steel. steel research international, 94:2. DOI: 10.1002/srin.202200188. Online publication date: 9-Jul-2022
  • (2021) RETRACTED ARTICLE: A swarm-optimized tree-based association rule approach for classifying semi-structured data using soft computing approach. Soft Computing - A Fusion of Foundations, Methodologies and Applications, 25:20 (12745-12758). DOI: 10.1007/s00500-021-06158-6. Online publication date: 1-Oct-2021
  • (2020) A New Splitting Criterion for Better Interpretable Trees. IEEE Access, 8 (62762-62774). DOI: 10.1109/ACCESS.2020.2985255. Online publication date: 2020
  • (2020) Contrast trees and distribution boosting. Proceedings of the National Academy of Sciences, 117:35 (21175-21184). DOI: 10.1073/pnas.1921562117. Online publication date: 19-Aug-2020
  • (2020) Using recursive partitioning to find and estimate heterogenous treatment effects in randomized clinical trials. Journal of Experimental Criminology. DOI: 10.1007/s11292-019-09410-0. Online publication date: 5-Mar-2020
  • (2016) An Assessment of the Effectiveness of Tree-Based Models for Multi-Variate Flood Damage Assessment in Australia. Water, 8:7 (282). DOI: 10.3390/w8070282. Online publication date: 9-Jul-2016
  • (2016) Development of Indian Weighted Diabetic Risk Score (IWDRS) using Machine Learning Techniques for Type-2 Diabetes. Proceedings of the 9th Annual ACM India Conference (125-128). DOI: 10.1145/2998476.2998497. Online publication date: 21-Oct-2016
