
Learning when training data are costly: the effect of class distribution on tree induction

Published: 01 October 2003

Abstract

For large, real-world inductive learning problems, the number of training examples often must be limited due to the costs associated with procuring, preparing, and storing the training examples and/or the computational costs associated with learning from them. In such circumstances, one question of practical importance is: if only n training examples can be selected, in what proportion should the classes be represented? In this article we help to answer this question by analyzing, for a fixed training-set size, the relationship between the class distribution of the training data and the performance of classification trees induced from these data. We study twenty-six data sets and, for each, determine the best class distribution for learning. The naturally occurring class distribution is shown to generally perform well when classifier performance is evaluated using undifferentiated error rate (0/1 loss). However, when the area under the ROC curve is used to evaluate classifier performance, a balanced distribution is shown to perform well. Since neither of these choices for class distribution always generates the best-performing classifier, we introduce a "budget-sensitive" progressive sampling algorithm for selecting training examples based on the class associated with each example. An empirical analysis of this algorithm shows that the class distribution of the resulting training set yields classifiers with good (nearly-optimal) classification performance.
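
To make the final idea concrete, here is a minimal sketch (in Python, using scikit-learn) of a budget-sensitive sampling loop in the spirit of the abstract. It is an illustrative simplification under stated assumptions, not the authors' algorithm: examples are drawn in fixed-size batches up to the budget n, and after each batch a decision tree is refit under a few candidate class mixes (emulated here with sample weights) to pick, by validation AUC, the mix used for the next batch. The function name, the batch size, the candidate mixes, and the weighting trick are all assumptions made for the sketch.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import roc_auc_score

    def budget_sensitive_sample(X_pos, X_neg, X_val, y_val, budget, batch=32):
        """Select up to `budget` training examples, adapting the class mix
        of each batch to the mix that currently scores best on validation
        AUC. Hypothetical sketch; assumes both classes appear in the first
        (balanced) batch so predict_proba has two columns."""
        rng = np.random.default_rng(0)
        pos = list(rng.permutation(len(X_pos)))  # unused positive indices
        neg = list(rng.permutation(len(X_neg)))  # unused negative indices
        X_train, y_train = [], []
        frac_pos = 0.5                           # start from a balanced mix

        while len(y_train) < budget and (pos or neg):
            n = min(batch, budget - len(y_train))
            n_pos = min(round(frac_pos * n), len(pos))
            n_neg = min(n - n_pos, len(neg))
            if n_pos + n_neg == 0:
                break  # one pool is empty and none requested from the other
            X_train += [X_pos[pos.pop()] for _ in range(n_pos)]
            y_train += [1] * n_pos
            X_train += [X_neg[neg.pop()] for _ in range(n_neg)]
            y_train += [0] * n_neg

            # Re-estimate the best class mix: reweight the examples gathered
            # so far to emulate each candidate mix, refit, score on validation.
            best_auc = -1.0
            for cand in (0.3, 0.5, 0.7):
                w = [cand if y == 1 else 1.0 - cand for y in y_train]
                tree = DecisionTreeClassifier(random_state=0)
                tree.fit(X_train, y_train, sample_weight=w)
                auc = roc_auc_score(y_val, tree.predict_proba(X_val)[:, 1])
                if auc > best_auc:
                    best_auc, frac_pos = auc, cand

        # Final tree trained on the budget-limited, mix-adapted selection.
        return DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

A fuller treatment would grow the sample on a geometric schedule, in the spirit of progressive sampling, and would account separately for the procurement cost of each class.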




      Published In

Journal of Artificial Intelligence Research, Volume 19, Issue 1
      July 2003
      652 pages

      Publisher

      AI Access Foundation

      El Segundo, CA, United States

      Publication History

      Published: 01 October 2003
      Received: 01 December 2002
      Published in JAIR Volume 19, Issue 1



      Cited By

• (2024) Active Learning for Data Quality Control: A Survey. Journal of Data and Information Quality, 16(2), 1-45. DOI: 10.1145/3663369. Online publication date: 11-May-2024.
• (2024) A theoretical distribution analysis of synthetic minority oversampling technique (SMOTE) for imbalanced learning. Machine Learning, 113(7), 4903-4923. DOI: 10.1007/s10994-022-06296-4. Online publication date: 1-Jul-2024.
• (2024) Gender prediction based on University students' complex thinking competency: An analysis from machine learning approaches. Education and Information Technologies, 29(3), 2721-2739. DOI: 10.1007/s10639-023-11831-4. Online publication date: 1-Feb-2024.
• (2023) A Best Balance Ratio Ordered Feature Selection Methodology for Robust and Fast Statistical Analysis of Memory Designs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 42(6), 1742-1755. DOI: 10.1109/TCAD.2022.3213762. Online publication date: 1-Jun-2023.
• (2023) A broad review on class imbalance learning techniques. Applied Soft Computing, 143(C). DOI: 10.1016/j.asoc.2023.110415. Online publication date: 1-Aug-2023.
• (2022) A Hybrid Learning Framework for Imbalanced Classification. International Journal of Intelligent Information Technologies, 18(1), 1-15. DOI: 10.4018/IJIIT.306967. Online publication date: 4-Aug-2022.
• (2022) On the interaction between lexicase selection, modularity and data subsets. Proceedings of the Genetic and Evolutionary Computation Conference Companion, 586-589. DOI: 10.1145/3520304.3528765. Online publication date: 9-Jul-2022.
• (2022) Classifier Construction Under Budget Constraints. Proceedings of the 2022 International Conference on Management of Data, 1160-1174. DOI: 10.1145/3514221.3517863. Online publication date: 10-Jun-2022.
• (2022) Predicting the specific substrate for transmembrane transport proteins using BERT language model. 2022 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), 1-8. DOI: 10.1109/CIBCB55180.2022.9863051. Online publication date: 15-Aug-2022.
• (2022) Multi-objective evolutionary optimization using the relationship between F1 and accuracy metrics in classification tasks. Applied Intelligence, 49(9), 3447-3463. DOI: 10.1007/s10489-019-01447-y. Online publication date: 10-Mar-2022.
