
Learning when training data are costly: the effect of class distribution on tree induction

Published: 01 October 2003

Abstract

For large, real-world inductive learning problems, the number of training examples often must be limited due to the costs associated with procuring, preparing, and storing the training examples and/or the computational costs associated with learning from them. In such circumstances, one question of practical importance is: if only n training examples can be selected, in what proportion should the classes be represented? In this article we help to answer this question by analyzing, for a fixed training-set size, the relationship between the class distribution of the training data and the performance of classification trees induced from these data. We study twenty-six data sets and, for each, determine the best class distribution for learning. The naturally occurring class distribution is shown to generally perform well when classifier performance is evaluated using undifferentiated error rate (0/1 loss). However, when the area under the ROC curve is used to evaluate classifier performance, a balanced distribution is shown to perform well. Since neither of these choices for class distribution always generates the best-performing classifier, we introduce a "budget-sensitive" progressive sampling algorithm for selecting training examples based on the class associated with each example. An empirical analysis of this algorithm shows that the class distribution of the resulting training set yields classifiers with good (nearly-optimal) classification performance.
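
To make the final idea concrete, here is a minimal sketch (in Python, using scikit-learn) of a budget-sensitive sampling loop in the spirit of the abstract. It is an illustrative simplification under stated assumptions, not the authors' algorithm: examples are drawn in fixed-size batches up to the budget n, and after each batch a decision tree is refit under a few candidate class mixes (emulated here with sample weights) to pick, by validation AUC, the mix used for the next batch. The function name, the batch size, the candidate mixes, and the weighting trick are all assumptions made for the sketch.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import roc_auc_score

    def budget_sensitive_sample(X_pos, X_neg, X_val, y_val, budget, batch=32):
        """Select up to `budget` training examples, adapting the class mix
        of each batch to the mix that currently scores best on validation
        AUC. Hypothetical sketch; assumes both classes appear in the first
        (balanced) batch so predict_proba has two columns."""
        rng = np.random.default_rng(0)
        pos = list(rng.permutation(len(X_pos)))  # unused positive indices
        neg = list(rng.permutation(len(X_neg)))  # unused negative indices
        X_train, y_train = [], []
        frac_pos = 0.5                           # start from a balanced mix

        while len(y_train) < budget and (pos or neg):
            n = min(batch, budget - len(y_train))
            n_pos = min(round(frac_pos * n), len(pos))
            n_neg = min(n - n_pos, len(neg))
            if n_pos + n_neg == 0:
                break  # one pool is empty and none requested from the other
            X_train += [X_pos[pos.pop()] for _ in range(n_pos)]
            y_train += [1] * n_pos
            X_train += [X_neg[neg.pop()] for _ in range(n_neg)]
            y_train += [0] * n_neg

            # Re-estimate the best class mix: reweight the examples gathered
            # so far to emulate each candidate mix, refit, score on validation.
            best_auc = -1.0
            for cand in (0.3, 0.5, 0.7):
                w = [cand if y == 1 else 1.0 - cand for y in y_train]
                tree = DecisionTreeClassifier(random_state=0)
                tree.fit(X_train, y_train, sample_weight=w)
                auc = roc_auc_score(y_val, tree.predict_proba(X_val)[:, 1])
                if auc > best_auc:
                    best_auc, frac_pos = auc, cand

        # Final tree trained on the budget-limited, mix-adapted selection.
        return DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

A fuller treatment would grow the sample on a geometric schedule, in the spirit of progressive sampling, and would account separately for the procurement cost of each class.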




      Published In

Journal of Artificial Intelligence Research, Volume 19, Issue 1
      July 2003
      652 pages

      Publisher

      AI Access Foundation

      El Segundo, CA, United States

      Publication History

      Published: 01 October 2003
      Received: 01 December 2002
      Published in JAIR Volume 19, Issue 1



      Cited By

• (2024) Active Learning for Data Quality Control: A Survey. Journal of Data and Information Quality, 16(2), 1-45. DOI: 10.1145/3663369. Online publication date: 11-May-2024.
• (2024) A theoretical distribution analysis of synthetic minority oversampling technique (SMOTE) for imbalanced learning. Machine Learning, 113(7), 4903-4923. DOI: 10.1007/s10994-022-06296-4. Online publication date: 1-Jul-2024.
• (2024) Gender prediction based on University students' complex thinking competency: An analysis from machine learning approaches. Education and Information Technologies, 29(3), 2721-2739. DOI: 10.1007/s10639-023-11831-4. Online publication date: 1-Feb-2024.
• (2023) A Best Balance Ratio Ordered Feature Selection Methodology for Robust and Fast Statistical Analysis of Memory Designs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 42(6), 1742-1755. DOI: 10.1109/TCAD.2022.3213762. Online publication date: 1-Jun-2023.
• (2023) A broad review on class imbalance learning techniques. Applied Soft Computing, 143(C). DOI: 10.1016/j.asoc.2023.110415. Online publication date: 1-Aug-2023.
• (2022) A Hybrid Learning Framework for Imbalanced Classification. International Journal of Intelligent Information Technologies, 18(1), 1-15. DOI: 10.4018/IJIIT.306967. Online publication date: 4-Aug-2022.
• (2022) On the interaction between lexicase selection, modularity and data subsets. Proceedings of the Genetic and Evolutionary Computation Conference Companion, 586-589. DOI: 10.1145/3520304.3528765. Online publication date: 9-Jul-2022.
• (2022) Classifier Construction Under Budget Constraints. Proceedings of the 2022 International Conference on Management of Data, 1160-1174. DOI: 10.1145/3514221.3517863. Online publication date: 10-Jun-2022.
• (2022) Predicting the specific substrate for transmembrane transport proteins using BERT language model. 2022 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), 1-8. DOI: 10.1109/CIBCB55180.2022.9863051. Online publication date: 15-Aug-2022.
• (2022) Multi-objective evolutionary optimization using the relationship between F1 and accuracy metrics in classification tasks. Applied Intelligence, 49(9), 3447-3463. DOI: 10.1007/s10489-019-01447-y. Online publication date: 10-Mar-2022.
