
KBA: Kernel Boundary Alignment Considering Imbalanced Data Distribution

Published: 01 June 2005

Abstract

An imbalanced training data set can pose serious problems for many real-world data mining tasks that employ SVMs to conduct supervised learning. In this paper, we propose a kernel-boundary-alignment algorithm, which treats the training data imbalance as prior information to augment SVMs and improve class-prediction accuracy. Using a simple example, we first show that SVMs can suffer from high incidences of false negatives when the training instances of the target class are heavily outnumbered by the training instances of a nontarget class. The remedy we propose is to adjust the class boundary by modifying the kernel matrix according to the imbalanced data distribution. Through theoretical analysis backed by empirical study, we show that our kernel-boundary-alignment algorithm works effectively on several data sets.
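
The abstract does not spell out the algorithm, so the following is only a generic illustration of what "modifying the kernel matrix according to the imbalanced data distribution" can look like. It rescales a precomputed RBF kernel matrix as K'_ij = d_i d_j K_ij, with a larger factor d for the minority class. The function names, the inverse-frequency factor, and the toy data are assumptions made for this sketch; this is not the authors' KBA procedure.

```python
import numpy as np

def rbf_kernel_matrix(X, gamma=1.0):
    """Plain RBF kernel matrix K_ij = exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def rescale_kernel_by_class(K, y):
    """Hypothetical class-dependent rescaling K'_ij = d_i * d_j * K_ij.

    d_i is sqrt(n_neg / n_pos) for minority (+1) instances and 1 otherwise.
    This factor is an illustrative assumption, not taken from the paper.
    """
    y = np.asarray(y)
    n_pos = np.sum(y == 1)
    n_neg = np.sum(y == -1)
    d = np.where(y == 1, np.sqrt(n_neg / n_pos), 1.0)
    # diag(d) @ K @ diag(d) stays symmetric positive semidefinite
    return d[:, None] * K * d[None, :]

# Toy imbalanced data: 20 nontarget (-1) and 4 target (+1) instances.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (20, 2)), rng.normal(2.0, 1.0, (4, 2))])
y = np.array([-1] * 20 + [1] * 4)

K = rbf_kernel_matrix(X, gamma=0.5)
K_mod = rescale_kernel_by_class(K, y)
```

Because the rescaling has the form diag(d) K diag(d), the modified matrix remains a valid kernel matrix and could, for instance, be passed to an SVM implementation that accepts precomputed kernels. Whether such a rescaling actually reduces false negatives depends on the data and on the choice of factor.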

Published In

IEEE Transactions on Knowledge and Data Engineering, Volume 17, Issue 6
June 2005
144 pages

Publisher

IEEE Educational Activities Department

United States

Author Tags

  1. Imbalanced-data training
  2. supervised classification
  3. support vector machines

Qualifiers

  • Research-article

Cited By

  • (2024) An empirical evaluation of imbalanced data strategies from a practitioner’s point of view. Expert Systems with Applications: An International Journal, 256:C. DOI: 10.1016/j.eswa.2024.124863. Online publication date: 5-Dec-2024
  • (2024) Cost aware LSTM model for predicting hard disk drive failures based on extremely imbalanced S.M.A.R.T. sensors data. Engineering Applications of Artificial Intelligence, 127:PB. DOI: 10.1016/j.engappai.2023.107339. Online publication date: 1-Jan-2024
  • (2024) A multi-species pest recognition and counting method based on a density map in the greenhouse. Computers and Electronics in Agriculture, 217:C. DOI: 10.1016/j.compag.2023.108554. Online publication date: 1-Feb-2024
  • (2024) Efficient and robust active learning methods for interactive database exploration. The VLDB Journal — The International Journal on Very Large Data Bases, 33:4 (931-956). DOI: 10.1007/s00778-023-00816-x. Online publication date: 1-Jul-2024
  • (2024) Methods for class-imbalanced learning with support vector machines: a review and an empirical evaluation. Soft Computing - A Fusion of Foundations, Methodologies and Applications, 28:20 (11873-11894). DOI: 10.1007/s00500-024-09931-5. Online publication date: 1-Oct-2024
  • (2023) Forecasting cholera disease using SARIMA and LSTM models with discrete wavelet transform as feature selection. Journal of Intelligent & Fuzzy Systems: Applications in Engineering and Technology, 45:3 (3901-3913). DOI: 10.3233/JIFS-223901. Online publication date: 1-Jan-2023
  • (2023) Identification of Dry Bean Varieties Based on Multiple Attributes Using CatBoost Machine Learning Algorithm. Scientific Programming, 2023. DOI: 10.1155/2023/2556066. Online publication date: 1-Jan-2023
  • (2023) Cost-Sensitive Online Adaptive Kernel Learning for Large-Scale Imbalanced Classification. IEEE Transactions on Knowledge and Data Engineering, 35:10 (10554-10568). DOI: 10.1109/TKDE.2023.3266648. Online publication date: 12-Apr-2023
  • (2023) Travel Mode Choice Prediction Using Imbalanced Machine Learning. IEEE Transactions on Intelligent Transportation Systems, 24:4 (3795-3808). DOI: 10.1109/TITS.2023.3237681. Online publication date: 1-Apr-2023
  • (2023) A Best Balance Ratio Ordered Feature Selection Methodology for Robust and Fast Statistical Analysis of Memory Designs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 42:6 (1742-1755). DOI: 10.1109/TCAD.2022.3213762. Online publication date: 1-Jun-2023
