
Using Voronoi diagrams to improve classification performances when modeling imbalanced datasets

Published: 01 July 2015

Abstract

An over-sampling technique called V-synth is proposed and compared to borderline SMOTE (bSMOTE), a common methodology for balancing an imbalanced dataset for classification purposes. V-synth is a machine learning methodology that generates synthetic minority points based on the properties of a Voronoi diagram. A Voronoi diagram is a collection of geometric regions, each encapsulating one of a set of generating points in such a way that any location within a region is closer to its generating point than to any other generating point. Because of properties inherent to Voronoi diagrams, V-synth identifies exclusive regions of feature space where it is ideal to create synthetic minority samples. To test the generalization and applicability of V-synth, six datasets from various problem domains were selected from the University of California Irvine's Machine Learning Repository. Though not always guaranteed due to the random nature of synthetic over-sampling, significant evidence is presented that V-synth more consistently leads to more accurate and better-balanced classification models than bSMOTE when the classification complexity of a dataset is high.
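The abstract contrasts V-synth with SMOTE-style over-samplers such as bSMOTE. As background (not code from the paper), the interpolation step that SMOTE-family methods share can be sketched in plain Python: each synthetic minority point is placed on the segment between a minority sample and one of its k nearest minority-class neighbours. The function names and parameters below are illustrative, and this sketch does not implement V-synth's Voronoi-based region selection.

```python
import math
import random

def nearest_neighbors(points, idx, k):
    """Indices of the k nearest neighbours of points[idx] (Euclidean)."""
    dists = [(math.dist(points[idx], p), j)
             for j, p in enumerate(points) if j != idx]
    dists.sort()
    return [j for _, j in dists[:k]]

def smote(minority, n_synthetic, k=3, rng=None):
    """SMOTE-style interpolation: each synthetic point lies on the segment
    between a randomly chosen minority sample and one of its k nearest
    minority-class neighbours."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))          # pick a minority sample
        j = rng.choice(nearest_neighbors(minority, i, k))
        t = rng.random()                          # interpolation factor in [0, 1)
        synthetic.append(tuple(a + t * (b - a)
                               for a, b in zip(minority[i], minority[j])))
    return synthetic
```

Because interpolated points always fall between existing minority samples, they stay inside the minority class's convex hull; V-synth's contribution, per the abstract, is to constrain where such points may be generated using the geometry of a Voronoi diagram rather than raw nearest-neighbour segments.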

References

[1]
Chawla N, Bowyer K, Hall L, Kegelmeyer W (2002) SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357
[2]
Han H, Wang W, Mao B (2005) Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In: Huang D-S, Zhang X-P, Huang G-B (eds) Advances in intelligent computing. Springer, Berlin, pp 878–887
[3]
Chawla NV (2005) Data mining for imbalanced datasets: an overview. In: Maimon O, Rokach L (eds) Data mining and knowledge discovery handbook. Springer, New York, pp 853–857
[4]
Van Hulse J, Khoshgoftaar T (2009) Knowledge discovery from imbalanced and noisy data. Data Knowl Eng 68(12):1513–1542
[5]
He H, Garcia E (2009) Learning from imbalanced data. IEEE Trans Knowl Data Eng 21(9):1263–1284
[6]
Kotsiantis S, Kanellopoulos D, Pintelas P (2006) Handling imbalanced datasets: a review. GESTS Int Trans Comput Sci Eng 30:25–36
[7]
Seiffert C, Khoshgoftaar TM, Hulse JV, Napolitano A (2008) Building useful models from imbalanced data with sampling and boosting. In: FLAIRS conference, pp 306–311
[8]
Estabrooks A, Jo T, Japkowicz N (2004) A multiple resampling method for learning from imbalanced data sets. Comput Intell 20(1):18–36
[9]
Ezawa K, Singh M, Norton S (1996) Learning goal oriented Bayesian networks for telecommunications management. In: Proceedings of the international conference on machine learning, Bari, Italy
[10]
Guo H, Viktor HL (2004) Learning from imbalanced data sets with boosting and data generation: the DataBoost-IM approach. SIGKDD Explor Newsl 6(1):30–39
[11]
Frank A, Asuncion A, Census income data set. University of California, Irvine [Online]. http://archive.ics.uci.edu/ml/datasets/Census+Income
[12]
Frank A, Asuncion A, Credit approval data set. University of California, Irvine [Online]. http://archive.ics.uci.edu/ml/datasets/Credit+Approval
[13]
Chawla NV (2010) Data mining for imbalanced datasets: an overview. In: Maimon O, Rokach L (eds) Data mining and knowledge discovery handbook. Springer, New York, pp 875–886
[14]
Frank A, Asuncion A, Vertebral column data set. University of California, Irvine [Online]. http://archive.ics.uci.edu/ml/datasets/Vertebral+Column
[15]
Frank A, Asuncion A, Ecoli data set. University of California, Irvine [Online]. http://archive.ics.uci.edu/ml/datasets/Ecoli
[16]
Frank A, Asuncion A, Yeast data set. University of California, Irvine [Online]. http://archive.ics.uci.edu/ml/datasets/Yeast
[17]
Ertekin S, Huang J, Bottou L, Giles L (2007) Learning on the border: active learning in imbalanced data classification. In: Proceedings of the sixteenth ACM conference on information and knowledge management, 2007
[18]
Japkowicz N, Stephen S (2002) The class imbalance problem: a systematic study. Intell Data Anal 6(5):429–449
[19]
Young WA, Holland WS, Weckman GR (2008) Determining hall of fame status for major league baseball using an artificial neural network. J Quant Anal Sports 4(4):1–44
[20]
Duchesnay E, Cachia A, Boddaert N, Chabane N, Mangin J, Martinot JBF, Zilbovicius M (2011) Feature selection and classification of imbalanced datasets: application to PET images of children with autistic spectrum disorders. NeuroImage 57(3):1003–1014
[21]
Weckman G, Paschold H, Dowler J, Whiting H, Young W (2010) Using neural networks with limited data to estimate manufacturing cost. J Indus Syst Eng 3(4):257–274
[22]
Chen C, Liaw A, Breiman L (2006) Using random forest to learn imbalanced data. Department of Statistics, University of California, Berkeley
[23]
Drummond C, Holte R (2003) C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In: ICML'03 workshop on learning from imbalanced data sets, 2003
[24]
Jiang X, El-Kareh R, Ohno-Machado L (2011) Improving predictions in imbalanced data using pairwise expanded logistic regression. In: AMIA annual symposium proceedings archive, 2011
[25]
Nguyen H, Cooper E, Kamei K (2009) Borderline over-sampling for imbalanced data classification. In: Fifth international workshop on computational intelligence and applications, Japan, 2009
[26]
Toribio P, Alejo R, Valdovinos R, Pacheco-Sanchez JH (2012) Using Gabriel graphs in Borderline-SMOTE to deal with severe two-class imbalance problems on neural networks. In: Proceedings of CCIA, pp 29–36
[27]
Saez J, Luengo J, Stefanowski J, Herrera F (2015) SMOTE–IPF: addressing the noisy and borderline examples problem in imbalanced classification by a re-sampling method with filtering. Inf Sci 291:184–203
[28]
Frank A, Asuncion A (2010) UCI machine learning repository [Online]. http://archive.ics.uci.edu/ml
[29]
Fawcett T (2004) ROC graphs: notes and practical considerations for researchers. HP Laboratories, Palo Alto
[30]
Millie D, Weckman G, Young W, Ivey J, Carrick H, Fahnenstiel G (2012) Modeling microalga abundance with artificial neural networks: demonstration of a heuristic, 'Grey-Box' technique to deconvolve and quantify environmental influences. Environ Model Softw 37:27–39
[31]
Visa S, Ralescu A (2005) Issues in mining imbalanced data sets: a review paper. In: Proceedings of the sixteenth midwest artificial intelligence and cognitive science conference, 2005
[32]
Lopez V, Fernandez A, Jose del Jesus M, Herrera F (2012) Cost sensitive and preprocessing for classification with imbalanced data-sets: similar behaviour and potential hybridizations. Proc ICPRAM 2:98–107
[33]
Weissenbacher A, Kasess B, Gerstl F, Lanzenberger R, Moser E, Windischberger C (2009) Correlations and anticorrelations in resting-state functional connectivity MRI: a quantitative comparison of preprocessing strategies. NeuroImage 47(4):1408–1416
[34]
Debar H, Wespi A (2001) Aggregation and correlation of intrusion-detection alerts. Recent Adv Intrusion Detect, Lect Notes Comput Sci 2212:85–103
[35]
Aboy M, McNames J, Thong T, Tsunami D, Ellenby M, Goldstein B (2005) An automatic beat detection algorithm for pressure signals. IEEE Trans Biomed Eng 52(10):1662–1670
[36]
Sack J (1999) Handbook of computational geometry. Elsevier, Amsterdam
[37]
Voronoi G (1907) Nouvelles applications des paramètres continus à la théorie des formes quadratiques. Journal für die reine und angewandte Mathematik 97–178
[38]
Chew P (2007) Delaunay triangulation [Online]. http://www.cs.cornell.edu/home/chew/Delaunay.html
[39]
Barber CB, Dobkin DP, Huhdanpaa H (1996) The quickhull algorithm for convex hulls. ACM Trans Math Softw 22(4):469–483
[40]
Coxeter H (1973) Regular polytopes, 3rd edn. Dover Publications Inc, New York
[41]
Kohavi R (1996) Scaling up the accuracy of naive-Bayes classifiers: a decision-tree hybrid. In: Proceedings of the second international conference on knowledge discovery and data mining, 1996
[42]
Quinlan J (1987) Simplifying decision trees. Int J Man Mach Stud 27(3):221–234
[43]
Frank A, Asuncion A, Haberman's survival data set. University of California, Irvine [Online]. http://archive.ics.uci.edu/ml/datasets/Haberman%27s+Survival
[44]
Haberman SJ (1976) Generalized residuals for log-linear models. In: Proceedings of the 9th international biometrics conference, Boston
[45]
Horton P, Nakai K (1996) A probabilistic classification system for predicting the cellular localization sites of proteins. In: Proceedings of the fourth international conference on intelligent systems for molecular biology, pp 109–115
[46]
Ho T, Basu M (2002) Complexity measures of supervised classification problems. IEEE Trans Pattern Anal Mach Intell 24(3):289–300

Published In

Neural Computing and Applications, Volume 26, Issue 5, July 2015, 249 pages
ISSN: 0941-0643, EISSN: 1433-3058

Publisher

Springer-Verlag, Berlin, Heidelberg

    Author Tags

    1. Data engineering
    2. Data mining
    3. Imbalanced datasets
    4. Knowledge extraction
    5. Numerical algorithms
    6. Synthetic over-sampling


    Cited By

• (2024) Iterative minority oversampling and its ensemble for ordinal imbalanced datasets. Engineering Applications of Artificial Intelligence 127(PA). Online publication date: 1-Feb-2024
• (2021) LoRAS: an oversampling approach for imbalanced datasets. Machine Learning 110(2):279–301. Online publication date: 1-Feb-2021
• (2021) A Hybrid Framework for Class-Imbalanced Classification. Wireless Algorithms, Systems, and Applications, pp 301–313. Online publication date: 25-Jun-2021
• (2020) Balanced training of a hybrid ensemble method for imbalanced datasets: a case of emergency department readmission prediction. Neural Computing and Applications 32(10):5735–5744. Online publication date: 1-May-2020
• (2018) SMOTE for learning from imbalanced data. Journal of Artificial Intelligence Research 61(1):863–905. Online publication date: 1-Jan-2018
