
Using Voronoi diagrams to improve classification performances when modeling imbalanced datasets

Published: 01 July 2015

Abstract

An over-sampling technique called V-synth is proposed and compared to borderline SMOTE (bSMOTE), a common methodology for balancing an imbalanced dataset for classification purposes. V-synth is a machine learning methodology that generates synthetic minority points based on the properties of a Voronoi diagram. A Voronoi diagram is a collection of geometric regions, each encapsulating one of a set of generating points in such a way that any location within a region is closer to its generating point than to any other generating point. Because of properties inherent to Voronoi diagrams, V-synth identifies exclusive regions of feature space where it is ideal to create synthetic minority samples. To test the generalization and applicability of V-synth, six datasets from various problem domains were selected from the University of California Irvine's Machine Learning Repository. Though not always guaranteed due to the random nature of synthetic over-sampling, significant evidence is presented that V-synth more consistently leads to more accurate and better-balanced classification models than bSMOTE when the classification complexity of a dataset is high.
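The abstract contrasts V-synth with SMOTE-style over-samplers such as bSMOTE. As background (not code from the paper), the interpolation step that SMOTE-family methods share can be sketched in plain Python: each synthetic minority point is placed on the segment between a minority sample and one of its k nearest minority-class neighbours. The function names and parameters below are illustrative, and this sketch does not implement V-synth's Voronoi-based region selection.

```python
import math
import random

def nearest_neighbors(points, idx, k):
    """Indices of the k nearest neighbours of points[idx] (Euclidean)."""
    dists = [(math.dist(points[idx], p), j)
             for j, p in enumerate(points) if j != idx]
    dists.sort()
    return [j for _, j in dists[:k]]

def smote(minority, n_synthetic, k=3, rng=None):
    """SMOTE-style interpolation: each synthetic point lies on the segment
    between a randomly chosen minority sample and one of its k nearest
    minority-class neighbours."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))          # pick a minority sample
        j = rng.choice(nearest_neighbors(minority, i, k))
        t = rng.random()                          # interpolation factor in [0, 1)
        synthetic.append(tuple(a + t * (b - a)
                               for a, b in zip(minority[i], minority[j])))
    return synthetic
```

Because interpolated points always fall between existing minority samples, they stay inside the minority class's convex hull; V-synth's contribution, per the abstract, is to constrain where such points may be generated using the geometry of a Voronoi diagram rather than raw nearest-neighbour segments.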

References

[1]
Chawla N, Bowyer K, Hall L, Kegelmeyer W (2002) SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357
[2]
Han H, Wang W, Mao B (2005) Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In: Huang D-S, Zhang X-P, Huang G-B (eds) Advances in intelligent computing. Springer, Berlin, pp 878–887
[3]
Chawla NV (2005) Data mining for imbalanced datasets: an overview. In: Maimon O, Rokach L (eds) Data mining and knowledge discovery handbook. Springer, New York, pp 853–857
[4]
Van Hulse J, Khoshgoftaar T (2009) Knowledge discovery from imbalanced and noisy data. Data Knowl Eng 68(12):1513–1542
[5]
He H, Garcia E (2009) Learning from imbalanced data. IEEE Trans Knowl Data Eng 21(9):1263–1284
[6]
Kotsiantis S, Kanellopoulos D, Pintelas P (2006) Handling imbalanced datasets: a review. GESTS Int Trans Comput Sci Eng 30:25–36
[7]
Seiffert C, Khoshgoftaar TM, Hulse JV, Napolitano A (2008) Building useful models from imbalanced data with sampling and boosting. In: FLAIRS conference, pp 306–311
[8]
Estabrooks A, Jo T, Japkowicz N (2004) A multiple resampling method for learning from imbalanced data sets. Comput Intell 20(1):18–36
[9]
Ezawa K, Singh M, Norton S (1996) Learning goal oriented Bayesian networks for telecommunications management. In: Proceedings of the international conference on machine learning, Bari, Italy
[10]
Guo H, Viktor HL (2004) Learning from imbalanced data sets with boosting and data generation: the DataBoost-IM approach. SIGKDD Explor Newsl 6(1):30–39
[11]
Frank A, Asuncion A, Census income data set. University of California, Irvine [Online]. http://archive.ics.uci.edu/ml/datasets/Census+Income
[12]
Frank A, Asuncion A, Credit approval data set. University of California, Irvine [Online]. http://archive.ics.uci.edu/ml/datasets/Credit+Approval
[13]
Chawla NV (2010) Data mining for imbalanced datasets: an overview. In: Maimon O, Rokach L (eds) Data mining and knowledge discovery handbook. Springer, New York, pp 875–886
[14]
Frank A, Asuncion A, Vertebral column data set. University of California, Irvine [Online]. http://archive.ics.uci.edu/ml/datasets/Vertebral+Column
[15]
Frank A, Asuncion A, Ecoli data set. University of California, Irvine [Online]. http://archive.ics.uci.edu/ml/datasets/Ecoli
[16]
Frank A, Asuncion A, Yeast data set. University of California, Irvine [Online]. http://archive.ics.uci.edu/ml/datasets/Yeast
[17]
Ertekin S, Huang J, Bottou L, Giles L (2007) Learning on the border: active learning in imbalanced data classification. In: Proceedings of the sixteenth ACM conference on information and knowledge management, 2007
[18]
Japkowicz N, Stephen S (2002) The class imbalance problem: a systematic study. Intell Data Anal 6(5):429–449
[19]
Young WA, Holland WS, Weckman GR (2008) Determining hall of fame status for major league baseball using an artificial neural network. J Quant Anal Sports 4(4):1–44
[20]
Duchesnay E, Cachia A, Boddaert N, Chabane N, Mangin J, Martinot JBF, Zilbovicius M (2011) Feature selection and classification of imbalanced datasets: application to PET images of children with autistic spectrum disorders. NeuroImage 57(3):1003–1014
[21]
Weckman G, Paschold H, Dowler J, Whiting H, Young W (2010) Using neural networks with limited data to estimate manufacturing cost. J Indus Syst Eng 3(4):257–274
[22]
Chen C, Liaw A, Breiman L (2006) Using random forest to learn imbalanced data. Department of Statistics, University of California, Berkeley
[23]
Drummond C, Holte R (2003) C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In: ICML'03 workshop on learning from imbalanced data sets, 2003
[24]
Jiang X, El-Kareh R, Ohno-Machado L (2011) Improving predictions in imbalanced data using pairwise expanded logistic regression. In: AMIA annual symposium proceedings archive, 2011
[25]
Nguyen H, Cooper E, Kamei K (2009) Borderline over-sampling for imbalanced data classification. In: Fifth international workshop on computational intelligence and applications, Japan, 2009
[26]
Toribio P, Alejo R, Valdovinos R, Pacheco-Sanchez JH (2012) Using Gabriel graphs in Borderline-SMOTE to deal with severe two-class imbalance problems on neural networks. In: Proceedings of CCIA, pp 29–36
[27]
Saez J, Luengo J, Stefanowski J, Herrera F (2015) SMOTE–IPF: addressing the noisy and borderline examples problem in imbalanced classification by a re-sampling method with filtering. Inf Sci 291:184–203
[28]
Frank A, Asuncion A (2010) UCI machine learning repository [Online]. http://archive.ics.uci.edu/ml
[29]
Fawcett T (2004) ROC graphs: notes and practical considerations for researchers. HP Laboratories, Palo Alto
[30]
Millie D, Weckman G, Young W, Ivey J, Carrick H, Fahnenstiel G (2012) Modeling microalga abundance with artificial neural networks: demonstration of a heuristic, 'Grey-Box' technique to deconvolve and quantify environmental influences. Environ Model Softw 37:27–39
[31]
Visa S, Ralescu A (2005) Issues in mining imbalanced data sets: a review paper. In: Proceedings of the sixteenth midwest artificial intelligence and cognitive science conference, 2005
[32]
Lopez V, Fernandez A, Jose del Jesus M, Herrera F (2012) Cost sensitive and preprocessing for classification with imbalanced data-sets: similar behaviour and potential hybridizations. Proc ICPRAM 2:98–107
[33]
Weissenbacher A, Kasess B, Gerstl F, Lanzenberger R, Moser E, Windischberger C (2009) Correlations and anticorrelations in resting-state functional connectivity MRI: a quantitative comparison of preprocessing strategies. NeuroImage 47(4):1408–1416
[34]
Debar H, Wespi A (2001) Aggregation and correlation of intrusion-detection alerts. Recent Adv Intrusion Detect, Lect Notes Comput Sci 2212:85–103
[35]
Aboy M, McNames J, Thong T, Tsunami D, Ellenby M, Goldstein B (2005) An automatic beat detection algorithm for pressure signals. IEEE Trans Biomed Eng 52(10):1662–1670
[36]
Sack J (1999) Handbook of computational geometry. Elsevier, Amsterdam
[37]
Voronoi G (1907) Nouvelles applications des paramètres continus à la théorie des formes quadratiques. Journal für die reine und angewandte Mathematik 97–178
[38]
Chew P (2007) Delaunay triangulation [Online]. http://www.cs.cornell.edu/home/chew/Delaunay.html
[39]
Barber CB, Dobkin DP, Huhdanpaa H (1996) The quickhull algorithm for convex hulls. ACM Trans Math Softw 22(4):469–483
[40]
Coxeter H (1973) Regular polytopes, 3rd edn. Dover Publications Inc, New York
[41]
Kohavi R (1996) Scaling up the accuracy of naive-Bayes classifiers: a decision-tree hybrid. In: Proceedings of the second international conference on knowledge discovery and data mining, 1996
[42]
Quinlan J (1987) Simplifying decision trees. Int J Man Mach Stud 27(3):221–234
[43]
Frank A, Asuncion A, Haberman's survival data set. University of California, Irvine [Online]. http://archive.ics.uci.edu/ml/datasets/Haberman%27s+Survival
[44]
Haberman SJ (1976) Generalized residuals for log-linear models. In: Proceedings of the 9th international biometrics conference, Boston
[45]
Horton P, Nakai K (1996) A probabilistic classification system for predicting the cellular localization sites of proteins. In: Proceedings of the fourth international conference on intelligent systems for molecular biology, pp 109–115
[46]
Ho T, Basu M (2002) Complexity measures of supervised classification problems. IEEE Trans Pattern Anal Mach Intell 24(3):289–300

Published In

Neural Computing and Applications, Volume 26, Issue 5, July 2015, 249 pages
ISSN: 0941-0643, EISSN: 1433-3058

Publisher

Springer-Verlag, Berlin, Heidelberg

    Author Tags

    1. Data engineering
    2. Data mining
    3. Imbalanced datasets
    4. Knowledge extraction
    5. Numerical algorithms
    6. Synthetic over-sampling


    Cited By

• (2024) Iterative minority oversampling and its ensemble for ordinal imbalanced datasets. Engineering Applications of Artificial Intelligence 127(PA). Online publication date: 1-Feb-2024
• (2021) LoRAS: an oversampling approach for imbalanced datasets. Machine Learning 110(2):279–301. Online publication date: 1-Feb-2021
• (2021) A Hybrid Framework for Class-Imbalanced Classification. Wireless Algorithms, Systems, and Applications, pp 301–313. Online publication date: 25-Jun-2021
• (2020) Balanced training of a hybrid ensemble method for imbalanced datasets: a case of emergency department readmission prediction. Neural Computing and Applications 32(10):5735–5744. Online publication date: 1-May-2020
• (2018) SMOTE for learning from imbalanced data. Journal of Artificial Intelligence Research 61(1):863–905. Online publication date: 1-Jan-2018
