A Preliminary Study of SMOTE on Imbalanced Big Datasets When Dealing with Sparse and Dense High Dimensionality

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13264)

Included in the following conference series:

Mexican Conference on Pattern Recognition

Abstract

The interest in exploiting big datasets with machine learning has led to the adaptation of classic strategies to this new paradigm, which is characterized by volume, speed, and variety. Because data quality is a determining factor in constructing a classifier, it has also been necessary to adapt existing data preprocessing techniques or develop new ones. One of the challenges of greatest interest is the class imbalance problem, where the class of interest has fewer examples than another class, called the majority class. One of the most recognized techniques for alleviating this problem is SMOTE, which generates synthetic instances of the minority class through a process that uses the nearest neighbor rule and the Euclidean distance. Various articles have shown that SMOTE is not appropriate for datasets with high dimensionality. However, in big data, high-dimensional datasets often contain many zeros. Therefore, in this article, our objective is to analyze the behavior of SMOTE-BD on imbalanced big datasets with sparse and dense high dimensionality. Experimental results using two classifiers and big datasets with different dimensionalities suggest that sparsity is a more dominant factor than dimensionality in the behavior of SMOTE-BD.
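
The generation step mentioned in the abstract can be illustrated with a minimal sketch of classic SMOTE interpolation. It is not the SMOTE-BD implementation studied in the chapter; the function name `smote_sketch` and its parameters (`n_synthetic`, `k`, `seed`) are assumptions made for illustration, and the neighbor search is a brute-force scan that would not scale to big datasets.

```python
import math
import random


def smote_sketch(minority, n_synthetic, k=5, seed=0):
    """Illustrative SMOTE-style oversampling (hypothetical helper).

    minority    : list of equal-length numeric feature vectors
    n_synthetic : number of synthetic instances to generate
    k           : number of nearest minority neighbors considered
    """
    rng = random.Random(seed)
    n = len(minority)
    if n < 2:
        raise ValueError("need at least two minority instances")
    k = min(k, n - 1)  # cannot use more neighbors than exist

    synthetic = []
    for _ in range(n_synthetic):
        # Pick a minority instance at random as the base point.
        i = rng.randrange(n)
        base = minority[i]

        # Nearest neighbor rule: rank the other minority instances
        # by Euclidean distance to the base point and keep the k closest.
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbor = minority[rng.choice(others[:k])]

        # Interpolate at a random position on the segment base -> neighbor.
        gap = rng.random()
        synthetic.append([b + gap * (q - b) for b, q in zip(base, neighbor)])
    return synthetic


if __name__ == "__main__":
    points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for row in smote_sketch(points, n_synthetic=3, k=2):
        print(row)
```

Each synthetic value is a convex combination of two existing values, so a feature that is zero in both parent instances remains zero in the synthetic one.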

Notes

1. Although this medium-high dimensional dataset may not represent a big data problem in terms of volume, we believe it can be treated as such since it may not be processed and analyzed on standard hardware.

Author information

Authors and Affiliations

Doctorado en Ciencias de la Ingeniería Avanzada, Instituto de Ingeniería y Tecnología, Universidad Autónoma de Ciudad Juárez, Chihuahua, Mexico
A. Bolívar
División Multidisciplinaria en Ciudad Universitaria, Universidad Autónoma de Ciudad Juárez, Chihuahua, Mexico
V. García, R. Florencia, G. Rivera & J. Patricia Sánchez-Solís
Tecnológico Nacional de México, IT Toluca, Metepec, Mexico
R. Alejo

Corresponding author

Correspondence to A. Bolívar.

Editor information

Editors and Affiliations

Universidad Autónoma de Ciudad Juárez, Ciudad Juárez, Mexico
Osslan Osiris Vergara-Villegas
Universidad Autónoma de Ciudad Juárez, Ciudad Juárez, Mexico
Vianey Guadalupe Cruz-Sánchez
Instituto Politécnico Nacional, Mexico City, Mexico
Juan Humberto Sossa-Azuela
Instituto Nacional de Astrofísica, Óptica y Electrónica, Puebla, Mexico
Jesús Ariel Carrasco-Ochoa
Instituto Nacional de Astrofísica, Óptica y Electrónica, Puebla, Mexico
José Francisco Martínez-Trinidad
Autonomous University of Puebla, Puebla, Mexico
José Arturo Olvera-López

About this paper

Cite this paper

Bolívar, A., García, V., Florencia, R., Alejo, R., Rivera, G., Sánchez-Solís, J.P. (2022). A Preliminary Study of SMOTE on Imbalanced Big Datasets When Dealing with Sparse and Dense High Dimensionality. In: Vergara-Villegas, O.O., Cruz-Sánchez, V.G., Sossa-Azuela, J.H., Carrasco-Ochoa, J.A., Martínez-Trinidad, J.F., Olvera-López, J.A. (eds) Pattern Recognition. MCPR 2022. Lecture Notes in Computer Science, vol 13264. Springer, Cham. https://doi.org/10.1007/978-3-031-07750-0_5

DOI: https://doi.org/10.1007/978-3-031-07750-0_5
Published: 11 June 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-07749-4
Online ISBN: 978-3-031-07750-0
eBook Packages: Computer Science, Computer Science (R0)
