A Preliminary Study of SMOTE on Imbalanced Big Datasets When Dealing with Sparse and Dense High Dimensionality

  • Conference paper
Pattern Recognition (MCPR 2022)

Abstract

The interest in exploiting big datasets with machine learning has led to the adaptation of classic strategies to this new paradigm, which is defined by volume, velocity, and variety. Because data quality is a determining factor in building a classifier, it has also been necessary to adapt existing data preprocessing techniques or develop new ones. One of the challenges of greatest interest is the class imbalance problem, in which the class of interest (the minority class) has fewer examples than another class, called the majority class. One of the most widely recognized techniques for alleviating this problem is SMOTE, which generates synthetic instances of the minority class through a process based on the nearest neighbor rule and the Euclidean distance. Various articles have shown that SMOTE is not appropriate for high-dimensional datasets. However, in big data, high-dimensional datasets often contain many zeros. Therefore, the objective of this article is to analyze the behavior of SMOTE-BD on imbalanced big datasets with sparse and dense high dimensionality. Experimental results using two classifiers and big datasets with different dimensionalities suggest that sparsity is a more predominant factor than dimensionality in the behavior of SMOTE-BD.
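
The generation step mentioned in the abstract (nearest neighbor rule plus Euclidean distance) can be illustrated with a minimal sketch of classic SMOTE. This is not the SMOTE-BD implementation studied in the chapter, which targets distributed big data processing. The function names `euclidean` and `smote`, the parameters `n_synthetic`, `k`, and `seed`, and the toy data are assumptions made only for illustration.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def smote(minority, n_synthetic, k=5, seed=0):
    """Generate synthetic minority instances (illustrative sketch only).

    minority    -- list of minority-class feature vectors (lists of floats)
    n_synthetic -- number of synthetic instances to create (hypothetical name)
    k           -- number of nearest minority neighbors to consider
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority instances")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        # Pick a random minority instance as the base point.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Find its k nearest minority neighbors by Euclidean distance.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: euclidean(base, minority[j]))
        neighbor = minority[rng.choice(others[:k])]
        # Interpolate at a random position on the segment base -> neighbor.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


if __name__ == "__main__":
    # Toy minority class; the third feature is zero everywhere (sparse column).
    toy = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
    for point in smote(toy, n_synthetic=3, k=2, seed=1):
        print(point)
```

In this sketch, a feature that is zero in both the base instance and the chosen neighbor stays zero in the synthetic instance, which gives some intuition for why sparsity could matter to the behavior of SMOTE-style interpolation.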

Notes

  1. Although this medium-high dimensional dataset may not represent a big data problem in terms of volume, we believe it can be treated as such since it may not be processed and analyzed on standard hardware.

Corresponding author

Correspondence to A. Bolívar.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Bolívar, A., García, V., Florencia, R., Alejo, R., Rivera, G., Sánchez-Solís, J.P. (2022). A Preliminary Study of SMOTE on Imbalanced Big Datasets When Dealing with Sparse and Dense High Dimensionality. In: Vergara-Villegas, O.O., Cruz-Sánchez, V.G., Sossa-Azuela, J.H., Carrasco-Ochoa, J.A., Martínez-Trinidad, J.F., Olvera-López, J.A. (eds) Pattern Recognition. MCPR 2022. Lecture Notes in Computer Science, vol 13264. Springer, Cham. https://doi.org/10.1007/978-3-031-07750-0_5

  • DOI: https://doi.org/10.1007/978-3-031-07750-0_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-07749-4

  • Online ISBN: 978-3-031-07750-0

  • eBook Packages: Computer Science, Computer Science (R0)
