Counterfactual-based minority oversampling for imbalanced classification

Published: 01 June 2023

Abstract

A key challenge of oversampling in imbalanced classification is that the generation of new minority samples often ignores the majority classes, so most new minority samples are spread across the whole minority space. In view of this, we present a new oversampling framework based on counterfactual theory. Our framework introduces a counterfactual objective that leverages the rich inherent information of the majority classes and explicitly perturbs majority samples to generate new samples within the minority space. It can be shown analytically that the new minority samples satisfy the minimum inversion; therefore, most of them are located near the decision boundary. Empirical evaluation on six benchmark datasets shows that our approach clearly outperforms state-of-the-art methods.
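
As a rough illustration of perturbing majority samples until they cross into minority territory, the sketch below uses a simple linear classifier, for which the smallest L2 perturbation that flips a prediction has a closed form. This is not the paper's algorithm. The logistic-regression stand-in, the function names `fit_logistic` and `counterfactual_oversample`, the `margin` parameter, and the toy Gaussian data are all assumptions made for the example.

```python
import numpy as np


def fit_logistic(X, y, lr=0.1, steps=500):
    """Gradient-descent logistic regression (hypothetical stand-in classifier)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(minority)
        grad = p - y
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def counterfactual_oversample(X_maj, w, b, margin=0.05):
    """Shift majority samples just across the linear boundary w.x + b = 0.

    Moving along w is the shortest L2 path to the boundary, so each new
    sample is a minimal perturbation of a majority sample whose predicted
    label flips to the minority class (assumption: linear model only).
    """
    scores = X_maj @ w + b
    mask = scores < 0  # keep samples currently predicted as majority
    # Step size chosen so the new score equals +margin (just past the boundary).
    step = (margin - scores[mask]) / (w @ w)
    return X_maj[mask] + step[:, None] * w


# Toy imbalanced data (assumed for illustration): 200 majority, 20 minority.
rng = np.random.default_rng(0)
X_maj = rng.normal(loc=[-2.0, 0.0], scale=1.0, size=(200, 2))
X_min = rng.normal(loc=[2.0, 0.0], scale=1.0, size=(20, 2))
X = np.vstack([X_maj, X_min])
y = np.concatenate([np.zeros(len(X_maj)), np.ones(len(X_min))])

w, b = fit_logistic(X, y)
X_new = counterfactual_oversample(X_maj, w, b)  # synthetic minority samples
X_min_aug = np.vstack([X_min, X_new])
```

Every synthetic point in `X_new` sits at the same small distance beyond the decision boundary, which mirrors the abstract's claim that minimally perturbed samples concentrate near the boundary. A nonlinear classifier would need an iterative search in place of the closed-form step.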

Published In

Engineering Applications of Artificial Intelligence, Volume 122, Issue C
Jun 2023
1605 pages

Publisher

Pergamon Press, Inc., United States

Author Tags

1. Counterfactual
2. Imbalanced classification
3. Decision boundary
4. Oversampling

Qualifiers

• Research-article
