research-article

SVDD boundary and DPC clustering technique-based oversampling approach for handling imbalanced and overlapped data

Published: 25 December 2021

Abstract

Classifying imbalanced datasets remains an important problem in machine learning. Conventional supervised learning algorithms tend to be biased towards the majority class when trained on imbalanced datasets, and therefore classify the minority class poorly. The learning task becomes more challenging when the data also exhibit class overlap and within-class imbalance, which is often the case and has been shown to harm classification performance more than between-class imbalance does. In this paper, we propose a novel oversampling approach based on the SVDD (Support Vector Data Description) boundary and the DPC (Density Peaks Clustering) technique (SVDDDPCO) for handling imbalanced and overlapped data. The proposed approach first trains an SVDD model, with a greater penalty constant for the minority class than for the majority class, to generate the class boundary. It then identifies the majority instances, or the few minority instances, that the boundary misclassifies as potentially overlapped or noisy, and eliminates them. To address within-class imbalance, DPC is used to cluster the minority instances, because it accurately identifies sub-clusters of different sizes and densities; this makes it possible to combat between-class and within-class imbalance simultaneously, whatever their cause. The minority instances are assigned weights inversely proportional to their local densities and to their distances from the boundary, and the number of synthetic instances to generate for each identified sub-cluster is then determined adaptively from the sub-cluster's size and the weights of its elements. This strategy aims to generate more synthetic minority instances for borderline and sparser small sub-clusters, which are usually informative for the subsequent learning task. Finally, oversampling is performed within each sub-cluster, at the size determined for it, which counteracts within-class imbalance and also avoids generating noisy or overlapped synthetic instances. Extensive comparisons on various datasets show that the proposed method achieves statistically significant improvements in classification performance over state-of-the-art methods.
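The clustering and weighting steps described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract gives no formulas, so the Gaussian-kernel density, the weight `1 / (rho * boundary_dist)`, the allocation by mean cluster weight, and the function names `dpc_labels` and `oversample` are all assumptions made for illustration. The SVDD cleaning step is omitted, and the demo substitutes the distance to the nearest majority instance for the distance to the SVDD boundary.

```python
import numpy as np

def dpc_labels(X, d_c, n_clusters):
    """Density peaks clustering: returns cluster labels and local densities."""
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Gaussian-kernel local density; subtract 1 to drop each point's self-term.
    rho = np.exp(-(dist / d_c) ** 2).sum(axis=1) - 1.0
    order = np.argsort(-rho)              # indices by decreasing density
    delta = np.zeros(n)                   # distance to nearest denser point
    parent = np.full(n, -1)               # index of that nearest denser point
    delta[order[0]] = dist[order[0]].max()
    for rank in range(1, n):
        i, denser = order[rank], order[:rank]
        j = denser[np.argmin(dist[i, denser])]
        delta[i], parent[i] = dist[i, j], j
    gamma = rho * delta                   # high density AND far from denser points
    gamma[order[0]] = np.inf              # the global density peak is always a centre
    centers = np.argsort(-gamma)[:n_clusters]
    labels = np.full(n, -1)
    labels[centers] = np.arange(n_clusters)
    for i in order:                       # parents are labelled before their children
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels, rho

def oversample(X_min, labels, rho, boundary_dist, n_new, rng, eps=1e-12):
    """Weighted interpolation between minority points of the same sub-cluster."""
    # Assumed weight: larger for sparse points and for points near the boundary.
    w = 1.0 / ((rho + eps) * (boundary_dist + eps))
    clusters = np.unique(labels)
    # Assumed allocation rule: share of new points proportional to mean member weight.
    share = np.array([w[labels == c].mean() for c in clusters])
    share = share / share.sum()
    counts = np.floor(share * n_new).astype(int)
    counts[np.argmax(share)] += n_new - counts.sum()   # assign the rounding remainder
    synthetic = []
    for c, k in zip(clusters, counts):
        idx = np.flatnonzero(labels == c)
        seeds = rng.choice(idx, size=k, p=w[idx] / w[idx].sum())
        mates = rng.choice(idx, size=k)   # partner drawn from the same sub-cluster
        t = rng.random((k, 1))
        synthetic.append(X_min[seeds] + t * (X_min[mates] - X_min[seeds]))
    return np.vstack(synthetic)

rng = np.random.default_rng(0)
# Toy data: two minority sub-clusters of different size and density, one majority blob.
X_min = np.vstack([rng.normal([0, 0], 0.3, (30, 2)), rng.normal([4, 4], 0.6, (8, 2))])
X_maj = rng.normal([2, 2], 0.5, (200, 2))
labels, rho = dpc_labels(X_min, d_c=0.5, n_clusters=2)
# Stand-in for the SVDD boundary distance: distance to the nearest majority point.
boundary_dist = np.linalg.norm(X_min[:, None] - X_maj[None], axis=2).min(axis=1)
X_new = oversample(X_min, labels, rho, boundary_dist, n_new=162, rng=rng)
print(X_new.shape)
```

Because each synthetic point lies on a segment between two members of the same sub-cluster, no point is interpolated across the gap between sub-clusters, which is the property the abstract relies on to avoid generating overlapped instances.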

        Published In

        Knowledge-Based Systems, Volume 234, Issue C
        Dec 2021
        508 pages

        Publisher

        Elsevier Science Publishers B. V.

        Netherlands

        Author Tags

        1. Imbalanced datasets
        2. Classification
        3. Over-sampling
        4. Overlapping
        5. Within-class imbalance
