research-article

SDDSMOTE: Synthetic Minority Oversampling Technique based on Sample Density Distribution for Enhanced Classification on Imbalanced Microarray Data

Authors:

Haotian Yang

ICCDA '22: Proceedings of the 2022 6th International Conference on Compute and Data Analysis

Pages 35 - 42

https://doi.org/10.1145/3523089.3523096

Published: 23 May 2022

Abstract

Microarray gene expression data are typically distributed unevenly across classes, which poses a challenge for machine learning-based cancer diagnosis. In addition, microarray data consist of few samples and a very large number of genes, which gives rise to the curse of dimensionality. To improve the performance of learning models on imbalanced microarray data, we propose a novel preprocessing method based on SMOTE, named SDDSMOTE (Synthetic Minority Oversampling Technique based on Sample Density Distribution). The preprocessing consists of two steps. First, a feature selection technique eliminates irrelevant genes to obtain a reduced gene data set. Second, SDDSMOTE is used to rebalance the reduced data. We performed comprehensive experiments comparing SDDSMOTE with other state-of-the-art oversampling algorithms, using two classifiers (Support Vector Machine and Logistic Regression) on 8 publicly available microarray expression data sets. The experimental results show that SDDSMOTE outperforms the compared algorithms on several evaluation criteria, including Accuracy, F-score, G-mean, and AUC.
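
The two-step preprocessing described in the abstract can be illustrated with a minimal sketch. It assumes a simple class-mean-difference filter for gene selection and plain SMOTE interpolation for rebalancing; it does not reproduce SDDSMOTE's density-based sampling, which the abstract does not specify. The function names (`select_top_features`, `smote`, `g_mean`) and the toy data are hypothetical and chosen only for illustration.

```python
import math
import random


def select_top_features(X, y, k):
    """Keep the k features whose class means differ most (a simple filter score)."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(abs(sum(pos) / len(pos) - sum(neg) / len(neg)))
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


def smote(minority, n_new, k=2, seed=0):
    """Plain SMOTE (not SDDSMOTE): interpolate between a minority sample
    and one of its k nearest minority neighbours.

    Needs at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        partner = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> partner
        synthetic.append([b + gap * (q - b) for b, q in zip(base, partner)])
    return synthetic


def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity and specificity for labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    return math.sqrt((tp / n_pos) * (tn / n_neg))


# Hypothetical toy data: 6 majority samples (label 0), 3 minority samples (label 1).
# Column 0 is uninformative; columns 1 and 2 separate the classes.
X = [
    [0.5, 1.0, 0.2], [0.4, 1.2, 0.1], [0.6, 0.9, 0.3],
    [0.5, 1.1, 0.2], [0.4, 1.0, 0.1], [0.6, 0.8, 0.3],
    [0.5, 3.0, 2.0], [0.4, 3.2, 2.2], [0.6, 2.8, 1.8],
]
y = [0] * 6 + [1] * 3

# Step 1: feature selection on the original data.
kept = select_top_features(X, y, k=2)
X_reduced = [[row[j] for j in kept] for row in X]

# Step 2: oversample the minority class until both classes have equal size.
minority = [row for row, label in zip(X_reduced, y) if label == 1]
synthetic = smote(minority, n_new=y.count(0) - len(minority))
X_balanced = X_reduced + synthetic
y_balanced = y + [1] * len(synthetic)

print(kept, y_balanced.count(0), y_balanced.count(1))
```

A classifier that predicts only the majority class has a G-mean of 0 whatever its accuracy, which is one reason G-mean is a common criterion for imbalanced data.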

Information & Contributors

Information

Published In

ICCDA '22: Proceedings of the 2022 6th International Conference on Compute and Data Analysis

February 2022

131 pages

ISBN:9781450395472

DOI:10.1145/3523089

Copyright © 2022 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

ICCDA 2022

ICCDA 2022: 2022 The 6th International Conference on Compute and Data Analysis

February 25 - 27, 2022

Shanghai, China
