DOI: 10.1145/3523089.3523096
research-article

SDDSMOTE: Synthetic Minority Oversampling Technique based on Sample Density Distribution for Enhanced Classification on Imbalanced Microarray Data

Published: 23 May 2022

Abstract

Microarray gene expression data exhibit an unbalanced distribution of samples among classes, which poses a challenge to machine learning-based cancer diagnosis. In addition, microarray data consist of few samples and a very large number of genes, which leads to the curse of dimensionality. To improve the performance of learning models on imbalanced microarray data, we propose a novel preprocessing method based on SMOTE, named SDDSMOTE (Synthetic Minority Oversampling Technique based on Sample Density Distribution). The preprocessing consists of two steps. First, a feature selection technique eliminates irrelevant genes, yielding a reduced gene data set. Second, SDDSMOTE is used to rebalance the reduced data. We performed comprehensive experiments comparing SDDSMOTE with other state-of-the-art oversampling algorithms, using two classifiers (Support Vector Machine and Logistic Regression) on 8 publicly available microarray expression data sets. The experimental results show that SDDSMOTE outperforms the compared algorithms on several evaluation criteria, including Accuracy, F-score, G-mean, and AUC.
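
The two-step preprocessing described in the abstract (feature selection, then oversampling) can be illustrated with a minimal sketch. The abstract does not specify the feature selection method or how SDDSMOTE uses sample density, so the code below is an assumption-laden stand-in: it uses a simple class-mean-difference filter and the plain SMOTE interpolation step, not the SDDSMOTE algorithm itself. The function names `select_top_features`, `smote`, and `g_mean` and the toy data are hypothetical. G-mean is computed as the geometric mean of sensitivity and specificity.

```python
import math
import random


def select_top_features(X, y, k):
    """Keep the k features with the largest absolute difference in class means.

    Stand-in filter only: the abstract does not name the selection method.
    """
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]
    scores = []
    for j in range(len(X[0])):
        mean_pos = sum(r[j] for r in pos) / len(pos)
        mean_neg = sum(r[j] for r in neg) / len(neg)
        scores.append(abs(mean_pos - mean_neg))
    keep = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]
    keep.sort()  # preserve the original feature order
    return [[row[j] for j in keep] for row in X], keep


def smote(minority, n_new, k=3, rng=None):
    """Plain SMOTE step (not SDDSMOTE): interpolate between a minority sample
    and one of its k nearest minority neighbours. Needs at least 2 samples."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = [m for idx, m in enumerate(minority) if idx != i]
        others.sort(key=lambda m: math.dist(base, m))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity and specificity (label 1 = minority)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    return math.sqrt((tp / n_pos) * (tn / n_neg))


# Toy data (hypothetical): 6 majority samples (0), 3 minority samples (1), 4 "genes".
X = [
    [0.1, 5.0, 1.0, 0.2], [0.2, 5.2, 0.8, 0.1], [0.0, 4.8, 1.1, 0.3],
    [0.3, 5.1, 0.9, 0.2], [0.1, 4.9, 1.2, 0.0], [0.2, 5.0, 1.0, 0.1],
    [0.2, 1.0, 4.0, 0.1], [0.1, 1.2, 4.2, 0.2], [0.3, 0.8, 3.9, 0.3],
]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1]

# Step 1: reduce the feature set. Step 2: rebalance the reduced data.
X_reduced, kept = select_top_features(X, y, k=2)
minority = [row for row, label in zip(X_reduced, y) if label == 1]
n_needed = y.count(0) - y.count(1)
X_balanced = X_reduced + smote(minority, n_needed)
y_balanced = y + [1] * n_needed
```

In this sketch the classes end up with equal counts after step 2; a classifier such as an SVM or logistic regression would then be trained on `X_balanced` and scored with `g_mean` on held-out data.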

Published In

ICCDA '22: Proceedings of the 2022 6th International Conference on Compute and Data Analysis
February 2022
131 pages
ISBN: 9781450395472
DOI: 10.1145/3523089
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Classification
  2. Imbalanced learning
  3. Microarray gene data
  4. Oversampling
  5. SMOTE

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICCDA 2022

Cited By

  • (2024) SDGnE: A Synthetic Data Generation and Evaluation System for Rare Event Prediction. Database Systems for Advanced Applications, pp. 508-512. DOI: 10.1007/978-981-97-5575-2_49. Online publication date: 2-Sep-2024
  • (2024) Incremental SMOTE with Control Coefficient for Classifiers in Data Starved Medical Applications. Big Data Analytics and Knowledge Discovery, pp. 112-119. DOI: 10.1007/978-3-031-68323-7_9. Online publication date: 18-Aug-2024
  • (2023) Improving Classification Performance on Rare Events in Data Starved Medical Applications. 2023 IEEE International Symposium on Medical Measurements and Applications (MeMeA), pp. 1-6. DOI: 10.1109/MeMeA57477.2023.10314855. Online publication date: 14-Jun-2023
  • (2023) Classification of tumors based on distinguishing possibilistic biclusters. 2023 Intelligent Methods, Systems, and Applications (IMSA), pp. 333-338. DOI: 10.1109/IMSA58542.2023.10217488. Online publication date: 15-Jul-2023
  • (2022) Predicting Health Risks of Adult Asthmatics Susceptible to Indoor Air Quality Using Improved Logistic and Quantile Regression Models. Life 12:10 (1631). DOI: 10.3390/life12101631. Online publication date: 18-Oct-2022
