Abstract
In this article, a new method is employed to maximize the performance of the Least Absolute Shrinkage and Selection Operator (Lasso) feature selection model. Specifically, we present a novel regularization scheme for the Lasso that automatically finds the best regularization parameter, which guarantees the best performance of the Lasso in DNA microarray data classification. In our experiments, four well-known, publicly available microarray datasets (breast cancer, Diffuse Large B-cell Lymphoma (DLBCL), leukemia, and prostate cancer) were used to evaluate the proposed method. The experimental results demonstrated that the proposed Lasso significantly outperforms other widely used feature selection methods in selecting the features that lead to the best performance, robustness, and stability in microarray data classification. Accordingly, the proposed method is a powerful algorithm for selecting the most informative features, which can be used for cancer diagnosis from gene expression profiles.
Data availability
We used public datasets for our investigation. For easy access, the data have been uploaded to GitHub and can be accessed via the following link.
Code availability
Our code was developed in MATLAB and can be accessed via the following link.
Author information
Contributions
Mehrdad Vatankhah, as the first author, was responsible for implementing the computer code and supporting algorithms, and for writing and preparing the initial draft.
Mohammadreza Momenzadeh, as the corresponding author, was responsible for project administration, writing, reviewing and editing, data curation, and conceptualization.
Ethics declarations
Ethics approval
Not applicable.
Consent to participate
We confirm that this paper is the result of original work done by Mehrdad Vatankhah and Mohammadreza Momenzadeh, and that there are no other authors or co-workers.
Consent for publication
We confirm that this paper contains the results of the original work done by us and has never been submitted to other journals or conferences.
Conflicts of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
About this article
Cite this article
Vatankhah, M., Momenzadeh, M. Self-regularized Lasso for selection of most informative features in microarray cancer classification. Multimed Tools Appl 83, 5955–5970 (2024). https://doi.org/10.1007/s11042-023-15207-1
DOI: https://doi.org/10.1007/s11042-023-15207-1