Abstract
Examining the genome sequences of the SARS-CoV-2 virus, which causes the respiratory disease known as coronavirus disease 2019 (COVID-19), plays an important role in understanding this virus, its main characteristics, and its functionalities. This paper investigates the use of alignment-free (AF) sequence analysis and sequential pattern mining (SPM) to analyze SARS-CoV-2 genome sequences and to learn interesting information about them. AF methods are used to find (dis)similarity among the genome sequences of SARS-CoV-2 using various distance measures, to compare the performance of these measures, and to construct phylogenetic trees. SPM algorithms are used to discover frequent amino acid patterns and their relationships with each other, and to predict amino acid(s) using various sequence-based prediction models. Finally, an algorithm is proposed to analyze mutations in genome sequences. The algorithm finds the locations of changed amino acid(s) in the genome sequences and computes the mutation rate. The results show that both AF and SPM methods can be used to discover interesting information and patterns in SARS-CoV-2 genome sequences for examining the variations and evolution among strains.
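The mutation analysis step described in the abstract (locating changed amino acids and computing a mutation rate) can be illustrated with a minimal Python sketch. This is not the authors' Algorithm 1. The function names `find_mutations` and `mutation_rate` are hypothetical, the two sequences are toy strings rather than real SARS-CoV-2 data, the sketch assumes both sequences have the same length, and it assumes the mutation rate is the fraction of positions that differ.

```python
from typing import List, Tuple


def find_mutations(reference: str, variant: str) -> List[Tuple[int, str, str]]:
    """Return (1-based position, reference residue, variant residue) per mismatch.

    Assumption: both amino-acid sequences have equal length, so a simple
    position-by-position comparison is meaningful.
    """
    if len(reference) != len(variant):
        raise ValueError("this sketch assumes sequences of equal length")
    return [
        (i + 1, r, v)
        for i, (r, v) in enumerate(zip(reference, variant))
        if r != v
    ]


def mutation_rate(reference: str, variant: str) -> float:
    """Fraction of positions that differ (an assumed definition of the rate)."""
    if not reference:
        raise ValueError("reference sequence must be non-empty")
    return len(find_mutations(reference, variant)) / len(reference)


# Toy amino-acid strings for illustration only, not real SARS-CoV-2 data.
reference = "MFVFLVLLPL"
variant = "MFVFFVLLPS"

print(find_mutations(reference, variant))  # changed positions and residues
print(mutation_rate(reference, variant))   # share of changed positions
```

Sequences of different lengths would first need to be aligned or otherwise brought to a common coordinate system, which this sketch deliberately leaves out.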
Data Availability
The code for Algorithm 1 in Python and the genome sequences used in the experiments are available at: https://github.com/saqibdola/SPM-MA4GSA/tree/master/MAP.
Acknowledgements
This work was supported by the Natural Science Foundation of Guangdong Province (2023A1515011667) and the Basic Research Foundations of Shenzhen (JCYJ20210324093609026).
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
About this article
Cite this article
Nawaz, M.S., Fournier-Viger, P., Aslam, M. et al. Using alignment-free and pattern mining methods for SARS-CoV-2 genome analysis. Appl Intell 53, 21920–21943 (2023). https://doi.org/10.1007/s10489-023-04618-0
DOI: https://doi.org/10.1007/s10489-023-04618-0