Proc Natl Acad Sci U S A. 1996 Aug 20;93(17):9061–9066. doi: 10.1073/pnas.93.17.9061

Gene recognition via spliced sequence alignment.

M S Gelfand, A A Mironov, P A Pevzner

PMCID: PMC38595 PMID: 8799154

Abstract

Gene recognition is one of the most important problems in computational molecular biology. Previous attempts to solve this problem were based on statistics, and applications of combinatorial methods for gene recognition were almost unexplored. Recent advances in large-scale cDNA sequencing open a way toward a new approach to gene recognition that uses previously sequenced genes as a clue for recognition of newly sequenced genes. This paper describes a spliced alignment algorithm and software tool that explores all possible exon assemblies in polynomial time and finds the multiexon structure with the best fit to a related protein. Unlike other existing methods, the algorithm successfully recognizes genes even in the case of short exons or exons with unusual codon usage; we also report correct assemblies for genes with more than 10 exons. On a test sample of human genes with known mammalian relatives, the average correlation between the predicted and actual proteins was 99%. The algorithm correctly reconstructed 87% of genes and the rare discrepancies between the predicted and real exon-intron structures were caused either by short (less than 5 amino acids) initial/terminal exons or by alternative splicing. Moreover, the algorithm predicts human genes reasonably well when the homologous protein is nonvertebrate or even prokaryotic. The surprisingly good performance of the method was confirmed by extensive simulations: in particular, with target proteins at 160 accepted point mutations (PAM) (25% similarity), the correlation between the predicted and actual genes was still as high as 95%.
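The idea of exploring all exon assemblies in polynomial time can be illustrated with a small dynamic program. The following is a minimal sketch under stated assumptions, not the authors' implementation: candidate exons ("blocks") are supplied as input intervals, scoring is done at the nucleotide level with a fixed match/mismatch/gap scheme rather than against a related protein with a PAM matrix, and the function name `spliced_alignment` and its parameters are hypothetical. For each block, the first row of an ordinary global-alignment table is seeded with the best last row of any block that ends before it starts, so every chain of non-overlapping blocks is considered without being enumerated.

```python
def spliced_alignment(genome, blocks, target, match=1, mismatch=-1, gap=-1):
    """Best chain of non-overlapping blocks whose concatenation aligns to target.

    blocks: list of (start, end) half-open intervals into genome (assumed input).
    Returns (score, chain), where chain is a tuple of block indices.
    Simplified illustration: nucleotide scoring, linear gap penalty.
    """
    m = len(target)
    # Process blocks by end position so all possible predecessors come first.
    order = sorted(range(len(blocks)), key=lambda k: blocks[k][1])
    finished = {}            # block index -> final DP row of (score, chain)
    best = (gap * m, ())     # empty chain: every target symbol is gapped
    for k in order:
        start, end = blocks[k]
        # Row 0: best state before entering block k, per target prefix length.
        row = [(gap * j, ()) for j in range(m + 1)]
        for p, prev_row in finished.items():
            if blocks[p][1] <= start:          # block p ends before k starts
                row = [max(a, b) for a, b in zip(row, prev_row)]
        row = [(score, chain + (k,)) for score, chain in row]
        # Ordinary global alignment of the block against the target.
        for base in genome[start:end]:
            new = [(row[0][0] + gap, row[0][1])]
            for j in range(1, m + 1):
                sub = match if base == target[j - 1] else mismatch
                diag = (row[j - 1][0] + sub, row[j - 1][1])
                up = (row[j][0] + gap, row[j][1])
                left = (new[j - 1][0] + gap, new[j - 1][1])
                new.append(max(diag, up, left))
            row = new
        finished[k] = row
        best = max(best, row[m])
    return best


# Toy example: three true exons plus one decoy block inside the first intron.
genome = "ATGGCC" + "GTAAGTCCAG" + "TTGAAA" + "GTCCCAG" + "CCCTAA"
blocks = [(0, 6), (8, 14), (16, 22), (29, 35)]
target = "ATGGCCTTGAAACCCTAA"
print(spliced_alignment(genome, blocks, target))  # (18, (0, 2, 3))
```

With n blocks of total length L and a target of length m, this sketch does O(L·m) alignment work plus O(n²·m) work merging predecessor rows, so the cost stays polynomial even though the number of possible chains grows exponentially with n.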
