Abstract
We study the problem of finding supermaximal repeats in large DNA sequences. For this, we propose an algorithm called SMR which uses an auxiliary index structure (POL), which is derived from and replaces the suffix tree index STTD64 [1]. The results of our numerous experiments using the 24 human chromosomes data indicate that SMR outperforms the solution provided as part of the Vmatch [2] software tool. In searching for supermaximal repeats of size at least 10 bases, SMR is twice faster than Vmatch; for a minimum length of 25 bases, SMR is 7 times faster; and for repeats of length at least 200, SMR is about 9 times faster. We also study the cost of POL in terms of time and space requirements.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
Similar content being viewed by others
References
Halachev, M., Shiri, N., Thamildurai, A.: Efficient and scalable indexing techniques for biological sequence data. In: Hochreiter, S., Wagner, R. (eds.) BIRD 2007. LNCS (LNBI), vol. 4414, pp. 464–479. Springer, Heidelberg (2007)
Vmatch: large scale sequence analysis software, http://www.vmtach.de
Gusfield, D.: Algorithms on strings, trees and sequences: computer science and computational biology. Cambridge University Press, New York (1997)
Korf, B.R.: Human Genetics: A Problem-Based Approach. Blackwell, Boston (2000)
Watson, J., Hopkins, N., Roberts, J., Steitz, J., Weiner, A.: Molecular Biology of the Gene, 6th edn. Benjamin-Cummings, Menlo Park (2007)
Kurtz, S.: Reducing the space requirement of suffix trees. Software Practice and Experience 29(13), 1149–1171 (1999)
Grossi, R., Vitter, J.S.: Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching. SIAM Journal on Computing 35(2), 378–407 (2005)
Ferragina, P., Manzini, G.: Opportunistic data structures with applications. In: 41st IEEE Symposium on Foundations of Computer Science, pp. 390–398 (2000)
Valimaki, N., Gerlach, W., Dixit, K., Makinen, V.: Compressed suffix tree - a basis for genome-scale sequence analysis. Bioinformatics 23(5), 629–630 (2007)
Hon, W.-K., Lam, T.W., Sung, W.-K., Tse, W.-L., Wong, C.-K., Yiu, S.-M.: Practical Aspects of Compressed Suffix Arrays and FM-index in Searching DNA Sequences. In: 6th Workshop on Algorithm Engineering and Experiments, pp. 31–38 (2004)
Kurtz, S., Schleiermacher, C.: REPuter: Fast Computation of Maximal Repeats in Complete Genomes. Bioinformatics 15, 426–427 (1999)
RepeatMatch, http://mummer.sourceforge.net/manual/#repeat
RepeatMasker, http://repeatmasker.org/
Bedell, J.A., Korf, I., Gish, W.: MaskerAid: a Performance Enhancement to RepeatMasker. Bioinformatics 16(11), 1040–1041 (2000)
Gotoh, O.: An Improved Algorithm for Matching Biological Sequences. Journal of Molecular Biology 162(3), 705–708 (1982)
Abouelhoda, M.I., Kurtz, S., Ohlebusch, E.: Replacing suffix tree with enhances suffix arrays. Journal of Discrete Algorithms 2(1), 53–86 (2004)
Miki, B.L., Neelin, J.M.: DNA repeat lengths of erythrocyte chromatins differing in content of histones H1 and H5. Nucleic Acids Res. 8(3), 529–542 (1980)
National Center for Biotechnology Information, http://www.ncbi.nim.nih.gov
Author information
Authors and Affiliations
Editor information
Rights and permissions
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Lian, C.N., Halachev, M., Shiri, N. (2008). Searching for Supermaximal Repeats in Large DNA Sequences. In: Elloumi, M., Küng, J., Linial, M., Murphy, R.F., Schneider, K., Toma, C. (eds) Bioinformatics Research and Development. BIRD 2008. Communications in Computer and Information Science, vol 13. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-70600-7_7
Download citation
DOI: https://doi.org/10.1007/978-3-540-70600-7_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-70598-7
Online ISBN: 978-3-540-70600-7
eBook Packages: Computer ScienceComputer Science (R0)