Abstract
We describe a novel method for efficient reconstruction of phylogenetic trees, based on sequences of whole genomes or proteomes. The core of our method is a new measure of pairwise distances between sequences, whose lengths may greatly vary. This measure is based on information theoretic tools (Kullback-Leibler relative entropy). We present an algorithm for efficiently computing these distances. The algorithm uses suffix arrays to compute the distance of two ℓ long sequences in O(ℓ) time. It is fast enough to enable the construction of the phylogenomic tree for hundreds of species, and the phylogenomic forest for almost two thousand viruses. An initial analysis of the results exhibits a remarkable agreement with “acceptable phylogenetic truth”. To assess our approach, it was implemented together with a number of alternative approaches, including two that were previously published in the literature. Comparing their outcome to ours, using a “traditional” tree and a standard tree comparison method, our algorithm improved upon the “competition” by a substantial margin.
Research supported by ISF grant 418/00.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
Similar content being viewed by others
References
Ben-Dor, A., Chor, B., Graur, D., Ophir, R., Pelleg, D.: Constructing phylogenies from quartets: elucidation of eutherian superordinal relationships. Journal of computational Biology 5, 377–390 (1998)
Bininda-Emonds, O.: Phylogenetic Supertrees: Combining Information to Reveal the Tree of Life. Kluwer series in Computational Biology (2004)
Blusch, J.H., Seelmeir, S., Helm, K.V.: Molecular and enzymatic characterization of the porcine endogenous retrovirus protease. Virol 76(15), 7913–7917 (2002)
Bourque, G., Pevzner, P.A.: Genome-scale evolution: Reconstructing gene order in ancestral species. Genome Research 12, 26–36 (2002)
Chen, X., Kwong, S., Li, M.: A compression algorithm for dna sequences and its applications in genome comparison. In: RECOMB, pp. 107–117 (2000)
Cover, T.M., Thomas, J.A.: Elements of Information Theory. J. Wiley and Sons, New York (1991)
Darwin, C.: On the origin of species, 1st edn., November 24 (1859)
NCBI Taxonomy Database, http://www.ncbi.nlm.nih.gov/entrez/linkout/tutorial/taxtour.html
The Universal Virus Database, http://www.ncbi.nlm.nih.gov/ictvdb/ictvdb/
Downie, S., Palmer, J.: Use of chloroplast dna rearangements in reconstructing plant phylogeny. In: Soltis, P., Soltis, D., Doyle, J. (eds.) Plant Molecular Systematics, pp. 14–35. Chapman and Hall, Boca Raton (1992)
Durbin, R., Eddy, S.R., Krogh, A., Mitchison, G.: Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press, Cambridge (1998)
Endres, D.M., Schindelin, J.E.: A new metric for probability distribution. IEEE Tran. Inf. Theory. 49(7) (2003)
NCBI Genome Entrez, http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=genome
Farach, M., Noordewier, M., Savari, S., Shepp, L., Wyner, A., Ziv, J.: On the entropy of dna: Algorithms and measurements based on memory and rapid. In: Symposium on Discrete Algorithms (1994)
Felsenstein, J.: Phylip (phylogeny inference package) version 3.5c. Distributed by the author. Department of Genetics, University of Washington, Seattle (1993)
Hannenhalli, S., Pevzner, P.: Transforming cabbage into turnip (polynomial algorithm for sorting signed permutations by reversals). In: Proc. 27th Annual ACM Symposium on the Theory of Computing, pp. 178–189 (1995)
Ribosomal Database Project II, http://rdp.cme.msu.edu/html/
Lempel, A., Ziv, J.: On the complexity of finite sequences. IEEE Trans. Inf. Theory 22, 75–88 (1976)
Lempel, A., Ziv, J.: A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory (1977)
Li, M., Badger, J., Chen, X., Kwong, S., Kearney, P., Zhang, H.: An information-based sequence distance and its application to whole mitochondrial genome phylogeny. Bioinformatics 17(2), 149–154 (2001)
Ma, B., Li, M., Zhang, L.: From gene trees to species trees. SIAM, Philadelphia (1998)
Moret, B.M.E., Wang, L.S., Warnow, T., Wyman, S.K.: New approaches for reconstructing phylogenies from gene order data. bioinformatics 17, 165–173 (2001)
Nelson, M.: LZW Data Compression (1989)
Origins of viruses, http://www.mcb.uct.ac.za/tutorial/virorig.html
Otu, H.H., Sayood, K.: A new sequence distance measure for phylogenetic tree construction. Bioinformatics 19(16) (2003)
Qi, J., Wang, B., Hao, B.: Whole proteome prokaryote phylogeny without sequence alignment: a k-string composition approach. J. Mol. Evol. 58(1), 1–11 (2004)
Raul, P.T., Gordon, B., Oliver, E.: Quartet Supertrees. In: Bininda-Emonds, Olaf, R.P. (eds.) Phylogenetic Supertrees: Combining Information to Reveal the Tree of Life, pp. 173–191. Kluwer Academic, Dordrecht (2004) (in press)
Sacco, M.A., Flannery, D.M.J., Howes, K., Venugopal, K.: Avian endogenous retrovirus eav-hp shares regions of identity with avian leukosis virus subgroup j and the avian retrotransposon art-ch. J. Virol 74(3), 1296–1306 (2000)
Saitou, N., Nei, M.: The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4, 406–425 (1987)
Sayood, K.: Introduction to data compression, 2nd edn.
Stefan, B., Kärkkäinen, J.: Fast lightweight suffix array construction and checking, pp. 55–69 (2003)
Stuart, G.W., Berry, M.W.: A comprehensive whole genome bacterial phylogeny using correlated peptide motive defined in a high dimensional vector space. Journal of Bioinformatics and Computational Biology 1(3), 475–493 (2003)
Triyatnib, M., Ey, P.L., Tran, T., Mire, M.L., Qiao, M., Burrell, C.J., Jilbert, A.R.: Sequence comparison of an australian duck hepatitis b virus strain with other avian hepadnaviruses. Journal of General Virology 82, 373–378 (2001)
Waters, E., Hohn, M.J., Ahel, I., Graham, D.E., Adams, M.D., Barnstead, M., Beeson, K.Y., Bibbs, L., Bolanos, R., Keller, M., Kretz, K., Lin, X., Mathur, E., Ni, J., Podar, M., Richardson, T., Sutton, G.G., Simon, M., Söll, D., Stetter, K.O., Short, J.M., Noordewier, M.: The genome of nanoarchaeum equitans: Insights into early archaeal evolution and derived parasitism. Proc. Natl. Acad. Sci. USA 100(22), 12984–12988 (2003)
Wyner, A.D., Wyner, A.J.: An improved version of lempel-ziv algorithm. IEEE Tran. Inf. Theory (1995)
Wyner, A.J.: String matching theorems and applications to data compression and statistics. Ph.d., Stanford (1993)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Burstein, D., Ulitsky, I., Tuller, T., Chor, B. (2005). Information Theoretic Approaches to Whole Genome Phylogenies. In: Miyano, S., Mesirov, J., Kasif, S., Istrail, S., Pevzner, P.A., Waterman, M. (eds) Research in Computational Molecular Biology. RECOMB 2005. Lecture Notes in Computer Science(), vol 3500. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11415770_22
Download citation
DOI: https://doi.org/10.1007/11415770_22
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25866-7
Online ISBN: 978-3-540-31950-4
eBook Packages: Computer ScienceComputer Science (R0)