Information Theoretic Approaches to Whole Genome Phylogenies

David Burstein²⁵,
Igor Ulitsky²⁵,
Tamir Tuller²⁵ &
…
Benny Chor²⁵

Part of the book series: Lecture Notes in Computer Science ((LNBI,volume 3500))

Included in the following conference series:

Annual International Conference on Research in Computational Molecular Biology

1196 Accesses
4 Citations

Abstract

We describe a novel method for efficient reconstruction of phylogenetic trees, based on sequences of whole genomes or proteomes. The core of our method is a new measure of pairwise distances between sequences, whose lengths may greatly vary. This measure is based on information theoretic tools (Kullback-Leibler relative entropy). We present an algorithm for efficiently computing these distances. The algorithm uses suffix arrays to compute the distance of two ℓ long sequences in O(ℓ) time. It is fast enough to enable the construction of the phylogenomic tree for hundreds of species, and the phylogenomic forest for almost two thousand viruses. An initial analysis of the results exhibits a remarkable agreement with “acceptable phylogenetic truth”. To assess our approach, it was implemented together with a number of alternative approaches, including two that were previously published in the literature. Comparing their outcome to ours, using a “traditional” tree and a standard tree comparison method, our algorithm improved upon the “competition” by a substantial margin.

Research supported by ISF grant 418/00.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Subscribe and save

Springer+ Basic

£29.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Chapter: GBP 19.95; Price includes VAT (United Kingdom)

eBook: GBP 71.50; Price includes VAT (United Kingdom)

Softcover Book: GBP 89.99; Price includes VAT (United Kingdom)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Computational Approaches to Studying Molecular Phylogenetics

A sequence-based evolutionary distance method for Phylogenetic analysis of highly divergent proteins

Article Open access 20 November 2023

PQ, a new program for phylogeny reconstruction

Article Open access 12 October 2018

References

Ben-Dor, A., Chor, B., Graur, D., Ophir, R., Pelleg, D.: Constructing phylogenies from quartets: elucidation of eutherian superordinal relationships. Journal of computational Biology 5, 377–390 (1998)
Article Google Scholar
Bininda-Emonds, O.: Phylogenetic Supertrees: Combining Information to Reveal the Tree of Life. Kluwer series in Computational Biology (2004)
Google Scholar
Blusch, J.H., Seelmeir, S., Helm, K.V.: Molecular and enzymatic characterization of the porcine endogenous retrovirus protease. Virol 76(15), 7913–7917 (2002)
Article Google Scholar
Bourque, G., Pevzner, P.A.: Genome-scale evolution: Reconstructing gene order in ancestral species. Genome Research 12, 26–36 (2002)
Google Scholar
Chen, X., Kwong, S., Li, M.: A compression algorithm for dna sequences and its applications in genome comparison. In: RECOMB, pp. 107–117 (2000)
Google Scholar
Cover, T.M., Thomas, J.A.: Elements of Information Theory. J. Wiley and Sons, New York (1991)
Book MATH Google Scholar
Darwin, C.: On the origin of species, 1st edn., November 24 (1859)
Google Scholar
NCBI Taxonomy Database, http://www.ncbi.nlm.nih.gov/entrez/linkout/tutorial/taxtour.html
The Universal Virus Database, http://www.ncbi.nlm.nih.gov/ictvdb/ictvdb/
Downie, S., Palmer, J.: Use of chloroplast dna rearangements in reconstructing plant phylogeny. In: Soltis, P., Soltis, D., Doyle, J. (eds.) Plant Molecular Systematics, pp. 14–35. Chapman and Hall, Boca Raton (1992)
Google Scholar
Durbin, R., Eddy, S.R., Krogh, A., Mitchison, G.: Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press, Cambridge (1998)
Book MATH Google Scholar
Endres, D.M., Schindelin, J.E.: A new metric for probability distribution. IEEE Tran. Inf. Theory. 49(7) (2003)
Google Scholar
NCBI Genome Entrez, http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=genome
Farach, M., Noordewier, M., Savari, S., Shepp, L., Wyner, A., Ziv, J.: On the entropy of dna: Algorithms and measurements based on memory and rapid. In: Symposium on Discrete Algorithms (1994)
Google Scholar
Felsenstein, J.: Phylip (phylogeny inference package) version 3.5c. Distributed by the author. Department of Genetics, University of Washington, Seattle (1993)
Google Scholar
Hannenhalli, S., Pevzner, P.: Transforming cabbage into turnip (polynomial algorithm for sorting signed permutations by reversals). In: Proc. 27th Annual ACM Symposium on the Theory of Computing, pp. 178–189 (1995)
Google Scholar
Ribosomal Database Project II, http://rdp.cme.msu.edu/html/
Lempel, A., Ziv, J.: On the complexity of finite sequences. IEEE Trans. Inf. Theory 22, 75–88 (1976)
Article MATH MathSciNet Google Scholar
Lempel, A., Ziv, J.: A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory (1977)
Google Scholar
Li, M., Badger, J., Chen, X., Kwong, S., Kearney, P., Zhang, H.: An information-based sequence distance and its application to whole mitochondrial genome phylogeny. Bioinformatics 17(2), 149–154 (2001)
Article Google Scholar
Ma, B., Li, M., Zhang, L.: From gene trees to species trees. SIAM, Philadelphia (1998)
Google Scholar
Moret, B.M.E., Wang, L.S., Warnow, T., Wyman, S.K.: New approaches for reconstructing phylogenies from gene order data. bioinformatics 17, 165–173 (2001)
Google Scholar
Nelson, M.: LZW Data Compression (1989)
Google Scholar
Origins of viruses, http://www.mcb.uct.ac.za/tutorial/virorig.html
Otu, H.H., Sayood, K.: A new sequence distance measure for phylogenetic tree construction. Bioinformatics 19(16) (2003)
Google Scholar
Qi, J., Wang, B., Hao, B.: Whole proteome prokaryote phylogeny without sequence alignment: a k-string composition approach. J. Mol. Evol. 58(1), 1–11 (2004)
Article Google Scholar
Raul, P.T., Gordon, B., Oliver, E.: Quartet Supertrees. In: Bininda-Emonds, Olaf, R.P. (eds.) Phylogenetic Supertrees: Combining Information to Reveal the Tree of Life, pp. 173–191. Kluwer Academic, Dordrecht (2004) (in press)
Google Scholar
Sacco, M.A., Flannery, D.M.J., Howes, K., Venugopal, K.: Avian endogenous retrovirus eav-hp shares regions of identity with avian leukosis virus subgroup j and the avian retrotransposon art-ch. J. Virol 74(3), 1296–1306 (2000)
Article Google Scholar
Saitou, N., Nei, M.: The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4, 406–425 (1987)
Google Scholar
Sayood, K.: Introduction to data compression, 2nd edn.
Google Scholar
Stefan, B., Kärkkäinen, J.: Fast lightweight suffix array construction and checking, pp. 55–69 (2003)
Google Scholar
Stuart, G.W., Berry, M.W.: A comprehensive whole genome bacterial phylogeny using correlated peptide motive defined in a high dimensional vector space. Journal of Bioinformatics and Computational Biology 1(3), 475–493 (2003)
Article Google Scholar
Triyatnib, M., Ey, P.L., Tran, T., Mire, M.L., Qiao, M., Burrell, C.J., Jilbert, A.R.: Sequence comparison of an australian duck hepatitis b virus strain with other avian hepadnaviruses. Journal of General Virology 82, 373–378 (2001)
Google Scholar
Waters, E., Hohn, M.J., Ahel, I., Graham, D.E., Adams, M.D., Barnstead, M., Beeson, K.Y., Bibbs, L., Bolanos, R., Keller, M., Kretz, K., Lin, X., Mathur, E., Ni, J., Podar, M., Richardson, T., Sutton, G.G., Simon, M., Söll, D., Stetter, K.O., Short, J.M., Noordewier, M.: The genome of nanoarchaeum equitans: Insights into early archaeal evolution and derived parasitism. Proc. Natl. Acad. Sci. USA 100(22), 12984–12988 (2003)
Article Google Scholar
Wyner, A.D., Wyner, A.J.: An improved version of lempel-ziv algorithm. IEEE Tran. Inf. Theory (1995)
Google Scholar
Wyner, A.J.: String matching theorems and applications to data compression and statistics. Ph.d., Stanford (1993)
Google Scholar

Download references

Author information

Authors and Affiliations

School of Computer Science, Tel Aviv University,
David Burstein, Igor Ulitsky, Tamir Tuller & Benny Chor

Authors

David Burstein
View author publications
You can also search for this author in PubMed Google Scholar
Igor Ulitsky
View author publications
You can also search for this author in PubMed Google Scholar
Tamir Tuller
View author publications
You can also search for this author in PubMed Google Scholar
Benny Chor
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Editors and Affiliations

Human Genome Center, Institute of Medical Science, University of Tokyo, 4-6-1 Shirokanedai, 108-8639, Minato-ku, Tokyo, Japan
Satoru Miyano
Broad Institute of MIT and Harvard, 320 Charles Street, 02141-2023, Cambridge, MA, USA
Jill Mesirov
Computational Genomics Laboratory, Department of Bioengineering, Boston University, 44 Cummington St., 02215, Boston, MA, USA
Simon Kasif
Center for Molecular Biology and Computer Sciecne Department, Brown University, 115 Waterman St., 02912, Providence, RI, USA
Sorin Istrail
University of California, San Diego, USA
Pavel A. Pevzner
Department of Molecular and Computational Biology, University of Southern California, 1050 Childs Way, 90089-2910, Los Angeles, CA, USA
Michael Waterman

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Burstein, D., Ulitsky, I., Tuller, T., Chor, B. (2005). Information Theoretic Approaches to Whole Genome Phylogenies. In: Miyano, S., Mesirov, J., Kasif, S., Istrail, S., Pevzner, P.A., Waterman, M. (eds) Research in Computational Molecular Biology. RECOMB 2005. Lecture Notes in Computer Science(), vol 3500. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11415770_22

Download citation

DOI: https://doi.org/10.1007/11415770_22
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25866-7
Online ISBN: 978-3-540-31950-4
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Information Theoretic Approaches to Whole Genome Phylogenies

Abstract

Access this chapter

Subscribe and save

Buy Now

Preview

Similar content being viewed by others

Computational Approaches to Studying Molecular Phylogenetics

A sequence-based evolutionary distance method for Phylogenetic analysis of highly divergent proteins

PQ, a new program for phylogeny reconstruction

References

Author information

Authors and Affiliations

Editor information

Editors and Affiliations

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Publish with us

Subscribe and save

Buy Now

Navigation

Information Theoretic Approaches to Whole Genome Phylogenies

Abstract

Access this chapter

Subscribe and save

Buy Now

Preview

Similar content being viewed by others

Computational Approaches to Studying Molecular Phylogenetics

A sequence-based evolutionary distance method for Phylogenetic analysis of highly divergent proteins

PQ, a new program for phylogeny reconstruction

References

Author information

Authors and Affiliations

Editor information

Editors and Affiliations

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Share this paper

Publish with us

Search

Navigation