[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
research-article

Efficient Algorithms for Sequence Analysis with Entropic Profiles

Published: 01 January 2018 Publication History

Abstract

Entropy, being closely related to repetitiveness and compressibility, is a widely used information-related measure to assess the degree of predictability of a sequence. Entropic profiles are based on information theory principles, and can be used to study the under-/over-representation of subwords, by also providing information about the scale of conserved DNA regions. Here, we focus on the algorithmic aspects related to entropic profiles. In particular, we propose linear time algorithms for their computation that rely on suffix-based data structures, more specifically on the truncated suffix tree TST and on the enhanced suffix array ESA. We performed an extensive experimental campaign showing that our algorithms, beside being faster, make it possible the analysis of longer sequences, even for high degrees of resolution, than state of the art algorithms.

References

[1]
M. I. Abouelhoda, S. Kurtz, and E. Ohlebusch, "Replacing suffix trees with enhanced suffix arrays," J. Discrete Algorithms, vol. 2, pp. 53-86, 2004.
[2]
A. Allali and M.-F. Sagot, "The at most k deep factor tree," 2004, https://hal.archives-ouvertes.fr/hal-00627813/document
[3]
A. Apostolico, "The myriad virtues of subword trees," Combinatorial Algorithms Words, vol. 12, pp. 85-96, 1985.
[4]
A. Apostolico, M. E. Bock, and S. Lonardi, "Monotony of surprise and large-scale quest for unusual words," J. Comput. Biol., vol. 10, no. 3/4, pp. 283-311, 2003.
[5]
A. Apostolico, C. Guerra, and C. Pizzi, "Alignment free sequence similarity with bounded hamming distance," in Proc. IEEE Data Compression Conf., 2014, pp. 183-192.
[6]
A. Apostolico, and L. Parida, "Incremental paradigms of motif discovery," J. Comput. Biol., vol. 11, no. 1, pp. 15-25, 2004.
[7]
A. Apostolico and C. Pizzi, "Monotone acoring of pattern with mismatche,". in Proc. 4th Int. Workshop Algorithms Bio inf., 2004, pp 87-98.
[8]
A. Apostolico, C. Pizzi, and E. Ukkonen, "Efficient algorithms for the discovery of gapped factors," Algorithms Mol. Biol., vol. 6, 2011, Art. no. 5.
[9]
S. Bortoluzzi, A. Coppe, A. Bisognin, C. Pizzi, and G. A. Danieli, "A multistep bioinformatic approach detects putative regulatory elements in gene promoters," BMC Bioinf., vol. 6, 2005, Art. no. 121.
[10]
M. Comin and M. Antonello, "Fast Entropic profiler: An information theoretic approach for the discovery of patterns in genomes," IEEE/ACM Trans. Comput. Biol. Bioinf., vol. 11, no. 3, pp. 500-509, May/Jun. 2014.
[11]
M. Comin and M. Antonello, "On the comparison of regulatory sequences with multiple resolution Entropic Profiles," BMC Bioinf., vol. 17, 2016, Art. no. 130.
[12]
T. Davidsen, E. A. Rodland, K. Lagesen, E. Seeberg, T. Rognes, and T. Tonjum, "Biased distribution of DNA uptake sequences towards genome maintenance genes," Nucleic Acids Res., vol. 32, no. 3, pp. 1050-1058, 2004.
[13]
M. Federico and N. Pisanti, "Suffix tree characterization of maximal motifs in biological sequences," Theoretical Comput. Sci., vol. 410, no. 43, pp. 4391-4401, 2009.
[14]
F. Fernandes, A. T. Freitas, J. S. Almeida, and S. Vinga, "Entropic profiler detection of conservation in genomes using information theory," BMC Res. Notes, vol. 2, 2009, Art. no. 72.
[15]
H. Herzel, W. Ebeling, and A. O. Schmitt, "Entropies of biosequences: The role of repeats," Phys. Rev., vol. 50, pp. 5061-5071, Dec. 1994.
[16]
R. Giancarlo, S. E. Rombo, and F. Utro, "Compressive biological sequence analysis and archival in the era of high-throughput sequencing technologies," Briefings Bioinf., vol. 15, no. 3, pp. 390- 406, 2014.
[17]
R. Grossi, A. Pietracaprina, N. Pisanti, G. Pucci, E. Upfal, and F. Vandin, "MADMX: A strategy for maximal dense motif extraction," J. Comput. Biol., vol. 18, no. 4, pp. 535-545, 2011.
[18]
R. Grossi, N. Pisanti, M. Crochemore, and M.-F. Sagot, "Bases of motifs for generating repeated patterns with wild cards," IEEE/ ACM Trans. Comput. Biol. Bioinf., vol. 2, no. 1, pp. 40-50, Jan.-Mar. 2005.
[19]
D. Gusfield, "Computer science and computational biology," in Algorithms on Strings, Trees and Sequences, Cambridge, U.K.: Cambridge Univ. Press, 1997.
[20]
B. Haubold, "Alignment-free phylogenetics and population genetics," Briefings Bioinf., vol 15, no. 3, pp. 407-418, 2014.
[21]
S. Horwege, et al., "Spaced words and KMACS: Fast alignment-free sequence comparison based on inexact word matches," Nucleic Acids Res., vol. 42, pp. W7-W11, 2014.
[22]
C.-A. Leimeister and B. Morgenstern, "KMACS: The k-mismatch average common substring approach to alignment-free sequence comparison," Bioinf., vol. 30, pp. 2000-2008, 2014.
[23]
U. Mamber and G. Myers, "Suffix arrays: A new method for online string searches," in Proc. ACM-SIAM Symp. Discrete Algorithms, 1990, pp. 319-327.
[24]
J. C. Na, A. Apostolico, C. S. Iliopulos, and K. Park, "Truncated suffix trees and their application to data compression," Theoretical Comput. Sci., vol. 304, no. 1/3, pp. 87-101, 2003.
[25]
L. Palopoli, S. E. Rombo, and G. Terracina, "Flexible pattern discovery with (extended) disjunctive logic programming," in Proc. Int. Symp. Methodologies Intell. Syst., 2005, pp. 504-513.
[26]
L. Parida, Pattern Discovery in Bioinformatics: Theory & Algorithms, Boca Raton, FL, USA: Chapman & Hall/CRC, 2007.
[27]
L. Parida, C. Pizzi, and S. E. Rombo, "Entropic profiles, maximal motifs and the discovery of significant repetitions in genomic sequences," in Proc. Workshop Algorithms Bioinf., 2014, pp. 148-160.
[28]
L. Parida, C. Pizzi, and S. E. Rombo, "Irredundant tandem motifs," Theoretical Comput. Sci., vol. 525, 2014, pp. 89-102.
[29]
Phylip package. (2016). [Online]. Available: http://evolution.genetics.washington.edu/phylip.html
[30]
L. Pinello, G. Lo Bosco, and G.-C. Yuan, "Applications of alignment-free methods in epigenomics," Briefings Bioinf., vol. 15, no. 3, pp. 419-430, 2014.
[31]
N. Pisanti, A. M. Carvalho, L. Marsan, and M.-F. Sagot, "RISOTTO: Fast Extraction of motifs with mismatches," in Proc. Latin Amer. Theoretical Inf. Symp., 2006, pp. 757-768.
[32]
C. Pizzi, "MissMax: Alignment-free sequences comparison with mismatches through filtering and heuristics," Algorithms Mol. Biol., vol. 11, 2016, Art. no. 6.
[33]
C. Pizzi, P. Rastas, and E. Ukkonen, "Finding significant matches of position weight matrices in linear time," IEEE/ACM Trans. Comput. Biol. Bioinf., vol. 8, no. 1, pp. 69-79, Jan./Feb. 2011.
[34]
C. Pizzi, S. Bortoluzzi, A. Bisognin, A. Coppe, and G. A. Danieli, "Detecting seeded motifs in DNA sequences," Nucleic Acids Res., vol. 33, 2005, Art. no. e135.
[35]
S. E. Rombo, "Extracting string motif bases for quorum higher than two," Theoretical Comput. Sci., vol. 460, pp. 94-103, 2012.
[36]
I. Schwende and T. D. Pham, "Pattern recognition and probabilistic measures in alignment-free sequence analysis," Briefings Bioinf., vol. 15, no. 3, pp. 354-368, 2014.
[37]
G. R. Smith, "How RecBCD enzyme and Chi promote DNA break repair and recombination: a molecular biologist's view," Microbiol Mol. Biol. Rev., vol. 76, no. 2, pp. 217-28, 2012.
[38]
K. Song, J. Ren, G. Reinert, M. Deng, M. S. Waterman, and F. Sun, "New developments of alignment-free sequence comparison: Measures, statistics and next-generation sequencing," Briefings Bioinf., vol. 15, no. 3, pp. 343-353, 2014.
[39]
S. Sourice, V. Biaudet, M. El Karoui, S. D. Ehrlich, and A. Gruss, "Identification of the Chi site of Haemophilus influenzae as several sequences related to the Escherichia coli Chi site," Mol. Microbiol., vol. 27, pp. 1021-1029, 1998.
[40]
S. V. Thankachan, S. P. Chockalingam, Y. Liu, A. Apostolico, and S. Aluru, "ALFRED: A practical method for alignment-free distance computation," J. Comput. Biol., vol. 23, no. 6, pp. 452-460, 2016.
[41]
I. Ulitsky, D. Burstein, T. Tuller, and B. Chor, "The average common substring approach to phylogenomic reconstruction," J. Comput. Biol., vol. 13, no. 2, pp. 336-50, 2006.
[42]
E. Ukkonen, "On-line construction of suffix trees," Algorithmica, vol. 14, no. 3, pp. 249-260, 1995.
[43]
S. Vinga, "Editorial: Alignment-free methods in computational biology," Briefings Bioinf., vol. 15, no. 3, pp. 341-342, 2014.
[44]
S. Vinga, "Information theory applications for biological sequence analysis," Briefings Bioinf., vol. 15, no. 3, pp. 376-389, 2014.
[45]
S. Vinga and J. S. Almeida, "Local Renyi Entropic profiles of DNA sequences," BMC Bioinf., vol. 8, 2007, Art. no. 393.
  1. Efficient Algorithms for Sequence Analysis with Entropic Profiles

      Recommendations

      Comments

      Please enable JavaScript to view thecomments powered by Disqus.

      Information & Contributors

      Information

      Published In

      cover image IEEE/ACM Transactions on Computational Biology and Bioinformatics
      IEEE/ACM Transactions on Computational Biology and Bioinformatics  Volume 15, Issue 1
      January 2018
      352 pages

      Publisher

      IEEE Computer Society Press

      Washington, DC, United States

      Publication History

      Published: 01 January 2018
      Published in TCBB Volume 15, Issue 1

      Qualifiers

      • Research-article

      Contributors

      Other Metrics

      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • 0
        Total Citations
      • 34
        Total Downloads
      • Downloads (Last 12 months)3
      • Downloads (Last 6 weeks)1
      Reflects downloads up to 01 Jan 2025

      Other Metrics

      Citations

      View Options

      Login options

      Full Access

      View options

      PDF

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader

      Media

      Figures

      Other

      Tables

      Share

      Share

      Share this Publication link

      Share on social media