More Web Proxy on the site http://driver.im/

research-article

Efficient Algorithms for Sequence Analysis with Entropic Profiles

Authors:

Mattia Ornamenti,

Simone Spangaro,

Simona E. Rombo,

Laxmi ParidaAuthors Info & Claims

IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), Volume 15, Issue 1

Pages 117 - 128

https://doi.org/10.1109/TCBB.2016.2620143

Published: 01 January 2018 Publication History

Abstract

Entropy, being closely related to repetitiveness and compressibility, is a widely used information-related measure to assess the degree of predictability of a sequence. Entropic profiles are based on information theory principles, and can be used to study the under-/over-representation of subwords, by also providing information about the scale of conserved DNA regions. Here, we focus on the algorithmic aspects related to entropic profiles. In particular, we propose linear time algorithms for their computation that rely on suffix-based data structures, more specifically on the truncated suffix tree TST and on the enhanced suffix array ESA. We performed an extensive experimental campaign showing that our algorithms, beside being faster, make it possible the analysis of longer sequences, even for high degrees of resolution, than state of the art algorithms.

References

[1]

M. I. Abouelhoda, S. Kurtz, and E. Ohlebusch, "Replacing suffix trees with enhanced suffix arrays," J. Discrete Algorithms, vol. 2, pp. 53-86, 2004.

[2]

A. Allali and M.-F. Sagot, "The at most k deep factor tree," 2004, https://hal.archives-ouvertes.fr/hal-00627813/document

[3]

A. Apostolico, "The myriad virtues of subword trees," Combinatorial Algorithms Words, vol. 12, pp. 85-96, 1985.

[4]

A. Apostolico, M. E. Bock, and S. Lonardi, "Monotony of surprise and large-scale quest for unusual words," J. Comput. Biol., vol. 10, no. 3/4, pp. 283-311, 2003.

[5]

A. Apostolico, C. Guerra, and C. Pizzi, "Alignment free sequence similarity with bounded hamming distance," in Proc. IEEE Data Compression Conf., 2014, pp. 183-192.

[6]

A. Apostolico, and L. Parida, "Incremental paradigms of motif discovery," J. Comput. Biol., vol. 11, no. 1, pp. 15-25, 2004.

[7]

A. Apostolico and C. Pizzi, "Monotone acoring of pattern with mismatche,". in Proc. 4th Int. Workshop Algorithms Bio inf., 2004, pp 87-98.

[8]

A. Apostolico, C. Pizzi, and E. Ukkonen, "Efficient algorithms for the discovery of gapped factors," Algorithms Mol. Biol., vol. 6, 2011, Art. no. 5.

[9]

S. Bortoluzzi, A. Coppe, A. Bisognin, C. Pizzi, and G. A. Danieli, "A multistep bioinformatic approach detects putative regulatory elements in gene promoters," BMC Bioinf., vol. 6, 2005, Art. no. 121.

[10]

M. Comin and M. Antonello, "Fast Entropic profiler: An information theoretic approach for the discovery of patterns in genomes," IEEE/ACM Trans. Comput. Biol. Bioinf., vol. 11, no. 3, pp. 500-509, May/Jun. 2014.

[11]

M. Comin and M. Antonello, "On the comparison of regulatory sequences with multiple resolution Entropic Profiles," BMC Bioinf., vol. 17, 2016, Art. no. 130.

[12]

T. Davidsen, E. A. Rodland, K. Lagesen, E. Seeberg, T. Rognes, and T. Tonjum, "Biased distribution of DNA uptake sequences towards genome maintenance genes," Nucleic Acids Res., vol. 32, no. 3, pp. 1050-1058, 2004.

[13]

M. Federico and N. Pisanti, "Suffix tree characterization of maximal motifs in biological sequences," Theoretical Comput. Sci., vol. 410, no. 43, pp. 4391-4401, 2009.

[14]

F. Fernandes, A. T. Freitas, J. S. Almeida, and S. Vinga, "Entropic profiler detection of conservation in genomes using information theory," BMC Res. Notes, vol. 2, 2009, Art. no. 72.

[15]

H. Herzel, W. Ebeling, and A. O. Schmitt, "Entropies of biosequences: The role of repeats," Phys. Rev., vol. 50, pp. 5061-5071, Dec. 1994.

[16]

R. Giancarlo, S. E. Rombo, and F. Utro, "Compressive biological sequence analysis and archival in the era of high-throughput sequencing technologies," Briefings Bioinf., vol. 15, no. 3, pp. 390- 406, 2014.

[17]

R. Grossi, A. Pietracaprina, N. Pisanti, G. Pucci, E. Upfal, and F. Vandin, "MADMX: A strategy for maximal dense motif extraction," J. Comput. Biol., vol. 18, no. 4, pp. 535-545, 2011.

[18]

R. Grossi, N. Pisanti, M. Crochemore, and M.-F. Sagot, "Bases of motifs for generating repeated patterns with wild cards," IEEE/ ACM Trans. Comput. Biol. Bioinf., vol. 2, no. 1, pp. 40-50, Jan.-Mar. 2005.

[19]

D. Gusfield, "Computer science and computational biology," in Algorithms on Strings, Trees and Sequences, Cambridge, U.K.: Cambridge Univ. Press, 1997.

[20]

B. Haubold, "Alignment-free phylogenetics and population genetics," Briefings Bioinf., vol 15, no. 3, pp. 407-418, 2014.

[21]

S. Horwege, et al., "Spaced words and KMACS: Fast alignment-free sequence comparison based on inexact word matches," Nucleic Acids Res., vol. 42, pp. W7-W11, 2014.

[22]

C.-A. Leimeister and B. Morgenstern, "KMACS: The k-mismatch average common substring approach to alignment-free sequence comparison," Bioinf., vol. 30, pp. 2000-2008, 2014.

[23]

U. Mamber and G. Myers, "Suffix arrays: A new method for online string searches," in Proc. ACM-SIAM Symp. Discrete Algorithms, 1990, pp. 319-327.

[24]

J. C. Na, A. Apostolico, C. S. Iliopulos, and K. Park, "Truncated suffix trees and their application to data compression," Theoretical Comput. Sci., vol. 304, no. 1/3, pp. 87-101, 2003.

[25]

L. Palopoli, S. E. Rombo, and G. Terracina, "Flexible pattern discovery with (extended) disjunctive logic programming," in Proc. Int. Symp. Methodologies Intell. Syst., 2005, pp. 504-513.

[26]

L. Parida, Pattern Discovery in Bioinformatics: Theory & Algorithms, Boca Raton, FL, USA: Chapman & Hall/CRC, 2007.

[27]

L. Parida, C. Pizzi, and S. E. Rombo, "Entropic profiles, maximal motifs and the discovery of significant repetitions in genomic sequences," in Proc. Workshop Algorithms Bioinf., 2014, pp. 148-160.

[28]

L. Parida, C. Pizzi, and S. E. Rombo, "Irredundant tandem motifs," Theoretical Comput. Sci., vol. 525, 2014, pp. 89-102.

[29]

Phylip package. (2016). [Online]. Available: http://evolution.genetics.washington.edu/phylip.html

[30]

L. Pinello, G. Lo Bosco, and G.-C. Yuan, "Applications of alignment-free methods in epigenomics," Briefings Bioinf., vol. 15, no. 3, pp. 419-430, 2014.

[31]

N. Pisanti, A. M. Carvalho, L. Marsan, and M.-F. Sagot, "RISOTTO: Fast Extraction of motifs with mismatches," in Proc. Latin Amer. Theoretical Inf. Symp., 2006, pp. 757-768.

[32]

C. Pizzi, "MissMax: Alignment-free sequences comparison with mismatches through filtering and heuristics," Algorithms Mol. Biol., vol. 11, 2016, Art. no. 6.

[33]

C. Pizzi, P. Rastas, and E. Ukkonen, "Finding significant matches of position weight matrices in linear time," IEEE/ACM Trans. Comput. Biol. Bioinf., vol. 8, no. 1, pp. 69-79, Jan./Feb. 2011.

[34]

C. Pizzi, S. Bortoluzzi, A. Bisognin, A. Coppe, and G. A. Danieli, "Detecting seeded motifs in DNA sequences," Nucleic Acids Res., vol. 33, 2005, Art. no. e135.

[35]

S. E. Rombo, "Extracting string motif bases for quorum higher than two," Theoretical Comput. Sci., vol. 460, pp. 94-103, 2012.

[36]

I. Schwende and T. D. Pham, "Pattern recognition and probabilistic measures in alignment-free sequence analysis," Briefings Bioinf., vol. 15, no. 3, pp. 354-368, 2014.

[37]

G. R. Smith, "How RecBCD enzyme and Chi promote DNA break repair and recombination: a molecular biologist's view," Microbiol Mol. Biol. Rev., vol. 76, no. 2, pp. 217-28, 2012.

[38]

K. Song, J. Ren, G. Reinert, M. Deng, M. S. Waterman, and F. Sun, "New developments of alignment-free sequence comparison: Measures, statistics and next-generation sequencing," Briefings Bioinf., vol. 15, no. 3, pp. 343-353, 2014.

[39]

S. Sourice, V. Biaudet, M. El Karoui, S. D. Ehrlich, and A. Gruss, "Identification of the Chi site of Haemophilus influenzae as several sequences related to the Escherichia coli Chi site," Mol. Microbiol., vol. 27, pp. 1021-1029, 1998.

[40]

S. V. Thankachan, S. P. Chockalingam, Y. Liu, A. Apostolico, and S. Aluru, "ALFRED: A practical method for alignment-free distance computation," J. Comput. Biol., vol. 23, no. 6, pp. 452-460, 2016.

[41]

I. Ulitsky, D. Burstein, T. Tuller, and B. Chor, "The average common substring approach to phylogenomic reconstruction," J. Comput. Biol., vol. 13, no. 2, pp. 336-50, 2006.

[42]

E. Ukkonen, "On-line construction of suffix trees," Algorithmica, vol. 14, no. 3, pp. 249-260, 1995.

[43]

S. Vinga, "Editorial: Alignment-free methods in computational biology," Briefings Bioinf., vol. 15, no. 3, pp. 341-342, 2014.

[44]

S. Vinga, "Information theory applications for biological sequence analysis," Briefings Bioinf., vol. 15, no. 3, pp. 376-389, 2014.

[45]

S. Vinga and J. S. Almeida, "Local Renyi Entropic profiles of DNA sequences," BMC Bioinf., vol. 8, 2007, Art. no. 393.

Efficient Algorithms for Sequence Analysis with Entropic Profiles
1. Applied computing
  1. Life and medical sciences
2. Computing methodologies

Recommendations

Algorithms for sequence analysis via mutagenesis

Motivation: Despite many successes of conventional DNA sequencing methods, some DNAs remain difficult or impossible to sequence. Unsequenceable regions occur in the genomes of many biologically important organisms, including the human genome. Such ...
Fast Alignment-free Comparison for Regulatory Sequences using Multiple Resolution Entropic Profiles
BIOSTEC 2015: Proceedings of the International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 3

Enhancers are stretches of DNA (100-1000 bp) that play a major role in development gene expression, evolution and disease. It has been recently shown that in high-level eukaryotes enhancers rarely work alone, instead they collaborate by forming clusters ...
Identification of a Specific Base Sequence of Pathogenic E. Coli through a Genomic Analysis
DTMBIO '14: Proceedings of the ACM 8th International Workshop on Data and Text Mining in Bioinformatics

E. coli sequence type 131 (ST131) is one of pathogens that causes resistant infections. Comparative genome analyses allow interpretations of the virulence factors of pathogens. Thus, in this study, we analysis the genomic differences between the ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image IEEE/ACM Transactions on Computational Biology and Bioinformatics

IEEE/ACM Transactions on Computational Biology and Bioinformatics Volume 15, Issue 1

January 2018

352 pages

ISSN:1545-5963

Issue’s Table of Contents

Copyright © 2018.

Publisher

IEEE Computer Society Press

Washington, DC, United States

Publication History

Published: 01 January 2018

Published in TCBB Volume 15, Issue 1

Qualifiers

Research-article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

0
Total Citations
34
Total Downloads

Downloads (Last 12 months)2
Downloads (Last 6 weeks)0

Reflects downloads up to 05 Mar 2025

Other Metrics

View Author Metrics

Citations

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Issue’s Table of Contents