Comparing DNA Sequence Collections by Direct Comparison of Compressed Text Indexes

Anthony J. Cox, Tobias Jakobi, Giovanna Rosone & Ole B. Schulz-Trieglaff

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 7534)

Included in the following conference series:

International Workshop on Algorithms in Bioinformatics

Abstract

Popular sequence alignment tools such as BWA convert a reference genome to an indexing data structure based on the Burrows-Wheeler Transform (BWT), from which matches to individual query sequences can be rapidly determined. However, the utility of also indexing the query sequences themselves remains relatively unexplored.
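As a rough illustration of the kind of BWT-based index referred to above, the following Python sketch builds a toy index by naive suffix sorting and counts exact matches of a query by backward search. It is not BWA's implementation; the function names `bwt_index` and `count_matches` are hypothetical, and the construction is only practical for very short texts.

```python
from collections import Counter


def bwt_index(text):
    """Build a toy BWT-based index of text (a '$' terminator is appended)."""
    text += "$"
    # Naive suffix sorting: acceptable for a sketch, far too slow for a genome.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # The BWT is the character preceding each sorted suffix (cyclically).
    bwt = "".join(text[i - 1] for i in sa)
    # first[c] = number of characters in text that are smaller than c.
    counts = Counter(text)
    first, total = {}, 0
    for c in sorted(counts):
        first[c] = total
        total += counts[c]
    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return bwt, first, occ


def count_matches(index, query):
    """Count exact occurrences of query by backward search over the BWT."""
    bwt, first, occ = index
    lo, hi = 0, len(bwt)  # half-open interval of matching suffixes
    for c in reversed(query):
        if c not in first:
            return 0
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo


index = bwt_index("ACGTACGTTACG")
print(count_matches(index, "ACG"))  # 3
print(count_matches(index, "GGG"))  # 0
```

Each query character narrows an interval of sorted suffixes, so the cost of a lookup depends on the query length rather than on the size of the indexed text.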

Here we show that an all-against-all comparison of two sequence collections can be computed from the BWT of each collection with the BWTs held entirely in external memory, i.e. on disk and not in RAM. As an application of this technique, we show that BWTs of transcriptomic and genomic reads can be compared to obtain reference-free predictions of splice junctions that have high overlap with results from more standard reference-based methods.
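The idea of comparing two collections directly through their indexes can be sketched in miniature. The Python example below is an in-memory toy under stated assumptions, not the external-memory algorithm of the paper: it builds a small BWT-based index for each collection and extends strings in both indexes in lockstep, reporting the k-mers that occur in the first collection but not in the second. The names `build_index`, `extend` and `only_in_first`, and the example reads, are hypothetical.

```python
from collections import Counter

ALPHABET = "ACGT"


def build_index(reads):
    """Toy BWT-based index of a read collection (reads joined by '$')."""
    text = "".join(r + "$" for r in reads)
    # Naive suffix sorting, suitable only for tiny inputs.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)
    counts = Counter(text)
    first, total = {}, 0
    for c in sorted(counts):
        first[c] = total  # number of characters smaller than c
        total += counts[c]
    occ = {c: [0] for c in counts}  # occ[c][i] = count of c in bwt[:i]
    for ch in bwt:
        for c in occ:
            occ[c].append(occ[c][-1] + (ch == c))
    return len(bwt), first, occ


def extend(index, interval, c):
    """Prepend c to the string whose half-open suffix interval is given."""
    _, first, occ = index
    lo, hi = interval
    if c not in first or lo >= hi:
        return (0, 0)  # empty interval: the extended string is absent
    return (first[c] + occ[c][lo], first[c] + occ[c][hi])


def only_in_first(index_a, index_b, k):
    """Return the k-mers present in collection A and absent from B."""
    found = []
    stack = [("", (0, index_a[0]), (0, index_b[0]))]
    while stack:
        s, ia, ib = stack.pop()
        if ia[0] >= ia[1]:
            continue  # string absent from A, so no extension can occur in A
        if len(s) == k:
            if ib[0] >= ib[1]:
                found.append(s)
            continue
        for c in ALPHABET:
            stack.append((c + s, extend(index_a, ia, c), extend(index_b, ib, c)))
    return sorted(found)


# Hypothetical data: one "transcript" read spanning a junction between
# two regions that are covered separately by the "genomic" reads.
rna = build_index(["AAACCC"])
dna = build_index(["AAAGTT", "TAGCCC"])
print(only_in_first(rna, dna, 3))  # ['AAC', 'ACC']
```

In this toy setting the k-mers unique to the transcript reads are exactly those that span the join between the two genomic regions, which loosely mirrors the reference-free splice junction application described above.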

Code to construct and compare the BWT of large genomic data sets is available at http://beetl.github.com/BEETL/ as part of the BEETL library.

References

Abouelhoda, M.I., Kurtz, S., Ohlebusch, E.: Replacing suffix trees with enhanced suffix arrays. Journal of Discrete Algorithms 2(1), 53–86 (2004)
Adjeroh, D., Bell, T., Mukherjee, A.: The Burrows-Wheeler Transform: Data Compression, Suffix Arrays, and Pattern Matching, 1st edn. Springer Publishing Company, Incorporated (2008)
Bauer, M.J., Cox, A.J., Rosone, G.: Lightweight BWT Construction for Very Large String Collections. In: Giancarlo, R., Manzini, G. (eds.) CPM 2011. LNCS, vol. 6661, pp. 219–231. Springer, Heidelberg (2011)
Bauer, M.J., Cox, A.J., Rosone, G.: Lightweight algorithms for constructing and inverting the BWT of string collections. Theoretical Computer Science (2012) (online February 10, 2012)
Burrows, M., Wheeler, D.J.: A block sorting data compression algorithm. Technical report, DIGITAL System Research Center (1994)
Cox, A.J., Bauer, M.J., Jakobi, T., Rosone, G.: Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform. Bioinformatics 28(11), 1415–1419 (2012)
Ferragina, P., Manzini, G.: Opportunistic data structures with applications. In: Proceedings of the 41st Annual Symposium on Foundations of Computer Science, pp. 390–398. IEEE Computer Society, Washington, DC (2000)
Langmead, B., Trapnell, C., Pop, M., Salzberg, S.: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology 10(3), R25 (2009)
Li, H., Durbin, R.: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25(14), 1754–1760 (2009)
Li, R., Yu, C., Li, Y., Lam, T.W., Yiu, S.M., Kristiansen, K., Wang, J.: SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics 25(15), 1966–1967 (2009)
Mantaci, S., Restivo, A., Rosone, G., Sciortino, M.: An extension of the Burrows-Wheeler Transform. Theor. Comput. Sci. 387(3), 298–312 (2007)
Murchison, E.P., Schulz-Trieglaff, O.B., Ning, Z., Alexandrov, L.B., Bauer, M.J., Fu, B., Hims, M., Ding, Z., Ivakhno, S., Stewart, C., Ng, B.L., Wong, W., Aken, B., White, S., Alsop, A., Becq, J., Bignell, G.R., Cheetham, R.K., Cheng, W., Connor, T.R., Cox, A.J., Feng, Z., Gu, Y., Grocock, R.J., Harris, S.R., Khrebtukova, I., Kingsbury, Z., Kowarsky, M., Kreiss, A., Luo, S., Marshall, J., McBride, D.J., Murray, L., Pearse, A., Raine, K., Rasolonjatovo, I., Shaw, R., Tedder, P., Tregidgo, C., Vilella, A.J., Wedge, D.C., Woods, G.M., Gormley, N., Humphray, S., Schroth, G., Smith, G., Hall, K., Searle, S.M.J., Carter, N.P., Papenfuss, A.T., Futreal, P.A., Campbell, P.J., Yang, F., Bentley, D.R., Evers, D.J., Stratton, M.R.: Genome sequencing and analysis of the Tasmanian devil and its transmissible cancer. Cell 148(4), 780–791 (2012)
Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Comput. Surv. 39(1) (2007)
Quinlan, A.R., Hall, I.M.: BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26(6), 841–842 (2010)
Simpson, J.T., Durbin, R.: Efficient construction of an assembly string graph using the FM-index. Bioinformatics 26(12), i367–i373 (2010)
Simpson, J.T., Durbin, R.: Efficient de novo assembly of large genomes using compressed data structures. Genome Research 22(3), 549–556 (2011)
Trapnell, C., Pachter, L., Salzberg, S.L.: TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 25(9), 1105–1111 (2009)

Author information

Authors and Affiliations

Illumina Cambridge Ltd., United Kingdom
Anthony J. Cox & Ole B. Schulz-Trieglaff
Computational Genomics, CeBiTec, Bielefeld University, Germany
Tobias Jakobi
Dipartimento di Matematica e Informatica, University of Palermo, Italy
Giovanna Rosone

Editor information

Editors and Affiliations

Department of Computer Science, Brown University, P.O. Box 1910, 02912, Providence, RI, USA
Ben Raphael
Department of Computer Science and Engineering, University of South Carolina, 301 Main Street, 29208, Columbia, SC, USA
Jijun Tang

About this paper

Cite this paper

Cox, A.J., Jakobi, T., Rosone, G., Schulz-Trieglaff, O.B. (2012). Comparing DNA Sequence Collections by Direct Comparison of Compressed Text Indexes. In: Raphael, B., Tang, J. (eds) Algorithms in Bioinformatics. WABI 2012. Lecture Notes in Computer Science, vol 7534. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33122-0_17

DOI: https://doi.org/10.1007/978-3-642-33122-0_17
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-33121-3
Online ISBN: 978-3-642-33122-0
eBook Packages: Computer Science, Computer Science (R0)
