[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
research-article

Linear-time String Indexing and Analysis in Small Space

Published: 09 March 2020 Publication History

Abstract

The field of succinct data structures has flourished over the past 16 years. Starting from the compressed suffix array by Grossi and Vitter (STOC 2000) and the FM-index by Ferragina and Manzini (FOCS 2000), a number of generalizations and applications of string indexes based on the Burrows-Wheeler transform (BWT) have been developed, all taking an amount of space that is close to the input size in bits. In many large-scale applications, the construction of the index and its usage need to be considered as one unit of computation. For example, one can compare two genomes by building a common index for their concatenation and by detecting common substructures by querying the index. Efficient string indexing and analysis in small space lies also at the core of a number of primitives in the data-intensive field of high-throughput DNA sequencing.
We report the following advances in string indexing and analysis: We show that the BWT of a string T ∈ {1,…,σ}n can be built in deterministic O(n) time using just O(n log σ) bits of space, where σ ≤ n. Deterministic linear time is achieved by exploiting a new partial rank data structure that supports queries in constant time and that might have independent interest. Within the same time and space budget, we can build an index based on the BWT that allows one to enumerate all the internal nodes of the suffix tree of T. Many fundamental string analysis problems, such as maximal repeats, maximal unique matches, and string kernels, can be mapped to such enumeration and can thus be solved in deterministic O(n) time and in O(n log σ) bits of space from the input string by tailoring the enumeration algorithm to some problem-specific computations.
We also show how to build many of the existing indexes based on the BWT, such as the compressed suffix array, the compressed suffix tree, and the bidirectional BWT index, in randomized O(n) time and in O(n log σ) bits of space. The previously fastest construction algorithms for BWT, compressed suffix array and compressed suffix tree, which used O(n log σ) bits of space, took O(n log log σ) time for the first two structures and O(n log ϵn) time for the third, where ϵ is any positive constant smaller than one. Alternatively, the BWT could be previously built in linear time if one was willing to spend O(n log σ log log σ n) bits of space. Contrary to the state-of-the-art, our bidirectional BWT index supports every operation in constant time per element in its output.

References

[1]
Amihood Amir, Gad M. Landau, Moshe Lewenstein, and Dina Sokol. 2007. Dynamic text and static pattern matching. ACM Trans. Algor. 3, 2 (2007), 19.
[2]
Alberto Apostolico. 1985. The myriad virtues of subword trees. In Combinatorial Algorithms on Words (NATO ISI Series). Springer-Verlag, 85--96.
[3]
Brenda S. Baker. 1995. On finding duplication and near-duplication in large software systems. In Proceedings of the 2nd Working Conference on Reverse Engineering. IEEE, 86--95.
[4]
Carl Barton, Alice Heliou, Laurent Mouchard, and Solon P. Pissis. 2014. Linear-time computation of minimal absent words using suffix array. BMC Bioinf. 15, 1 (2014), 388.
[5]
Djamal Belazzougui. 2014. Linear time construction of compressed text indices in compact space. In Proceedings of the 46th ACM Symposium on Theory of Computing (STOC’14). ACM, 148--193.
[6]
Djamal Belazzougui, Paolo Boldi, Rasmus Pagh, and Sebastiano Vigna. 2009. Monotone minimal perfect hashing: Searching a sorted table with O(1) accesses. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA’09). ACM-SIAM, 785--794.
[7]
Djamal Belazzougui, Paolo Boldi, Rasmus Pagh, and Sebastiano Vigna. 2011. Theory and practice of monotone minimal perfect hashing. J. Exper. Algor. 16 (2011), 3--2.
[8]
Djamal Belazzougui and Fabio Cunial. 2014. Indexed matching statistics and shortest unique substrings. In Proceedings of the Symposium on String Processing and Information Retrieval (SPIRE’14). Springer, 179--190.
[9]
Djamal Belazzougui and Fabio Cunial. 2017. A framework for space-efficient string kernels. Algorithmica 79, 3, 857--883.
[10]
Djamal Belazzougui and Fabio Cunial. 2019. Fully-functional bidirectional Burrows-Wheeler indexes and infinite-order de Bruijn graphs. In Proceedings of the 30th Symposium on Combinatorial Pattern Matching (CPM’19). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.
[11]
Djamal Belazzougui and Gonzalo Navarro. 2014. Alphabet-independent compressed text indexing. ACM Trans. Algor. 10, 4 (2014), 23:1--23:19.
[12]
Djamal Belazzougui, Gonzalo Navarro, and Daniel Valenzuela. 2013. Improved compressed indexes for full-text document retrieval. J. Disc. Algor. 18 (2013), 3--13.
[13]
Timo Beller, Katharina Berger, and Enno Ohlebusch. 2012. Space-efficient computation of maximal and supermaximal repeats in genome sequences. In Proceedings of the 19th International Symposium on String Processing and Information Retrieval (SPIRE’12), Vol. 7608. Springer, 99--110.
[14]
Timo Beller, Simon Gog, Enno Ohlebusch, and Thomas Schnattinger. 2013. Computing the longest common prefix array based on the Burrows-Wheeler transform. J. Disc. Algor. 18 (2013), 22--31.
[15]
Andrej Brodnik. 1993. Computation of the least significant set bit. In Proceedings of the 2nd Electrotechnical and Computer Science Conference, Vol. 90.
[16]
Michael Burrows and David Wheeler. 1994. A Block Sorting Lossless Data Compression Algorithm. Technical Report 124. Digital Equipment Corporation.
[17]
David Clark. 1996. Compact Pat Trees. Ph.D. Dissertation. University of Waterloo, Canada.
[18]
Maxime Crochemore, Roberto Grossi, Juha Kärkkäinen, and Gad M. Landau. 2015. Computing the Burrows--Wheeler transform in place and in small space. J. Disc. Algor. 32 (2015), 44--52.
[19]
Maxime Crochemore, Filippo Mignosi, and Antonio Restivo. 1998. Automata and forbidden words. Inf. Proc. Lett. 67, 3 (1998), 111--117.
[20]
Peter Elias. 1974. Efficient storage and retrieval by content and address of static files. J. ACM 21, 2 (1974), 246--260.
[21]
Peter Elias. 1975. Universal codeword sets and representations of the integers. IEEE Trans. Inf. Theor. 21, 2 (1975), 194--203.
[22]
Robert M. Fano. 1971. On the Number of Bits Required to Implement an Associative Memory. Computer Structures Group, Project MAC, MIT, Cambridge, MA.
[23]
Martin Farach. 1997. Optimal suffix tree construction with large alphabets. In Proceedings of the Symposium on Foundations of Computer Science (FOCS’97). IEEE Computer Society, 137--143.
[24]
Paolo Ferragina and Giovanni Manzini. 2000. Opportunistic data structures with applications. In Proceedings of the 41st IEEE Symposium on Foundations of Computer Science (FOCS’00). IEEE, 390--398.
[25]
Paolo Ferragina and Giovanni Manzini. 2005. Indexing compressed texts. J. ACM 52, 4 (2005), 552--581.
[26]
Paolo Ferragina and Rossano Venturini. 2007. A simple storage scheme for strings achieving entropy bounds. Theor. Comput. Sci. 372 (2007), 115--121.
[27]
Johannes Fischer. 2010. Optimal succinctness for range minimum queries. In Proceedings of the Latin American Theoretical Informatics Symposium (LATIN’10). Springer, 158--169.
[28]
Johannes Fischer. 2011. Combined data structure for previous- and next-smaller-values. Theor. Comput. Sci. 412, 22 (2011), 2451--2456.
[29]
Alexander Golynski, J. Ian Munro, and S. Srinivasa Rao. 2006. Rank/select operations on large alphabets: A tool for text indexing. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA’06). ACM, 368--373.
[30]
Roberto Grossi, Ankur Gupta, and Jeffrey Scott Vitter. 2003. High-order entropy-compressed text indexes. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA’03). SIAM, 841--850.
[31]
Roberto Grossi, Alessio Orlandi, Rajeev Raman, and S. Srinivasa Rao. 2009. More haste, less waste: Lowering the redundancy in fully indexable dictionaries. arXiv preprint arXiv:0902.2648.
[32]
Roberto Grossi and Jeffrey Scott Vitter. 2005. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM J. Comput. 35, 2 (2005), 378--407.
[33]
D. Gusfield. 1997. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press, Cambridge, UK.
[34]
Torben Hagerup, Peter Bro Miltersen, and Rasmus Pagh. 2001. Deterministic dictionaries. J. Algor. 41, 1 (2001), 69--85.
[35]
T. Hagerup and T. Tholey. 2001. Efficient minimal perfect hashing in nearly minimal space. In Proceedings of the Symposium on Theoretical Aspects of Computer Science (STACS’01). Springer-Verlag, 317--326.
[36]
David Haussler. 1999. Convolution Kernels on Discrete Structures. Technical Report. UC Santa Cruz.
[37]
Charles A. R. Hoare. 1962. Quicksort. Comput. J. 5, 1 (1962), 10--16.
[38]
Wing-Kai Hon and Kunihiko Sadakane. 2002. Space-economical algorithms for finding maximal unique matches. In Proceedings of the 13th Symposium on Combinatorial Pattern Matching (CPM’02) (LNCS), Vol. 2373. Springer, 144--152.
[39]
Wing-Kai Hon, Kunihiko Sadakane, and Wing-Kin Sung. 2009. Breaking a time-and-space barrier in constructing full-text indices. SIAM J. Comput. 38, 6 (2009), 2162--2178.
[40]
Juha Kärkkäinen, Peter Sanders, and Stefan Burkhardt. 2006. Linear work suffix array construction. J. ACM 53, 6 (2006), 918--936.
[41]
Dominik Kempa. 2019. Optimal construction of compressed indexes for highly repetitive texts. In Proceedings of the 30th ACM-SIAM Symposium on Discrete Algorithms (SODA’19). SIAM, 1344--1357.
[42]
Dominik Kempa and Tomasz Kociumaka. 2019. String synchronizing sets: Sublinear-time BWT construction and optimal LCE data structure. In Proceedings of the 51st ACM SIGACT Symposium on Theory of Computing (STOC’19). ACM, 756--767.
[43]
Dong Kyue Kim, Jeong Seop Sim, Heejin Park, and Kunsoo Park. 2005. Constructing suffix arrays in linear time. J. Disc. Algor. 3, 2 (2005), 126--142.
[44]
Pang Ko and Srinivas Aluru. 2003. Space efficient linear time construction of suffix arrays. In Proceedings of the Symposium on Combinatorial Pattern Matching (CPM’03). Springer, 200--210.
[45]
M. Oguzhan Kulekci, Jeffrey Scott Vitter, and Bojian Xu. 2012. Efficient maximal repeat finding using the Burrows-Wheeler transform and wavelet tree. IEEE/ACM Trans. Comput. Biol. Bioinf. 9, 2 (2012), 421--429.
[46]
Tak Wah Lam, Ruiqiang Li, Alan Tam, Simon Wong, Edward Wu, and Siu-Ming Yiu. 2009. High throughput short read alignment via bi-directional BWT. In Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM’09). IEEE, 31--36.
[47]
Ruiqiang Li, Chang Yu, Yingrui Li, Tak Wah Lam, Siu-Ming Yiu, Karsten Kristiansen, and Jun Wang. 2009. SOAP2: An improved ultrafast tool for short read alignment. Bioinformatics 25, 15 (2009), 1966--1967.
[48]
Huma Lodhi, Craig Saunders, John Shawe-Taylor, Nello Cristianini, and Chris Watkins. 2002. Text classification using string kernels. J. Mach. Learn. Res. 2 (2002), 419--444.
[49]
Udi Manber and Eugene W. Myers. 1993. Suffix arrays: A new method for on-line string searches. SIAM J. Comput. 22, 5 (1993), 935--948.
[50]
Giovanni Manzini. 2001. An analysis of the Burrows-Wheeler transform. J. ACM 48, 3 (2001), 407--430.
[51]
I. Munro. 1996. Tables. In Proceedings of the 16th Conference on Foundations of Software Technology and Theoretical Computer Science (LNCS), Vol. 1180. Springer, 37--42.
[52]
J. Ian Munro, Gonzalo Navarro, and Yakov Nekrich. 2017. Space-efficient construction of compressed indexes in deterministic linear time. In Proceedings of the 28th ACM-SIAM Symposium on Discrete Algorithms (SODA’17). SIAM, 408--424.
[53]
J. Ian Munro, Rajeev Raman, Venkatesh Raman, and Satti Srinivasa Rao. 2003. Succinct representations of permutations. In Proceedings of the International Colloquium on Automata, Languages and Programming (ICALP’03). Springer, 345--356.
[54]
J. Ian Munro and Venkatesh Raman. 2001. Succinct representation of balanced parentheses and static trees. SIAM J. Comput. 31, 3 (2001), 762--776.
[55]
S. Muthukrishnan. 2002. Efficient algorithms for document retrieval problems. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA’02). ACM-SIAM, 657--666.
[56]
G. Navarro and V. Mäkinen. 2007. Compressed full-text indexes. Comput. Surv. 39, 1 (2007), Article 2.
[57]
Gonzalo Navarro and Yakov Nekrich. 2014. Optimal dynamic sequence representations. SIAM J. Comput. 43, 5 (2014), 1781--1806.
[58]
Gonzalo Navarro and Kunihiko Sadakane. 2014. Fully functional static and dynamic succinct trees. ACM Trans. Algor. 10, 3 (2014), 16:1--16:39.
[59]
Enno Ohlebusch, Timo Beller, and Mohamed Ibrahim Abouelhoda. 2014. Computing the Burrows-Wheeler transform of a string and its reverse in parallel. J. Disc. Algor. 25 (2014), 21--33.
[60]
Daisuke Okanohara and Kunihiko Sadakane. 2007. Practical entropy-compressed rank/select dictionary. In Proceedings of the Workshop on Algorithm Engineering and Experiments (ALENEX’07). SIAM, 60--70.
[61]
Daisuke Okanohara and Kunihiko Sadakane. 2009. A linear-time Burrows-Wheeler transform using induced sorting. In Proceedings of the Symposium on String Processing and Information Retrieval (SPIRE’09) (LNCS), Vol. 5721. Springer, 90--101.
[62]
Daisuke Okanohara and Jun’ichi Tsujii. 2009. Text categorization with all substring features. In Proceedings of theSIAM International Conference on Data Mining (SDM’09). SIAM, 838--846.
[63]
Rasmus Pagh. 2001. Low redundancy in static dictionaries with constant query time. SIAM J. Comput. 31, 2 (2001), 353--363.
[64]
Mihai Patrascu. 2008. Succincter. In Proceedings of the 49th IEEE Symposium on Foundations of Computer Science (FOCS’08). IEEE, 305--313.
[65]
Nicola Prezza and Giovanna Rosone. 2019. Space-efficient computation of the LCP array from the Burrows-Wheeler transform. In Proceedings of the 30th Symposium on Combinatorial Pattern Matching (CPM’19). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.
[66]
M. M. Robertson. 1968. A generalization of quasi-monotone sequences. Proc. Edinburgh Math. Soc. (Series 2) 16, 01 (1968), 37--41.
[67]
K. Sadakane. 2000. Compressed text databases with efficient query algorithms based on the compressed suffix array. In Proceedings of the 11th International Symposium on Algorithms and Computation (ISAAC’00) (LNCS), Vol. 1969. Springer, 410--421.
[68]
Kunihiko Sadakane. 2002. Succinct representations of LCP information and improvements in the compressed suffix arrays. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA’02). ACM-SIAM, 225--232.
[69]
Kunihiko Sadakane. 2007. Compressed suffix trees with full functionality. Theor. Comput. Syst. 41, 4 (2007), 589--607.
[70]
Kunihiko Sadakane. 2007. Succinct data structures for flexible text retrieval systems. J. Disc. Algor. 5, 1 (2007), 12--22.
[71]
Kunihiko Sadakane and Gonzalo Navarro. 2010. Fully-functional succinct trees. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA’10). ACM-SIAM, 134--149.
[72]
Thomas Schnattinger, Enno Ohlebusch, and Simon Gog. 2012. Bidirectional search in a string with wavelet trees and bidirectional matching statistics. Inf. Comput. 213 (2012), 13--22.
[73]
Peter Weiner. 1973. The file transmission problem. In Proceedings of the National Computer Conference and Exposition. ACM, 453--453.
[74]
P. Weiner. 1973. Linear pattern matching algorithms. In Proceedings of the 14th IEEE Symposium on Switching and Automata Theory. IEEE, 1--11.
[75]
Dan E. Willard. 1983. Log-logarithmic worst-case range queries are possible in space Θ(N). Inf. Proc. Lett. 17, 2 (1983), 81--84.

Cited By

View all

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on Algorithms
ACM Transactions on Algorithms  Volume 16, Issue 2
April 2020
372 pages
ISSN:1549-6325
EISSN:1549-6333
DOI:10.1145/3386689
Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 09 March 2020
Accepted: 01 November 2019
Revised: 01 October 2019
Received: 01 October 2018
Published in TALG Volume 16, Issue 2

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Burrows-Wheeler transform
  2. Compact data structures
  3. bidirectional BWT index
  4. compressed indexes
  5. compressed suffix array
  6. compressed suffix tree
  7. matching statistics
  8. maximal exact match
  9. maximal repeat
  10. maximal unique match
  11. minimal absent word
  12. monotone minimal perfect hash function
  13. partial rank query
  14. string kernel
  15. suffix array
  16. suffix tree
  17. suffix-link tree

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • Academy of Finland

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)32
  • Downloads (Last 6 weeks)1
Reflects downloads up to 13 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2024)Finding maximal exact matches in graphsAlgorithms for Molecular Biology10.1186/s13015-024-00255-519:1Online publication date: 11-Mar-2024
  • (2024)Elastic founder graphs improved and enhancedTheoretical Computer Science10.1016/j.tcs.2023.114269982(114269)Online publication date: Jan-2024
  • (2024)On Approximate Colored Path CountingLATIN 2024: Theoretical Informatics10.1007/978-3-031-55598-5_14(209-224)Online publication date: 18-Mar-2024
  • (2023)Text Indexing for Long Patterns: Anchors are All you NeedProceedings of the VLDB Endowment10.14778/3598581.359858616:9(2117-2131)Online publication date: 1-May-2023
  • (2023)Chaining of Maximal Exact Matches in GraphsString Processing and Information Retrieval10.1007/978-3-031-43980-3_29(353-366)Online publication date: 20-Sep-2023
  • (2022)Algorithms and Complexity on Indexing Founder GraphsAlgorithmica10.1007/s00453-022-01007-w85:6(1586-1623)Online publication date: 28-Jul-2022
  • (2021)Reversed Lempel–Ziv Factorization with Suffix TreesAlgorithms10.3390/a1406016114:6(161)Online publication date: 21-May-2021
  • (2021)Worst-Case Optimal Graph Joins in Almost No SpaceProceedings of the 2021 International Conference on Management of Data10.1145/3448016.3457256(102-114)Online publication date: 9-Jun-2021
  • (2021)Comparative genome analysis using sample-specific string detection in accurate long readsBioinformatics Advances10.1093/bioadv/vbab0051:1Online publication date: 31-May-2021
  • (2021)The Genome Atlas: Navigating a New Era of Reference GenomesTrends in Genetics10.1016/j.tig.2020.12.002Online publication date: Jan-2021
  • Show More Cited By

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

HTML Format

View this article in HTML Format.

HTML Format

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media