[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
research-article

The compressed permuterm index

Published: 08 December 2010 Publication History

Abstract

The Permuterm index [Garfield 1976] is a time-efficient and elegant solution to the string dictionary problem in which pattern queries may possibly include one wild-card symbol (called Tolerant Retrieval problem). Unfortunately the Permuterm index is space inefficient because it quadruples the dictionary size. In this article we propose the Compressed Permuterm Index which solves the Tolerant Retrieval problem in time proportional to the length of the searched pattern, and space close to the kth order empirical entropy of the indexed dictionary. We also design a dynamic version of this index that allows to efficiently manage insertion in, and deletion from, the dictionary of individual strings.
The result is based on a simple variant of the Burrows-Wheeler Transform, defined on a dictionary of strings of variable length, that allows to efficiently solve the Tolerant Retrieval problem via known (dynamic) compressed indexes [Navarro and Mäkinen 2007]. We will complement our theoretical study with a significant set of experiments that show that the Compressed Permuterm Index supports fast queries within a space occupancy that is close to the one achievable by compressing the string dictionary via gzip or bzip. This improves known approaches based on Front-Coding [Witten et al. 1999] by more than 50% in absolute space occupancy, still guaranteeing comparable query time.

References

[1]
Baeza-Yates, R., and Gonnet, G. 1996. Fast text searching for regular expressions or automaton searching on tries. J. ACM 43, 6, 915--936.
[2]
Baeza-Yates, R., and Ribeiro-Neto, B. 1999. Modern Information Retrieval. ACM/Addison-Wesley.
[3]
Barbay, J., He, M., Munro, J., and Rao, S. S. 2007. Succinct indexes for string, binary relations and multi-labeled trees. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA). 680--689.
[4]
Bentley, J. L., and Sedgewick, R. 1997. Fast algorithms for sorting and searching strings. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA). 360--369.
[5]
Boldi, P., Codenotti, B., Santini, M., and Vigna, S. 2004. Ubicrawler: A scalable fully distributed web crawler. Softw.: Pract. Exper. 34, 8, 711--726.
[6]
Burrows, M., and Wheeler, D. 1994. A block sorting lossless data compression algorithm. Tech. rep. No. 124, Digital Equipment Corporation.
[7]
Chan, H., Hon, W., Lam, T., and Sadakane, K. 2007. Compressed indexes for dynamic text collections. ACM Trans. Algor. 3, 2.
[8]
Ferragina, P., Giancarlo, R., Manzini, G., and Sciortino, M. 2005. Boosting textual compression in optimal linear time. J ACM 52, 688--713.
[9]
Ferragina, P., Koudas, N., Muthukrishnan, S., and Srivastava, D. 2003. Two-dimensional substring indexing. J Compu. Syst. Sci. 66, 4, 763--774.
[10]
Ferragina, P., and Manzini, G. 2005. Indexing compressed text. J. ACM 52, 4, 552--581.
[11]
Ferragina, P., Manzini, G., Mäkinen, V., and Navarro, G. 2007. Compressed representations of sequences and full-text indexes. ACM Trans. Algo. 3, 2.
[12]
Ferragina, P., and Navarro, G. 2006. Pizza&Chili corpus homepage. http://pizzachili.dcc. uchile.cl/ or http://pizzachili.di.unipi.it/.
[13]
Garfield, E. 1976. The permuterm subject index: An autobiographical review. J. ACM 27, 288--291.
[14]
González, R., and Navarro, G. 2008. Improved dynamic rank-select entropy-bound structures. In Proceedings of the Latin American Symposium on Theoretical Informatics (LATIN). Lecture Notes in Computer Science, vol. 4957, Springer, 374--386.
[15]
Mäkinen, V., and Navarro, G. 2008. Dynamic entropy-compressed sequences and full-text indexes. ACM Trans. Algo. 4, 3.
[16]
Manning, C. D., Raghavan, P., and Schülze, H. 2008. Introduction to Information Retrieval. Cambridge University Press.
[17]
Mantaci, S., Restivo, A., Rosone, G., and Sciortino, M. 2005. An extension of the burrows wheeler transform and applications to sequence comparison and data compression. In Proceedings of Symposium on Combinatorial Pattern Matching (CPM). Lecture Notes in Computer Science, vol. 3537. Springer, 178--189.
[18]
Manzini, G. 2001. An analysis of the Burrows-Wheeler transform. J. ACM 48, 3, 407--430.
[19]
Navarro, G., and Mäkinen, V. 2007. Compressed full text indexes. ACM Comput. Surv. 39, 1.
[20]
Puglisi, S., Smyth, W., and Turpin, A. 2007. A taxonomy of suffix array construction algorithms. ACM Comput. Surv. 39, 2.
[21]
Sadakane, K. 2007. Succinct data structures for flexible text retrieval systems. J. Discr. Algorithms 5, 1, 12--22.
[22]
Witten, I. H., Moffat, A., and Bell, T. C. 1999. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann Publishers.

Cited By

View all

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on Algorithms
ACM Transactions on Algorithms  Volume 7, Issue 1
November 2010
282 pages
ISSN:1549-6325
EISSN:1549-6333
DOI:10.1145/1868237
Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 08 December 2010
Accepted: 01 August 2008
Revised: 01 July 2008
Received: 01 December 2007
Published in TALG Volume 7, Issue 1

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Burrows-Wheeler transform
  2. compressed index
  3. indexing data structure
  4. permuterm
  5. string dictionary

Qualifiers

  • Research-article
  • Research
  • Refereed

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)25
  • Downloads (Last 6 weeks)3
Reflects downloads up to 03 Jan 2025

Other Metrics

Citations

Cited By

View all
  • (2024)CoCo-trieInformation Systems10.1016/j.is.2023.102316120:COnline publication date: 1-Feb-2024
  • (2023)A new class of string transformations for compressed text indexingInformation and Computation10.1016/j.ic.2023.105068294(105068)Online publication date: Oct-2023
  • (2022)Space/time-efficient RDF stores based on circular suffix sortingThe Journal of Supercomputing10.1007/s11227-022-04890-w79:5(5643-5683)Online publication date: 25-Oct-2022
  • (2022)On the Complexity of Recognizing Wheeler GraphsAlgorithmica10.1007/s00453-021-00917-584:3(784-814)Online publication date: 1-Mar-2022
  • (2022)Space Efficient Merging of de Bruijn Graphs and Wheeler GraphsAlgorithmica10.1007/s00453-021-00855-284:3(639-669)Online publication date: 1-Mar-2022
  • (2022)Compressed String Dictionaries via Data-Aware Subtrie CompactionString Processing and Information Retrieval10.1007/978-3-031-20643-6_17(233-249)Online publication date: 8-Nov-2022
  • (2021)Privacy-aware Character Pattern Matching over Outsourced Encrypted DataDigital Threats: Research and Practice10.1145/34623333:1(1-38)Online publication date: 22-Oct-2021
  • (2021)Accurate Cardinality Estimation of Co-occurring Words Using Suffix TreesDatabase Systems for Advanced Applications10.1007/978-3-030-73197-7_50(721-737)Online publication date: 6-Apr-2021
  • (2020)Lightweight merging of compressed indices based on BWT variantsTheoretical Computer Science10.1016/j.tcs.2019.11.001812(214-229)Online publication date: Apr-2020
  • (2019)Autocompletion for Prefix-Abbreviated InputProceedings of the 2019 International Conference on Management of Data10.1145/3299869.3319858(211-228)Online publication date: 25-Jun-2019
  • Show More Cited By

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media