[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
research-article

Practical Compact Indexes for Top-k Document Retrieval

Published: 02 March 2017 Publication History

Abstract

We present a fast and compact index for top-k document retrieval on general string collections, in which given a string pattern, the index returns the k documents where it appears most often. We adapt a linear-space and optimal-time theoretical solution, whose implementation poses various algorithm engineering challenges. Although a naive implementation of the optimal solution is estimated to require around 80n bytes for a text collection of n symbols, our implementation requires 2.5n to 3.0n bytes, text included, and answers queries within microseconds. This outperforms all previous practical indexes by orders of magnitude; the only index using less space is hundreds of times slower. Our index can be built on collections of hundreds of gigabytes and on tokenized text collections.

References

[1]
D. Arroyuelo, R. Cánovas, G. Navarro, and K. Sadakane. 2010. Succinct trees in practice. In Proceedings of the 11th Workshop on Algorithm Engineering and Experiments (ALENEX’10). 84--97.
[2]
R. Baeza-Yates and B. Ribeiro-Neto. 2011. Modern Information Retrieval (2nd ed.). Addison-Wesley.
[3]
D. Belazzougui, G. Navarro, and D. Valenzuela. 2013. Improved compressed indexes for full-text document retrieval. Journal of Discrete Algorithms 18, 3--13.
[4]
D. Benoit, E. D. Demaine, J. I. Munro, R. Raman, V. Raman, and S. S. Rao. 2005. Representing trees of higher degree. Algorithmica 43, 4, 275--292.
[5]
N. Brisaboa, G. de Bernardo, R. Konow, G. Navarro, and D. Seco. 2016. Aggregated 2D range queries on clustered points. Information Systems 60, 34--49.
[6]
N. Brisaboa, S. Ladra, and G. Navarro. 2013. DACs: Bringing direct access to variable-length codes. Information Processing and Management 49, 1, 392--404.
[7]
N. Brisaboa, S. Ladra, and G. Navarro. 2014. Compact representation of Web graphs with extended functionality. Information Systems 39, 1, 152--174.
[8]
S. Büttcher, C. Clarke, and G. Cormack. 2010. Information Retrieval: Implementing and Evaluating Search Engines. MIT Press, Cambridge, MA.
[9]
D. R. Clark. 1996. Compact Pat Trees. Ph.D. Dissertation. Waterloo, Ontario, Canada.
[10]
B. Croft, D. Metzler, and T. Strohman. 2009. Search Engines: Information Retrieval in Practice. Pearson Education.
[11]
J. S. Culpepper, M. Petri, and F. Scholer. 2012. Efficient in-memory top-k document retrieval. In Proceedings of the 35th International ACM Conference on Research and Development in Information Retrieval (SIGIR’12). 225--234.
[12]
S. Culpepper, G. Navarro, S. Puglisi, and A. Turpin. 2010. Top-k ranked document search in general text databases. In Algorithms—ESA 2010. Lecture Notes in Computer Science, Vol. 6347. Springer, 194--205.
[13]
H. Ferrada and G. Navarro. 2014. Efficient compressed indexing for approximate top-k string retrieval. In String Processing and Information Retrieval. Lecture Notes in Computer Science, Vol. 8799. Springer, 18--30.
[14]
P. Ferragina and G. Manzini. 2005. Indexing compressed text. Journal of the ACM 52, 4, 552--581.
[15]
J. Fischer and V. Heun. 2011. Space-efficient preprocessing schemes for range minimum queries on static arrays. SIAM Journal on Computing 40, 2, 465--492.
[16]
T. Gagie, G. Navarro, and S. J. Puglisi. 2012. New algorithms on wavelet trees and applications to information retrieval. Theoretical Computer Science 426--427, 25--41.
[17]
S. Gog. 2011. Compressed Suffix Trees: Design, Construction, and Applications. Ph.D. Dissertation. University of Ulm, Germany.
[18]
S. Gog, T. Beller, A. Moffat, and M. Petri. 2014. From theory to practice: Plug and play with succinct data structures. In Proceedings of the 13th International Symposium on Experimental Algorithms (SEA’14). 326--337.
[19]
S. Gog and G. Navarro. 2015. Improved single-term top-k document retrieval. In Proceedings of the 17th Workshop on Algorithm Engineering and Experiments (ALENEX’15). 24--32.
[20]
S. Gog and M. Petri. 2014. Optimized succinct data structures for massive data. Software —Practice and Experience 44, 11, 1287--1314.
[21]
R. Grossi, A. Gupta, and J. Vitter. 2003. High-order entropy-compressed text indexes. In Proceedings of the 14th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA’03). 841--850.
[22]
W.-K. Hon, M. Patil, R. Shah, and S.-B. Wu. 2010. Efficient index for retrieving top-k most frequent documents. Journal of Discrete Algorithms 8, 4, 402--417.
[23]
W.-K. Hon, R. Shah, and J. S. Vitter. 2009. Space-efficient framework for top-k string retrieval problems. In Proceedings of the 50th Annual Symposium on Foundations of Computer Science (FOCS’09). 713--722.
[24]
J. Karkkainen, D. Kempa, and S. J. Puglisi. 2014. Hybrid compression of bitvectors for the FM-index. In Proceedings of the 24th Data Compression Conference (DCC’14). 302--311.
[25]
R. Konow and G. Navarro. 2013. Faster compact top-k document retrieval. In Proceedings of the 23rd Data Compression Conference (DCC’13). 351--360.
[26]
N. J. Larsson and A. Moffat. 2000. Off-line dictionary-based compression. Proceedings of the IEEE 88, 11, 1722--1732.
[27]
T.-Y. Liu. 2009. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval 3, 3, 225--331.
[28]
V. Mäkinen and G. Navarro. 2006. Position-restricted substring searching. In LATIN 2006: Theoretical Informatics. Lecture Notes in Computer Science, Vol. 3887. Springer, 703--714.
[29]
U. Manber and E. W. Myers. 1993. Suffix arrays: A new method for on-line string searches. SIAM Journal on Computing 22, 5, 935--948.
[30]
I. Munro. 1996. Tables. In Foundations of Software Technology and Theoretical Computer Science. Lecture Notes in Computer Science, Vol. 1180. Springer, 37--42.
[31]
J. I. Munro and V. Raman. 2002. Succinct representation of balanced parentheses and static trees. SIAM Journal on Computing 31, 3, 762--776.
[32]
S. Muthukrishnan. 2002. Efficient algorithms for document retrieval problems. In Proceedings of the 13th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA’02). 657--666.
[33]
G. Navarro. 2014. Spaces, trees and colors: The algorithmic landscape of document retrieval on sequences. ACM Computing Surveys 46, 4, Article No. 52.
[34]
G. Navarro. 2016. Compact Data Structures: A Practical Approach. Cambridge University Press.
[35]
G. Navarro and V. Mäkinen. 2007. Compressed full-text indexes. ACM Computing Surveys 39, 1, Article No. 2.
[36]
G. Navarro and Y. Nekrich. 2012. Top-k document retrieval in optimal time and linear space. In Proceedings of the 23rd Annual ACM-SIAM Symposium on Discrete Algorithms (SODA’12).
[37]
G. Navarro, Y. Nekrich, and L. Russo. 2013. Space-efficient data-analysis queries on grids. Theoretical Computer Science 482, 60--72.
[38]
G. Navarro, S. J. Puglisi, and J. Sirén. 2014a. Document retrieval on repetitive collections. In Algorithms—ESA 2014. Lecture Notes in Computer Science, Vol. 8737. Springer, 725--736.
[39]
G. Navarro, S. J. Puglisi, and D. Valenzuela. 2014b. General document retrieval in compact space. ACM Journal of Experimental Algorithmics 19, 2, Article No. 3.
[40]
G. Navarro and K. Sadakane. 2014. Fully-functional static and dynamic succinct trees. ACM Transactions on Algorithms 10, 3, Article No. 16.
[41]
D. Okanohara and K. Sadakane. 2007. Practical entropy-compressed rank/select dictionary. In Proceedings of the 9th Workshop on Algorithm Engineering and Experiments (ALENEX’07). 60--70.
[42]
M. Patil, S. V. Thankachan, R. Shah, W.-K. Hon, J. S. Vitter, and S. Chandrasekaran. 2011. Inverted indexes for phrases and strings. In Proceedings of the 34th International ACM Conference on Research and Development in Information Retrieval (SIGIR’11). 555--564.
[43]
R. Raman, V. Raman, and S. S. Rao. 2007. Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. ACM Transactions on Algorithms 3, 4, Article No. 43.
[44]
W. Szpankowski. 1993. A generalized suffix tree and its (un)expected asymptotic behaviors. SIAM Journal on Computing 22, 6, 1176--1198.
[45]
S. Vigna. 2008. Broadword implementation of rank/select queries. In Proceedings of the 7th International Conference on Experimental Algorithms (WEA’08). 154--168.
[46]
P. Weiner. 1973. Linear pattern matching algorithms. In Proceedings of the 14th Annual Symposium on Switching and Automata Theory (SWAT’73). 1--11.

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Journal of Experimental Algorithmics
ACM Journal of Experimental Algorithmics  Volume 22, Issue
2017
175 pages
ISSN:1084-6654
EISSN:1084-6654
DOI:10.1145/3047249
Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 02 March 2017
Accepted: 01 December 2016
Received: 01 January 2016
Published in JEA Volume 22

Author Tags

  1. Compact data structure
  2. top-k document retrieval

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • basal funds FB0001, Conicyt, Chile; Fondecyt
  • Conicyt Ph.D. scholarship

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • 0
    Total Citations
  • 334
    Total Downloads
  • Downloads (Last 12 months)14
  • Downloads (Last 6 weeks)0
Reflects downloads up to 11 Jan 2025

Other Metrics

Citations

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media