More Web Proxy on the site http://driver.im/

article

Efficient index for retrieving top-k most frequent documents

Authors:

Shih-Bin WuAuthors Info & Claims

Journal of Discrete Algorithms, Volume 8, Issue 4

Pages 402 - 417

https://doi.org/10.1016/j.jda.2010.08.003

Published: 01 December 2010 Publication History

Abstract

In the document retrieval problem (Muthukrishnan, 2002), we are given a collection of documents (strings) of total length D in advance, and our target is to create an index for these documents such that for any subsequent input pattern P, we can identify which documents in the collection contain P. In this paper, we study a natural extension to the above document retrieval problem. We call this top-k frequent document retrieval, where instead of listing all documents containing P, our focus is to identify the top-k documents having most occurrences of P. This problem forms a basis for search engine tasks of retrieving documents ranked with TFIDF (Term Frequency-Inverse Document Frequency) metric. A related problem was studied by Muthukrishnan (2002) where the emphasis was on retrieving all the documents whose number of occurrences of the pattern P exceeds some frequency threshold f. However, from the information retrieval point of view, it is hard for a user to specify such a threshold value f and have a sense of how many documents will be reported as the output. We develop some additional building blocks which help the user overcome this limitation. These are used to derive an efficient index for top-k frequent document retrieval problem, answering queries in O(|P|+logDloglogD+k) time and taking O(DlogD) space. Our approach is based on a new use of the suffix tree called induced generalized suffix tree (IGST). The practicality of the proposed index is validated by the experimental results.

References

[1]

M.A. Bender, M. Farach-Colton, The LCA problem revisited, in: Proceedings of Latin American Symposium on Theoretical Informatics, 2000, pp. 88-94.

Digital Library

[2]

I. Bialynicka-Birula, R. Grossi, Rank-sensitive data structures, in: Proceedings of International Symposium on String Processing and Information Retrieval, 2005, pp. 79-90.

Digital Library

[3]

Boyer, R.S. and Moore, J.S., A fast string searching algorithm. Communications of the ACM. v20 i10. 762-772.

[4]

Cormen, T.H., Leiserson, C.E., Rivest, R.L. and Stein, C., Introduction to Algorithms. 2001. MIT Press.

[5]

http://www.genpat.uu.se/mtDB

[6]

http://en.wikipedia.org/wiki/EMule

[7]

W.K. Hon, R. Shah, J.S. Vitter, Ordered pattern matching: Towards full-text retrieval, Technical Report TR-06-008, Department of CS, Purdue University, 2006.

[8]

L.C.K. Hui, Color set size problem with applications to string matching, in: Proceedings of Symposium on Combinatorial Pattern Matching, 1992, pp. 230-243.

[9]

R.M. Karp, M.O. Rabin, Efficient randomized pattern-matching algorithms, Technical Report TR-31-81, Aiken Computational Laboratory, Harvard University, 1981.

[10]

Knuth, D.E., Morris, J.H. and Pratt, V.B., Fast pattern matching in strings. SIAM Journal on Computing. v6 i2. 323-350.

[11]

V. Makinen, G. Navarro, Position-restricted substring searching, in: Proceedings of Latin American Symposium on Theoretical Informatics, 2006, pp. 703-714.

Digital Library

[12]

Y. Matias, S. Muthukrishnan, S.C. Sahinalp, J. Ziv, Augmenting suffix trees, with applications, in: Proceedings of European Symposium on Algorithms, 1998, pp. 67-78.

Digital Library

[13]

McCreight, E.M., A space-economical suffix tree construction algorithm. Journal of the ACM. v23 i2. 262-272.

[14]

S. Muthukrishnan, Efficient algorithms for document retrieval problems, in: Proceedings of Symposium on Discrete Algorithms, 2002, pp. 657-666.

Digital Library

[15]

K. Sadakane, Succinct representations of lcp information and improvements in the compressed suffix arrays, in: Proceedings of Symposium on Discrete Algorithms, 2002, pp. 225-232.

[16]

K. Sadakane, Compressed suffix trees with full functionality, in: Theory of Computing Systems, 2007, pp. 589-607.

Digital Library

[17]

http://sourceforge.net

[18]

N. Valimaki, V. Makinen, Space-efficient algorithms for document retrieval, in: Proceedings of Symposium on Combinatorial Pattern Matching, 2007, pp. 205-215.

[19]

P. Weiner, Linear pattern matching algorithms, in: Proceedings of Symposium on Switching and Automata Theory, 1973, pp. 1-11.

Digital Library

[20]

http://mathworld.wolfram.com/ZipfsLaw.html

[21]

Willard, D.E., Log-logarithmic worst-case range queries are possible in space ¿(N). Information Processing Letters. v17 i2. 81-84.

[22]

Witten, I., Moffat, A. and Bell, T., Managing Gigabytes: Compressing and Indexing Documents and Images. 1999. Morgan Kaufmann Publishers, Los Altos, CA, USA.

Cited By

Bille PGørtz IPedersen MSteiner T(2022)Gapped Indexing for Consecutive OccurrencesAlgorithmica10.1007/s00453-022-01051-685:4(879-901)Online publication date: 20-Oct-2022
https://dl.acm.org/doi/10.1007/s00453-022-01051-6
Gog SKonow RNavarro G(2017)Practical Compact Indexes for Top-k Document RetrievalACM Journal of Experimental Algorithmics10.1145/304395822(1-37)Online publication date: 2-Mar-2017
https://dl.acm.org/doi/10.1145/3043958
Navarro GNekrich Y(2017)Time-Optimal Top-$k$ Document RetrievalSIAM Journal on Computing10.1137/14099894946:1(80-113)Online publication date: 8-Feb-2017
https://dl.acm.org/doi/10.1137/140998949
Show More Cited By

Efficient index for retrieving top-k most frequent documents
1. Information systems
  1. Information retrieval

Recommendations

Efficient Index for Retrieving Top-k Most Frequent Documents
SPIRE '09: Proceedings of the 16th International Symposium on String Processing and Information Retrieval

In the <em>document retrieval problem</em> [9], we are given a collection of documents (strings) of total length <em>D</em> in advance, and our target is to create an index for these documents such that for any subsequent input pattern <em>P</em> , we ...
A passage-based approach to learning to rank documents
Abstract
According to common relevance-judgments regimes, such as TREC’s, a document can be deemed relevant to a query even if it contains a very short passage of text with pertinent information. This fact has motivated work on passage-based document ...
Using autobiographic information to retrieve real and electronic documents
Proceedings of the 2007 conference on Human interface: Part I

Current systems force users to store their documents in hierarchic filesystems. However, users remember their documents not in terms of ad-hoc categories, but of their contents, why they were written, etc. Describing documents using such autobiographic ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Journal of Discrete Algorithms

Journal of Discrete Algorithms Volume 8, Issue 4

December, 2010

92 pages

ISSN:1570-8667

Issue’s Table of Contents

Copyright © Elsevier B.V. © 2010.

Publisher

Elsevier Science Publishers B. V.

Netherlands

Publication History

Published: 01 December 2010

Author Tags

Qualifiers

Article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

9
Total Citations
View Citations
0
Total Downloads

Downloads (Last 12 months)0
Downloads (Last 6 weeks)0

Reflects downloads up to 11 Jan 2025

Other Metrics

View Author Metrics

Citations

Cited By

Bille PGørtz IPedersen MSteiner T(2022)Gapped Indexing for Consecutive OccurrencesAlgorithmica10.1007/s00453-022-01051-685:4(879-901)Online publication date: 20-Oct-2022
https://dl.acm.org/doi/10.1007/s00453-022-01051-6
Gog SKonow RNavarro G(2017)Practical Compact Indexes for Top-k Document RetrievalACM Journal of Experimental Algorithmics10.1145/304395822(1-37)Online publication date: 2-Mar-2017
https://dl.acm.org/doi/10.1145/3043958
Navarro GNekrich Y(2017)Time-Optimal Top-$k$ Document RetrievalSIAM Journal on Computing10.1137/14099894946:1(80-113)Online publication date: 8-Feb-2017
https://dl.acm.org/doi/10.1137/140998949
Hon WShah RThankachan SVitter J(2014)Space-Efficient Frameworks for Top-k String RetrievalJournal of the ACM10.1145/259077461:2(1-36)Online publication date: 24-Apr-2014
https://dl.acm.org/doi/10.1145/2590774
Navarro G(2014)Spaces, Trees, and ColorsACM Computing Surveys10.1145/253593346:4(1-47)Online publication date: 1-Mar-2014
https://dl.acm.org/doi/10.1145/2535933
Ferrada HNavarro G(2014)Efficient Compressed Indexing for Approximate Top-k String RetrievalProceedings of the 21st International Symposium on String Processing and Information Retrieval - Volume 879910.1007/978-3-319-11918-2_3(18-30)Online publication date: 20-Oct-2014
https://dl.acm.org/doi/10.1007/978-3-319-11918-2_3
Navarro GThankachan S(2013)Faster Top-k Document Retrieval in Optimal SpaceProceedings of the 20th International Symposium on String Processing and Information Retrieval - Volume 821410.1007/978-3-319-02432-5_28(255-262)Online publication date: 7-Oct-2013
https://dl.acm.org/doi/10.1007/978-3-319-02432-5_28
Navarro GNekrich Y(2012)Top-k document retrieval in optimal time and linear spaceProceedings of the twenty-third annual ACM-SIAM symposium on Discrete algorithms10.5555/2095116.2095200(1066-1077)Online publication date: 17-Jan-2012
https://dl.acm.org/doi/10.5555/2095116.2095200
Patil MThankachan SShah RHon WVitter JChandrasekaran SMa WNie JBaeza-Yates RChua TCroft W(2011)Inverted indexes for phrases and stringsProceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval10.1145/2009916.2009992(555-564)Online publication date: 24-Jul-2011
https://dl.acm.org/doi/10.1145/2009916.2009992

View Options

View options

Media

Figures

Other

Tables

View Issue’s Table of Contents