[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/1458082.1458138acmconferencesArticle/Chapter ViewAbstractPublication PagescikmConference Proceedingsconference-collections
research-article

TinyLex: static n-gram index pruning with perfect recall

Published: 26 October 2008 Publication History

Abstract

Inverted indexes using sequences of characters (n-grams) as terms provide an error-resilient and language-independent way to query for arbitrary substrings and perform approximate matching in a text, but present a number of practical problems: they have a very large number of terms, they exhibit pathologically expensive worst-case query times on certain natural inputs, and they cannot cope with very short query strings. In word-based indexes, static index pruning has been successful in reducing index size while maintaining precision, at the expense of recall. Taking advantage of the unique inclusion structure of n-gram terms of different lengths, we show that the lexicon size of an n-gram index can be reduced by 7 to 15 times without any loss of recall, and without any increase in either index size or query time. Because the lexicon is typically stored in main memory, this substantially reduces the memory required for queries. Simultaneously, our construction is also the first overlapping n-gram index to place tunable worst-case bounds on false positives and to permit efficient queries on strings of any length. Using this construction, we also demonstrate the first feasible n-gram index using words rather than characters as units, and its applications to phrase searching.

References

[1]
E. S. Adams and A. C. Meltzer. Trigrams as index elements in full text retrieval: Observations and experimental results. In ACM Conference on Computer Science, pages 433--439, 1993.
[2]
V. N. Anh and A. Moffat. Inverted index compression using word-aligned binary codes. Inf. Retr., 8(1):151--166, 2005.
[3]
R. Arnold and T. Bell. A corpus for the evaluation of lossless compression algorithms. In DCC '97: Proceedings of the Conference on Data Compression, page 201, Washington, DC, USA, 1997. IEEE Computer Society.
[4]
D. Bahle, H. E. Williams, and J. Zobel. Efficient phrase querying with an auxiliary index. In SIGIR '02: Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, pages 215--221, New York, NY, USA, 2002. ACM.
[5]
B. H. Bloom. Space/time trade-offs in hash coding with allowable errors. Commun. ACM, 13(7):422--426, 1970.
[6]
S. Büttcher and C. L. A. Clarke. A document-centric approach to static index pruning in text retrieval systems. In CIKM '06: Proceedings of the 15th ACM international conference on Information and knowledge management, pages 182--189, New York, NY, USA, 2006. ACM.
[7]
D. Carmel, D. Cohen, R. Fagin, E. Farchi, M. Herscovici, Y. S. Maarek, and A. Soffer. Static index pruning for information retrieval systems. In SIGIR '01: Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, pages 43--50, New York, NY, USA, 2001. ACM.
[8]
R. J. D'Amore and C. P. Mah. One-time complete indexing of text: theory and practice. In SIGIR, pages 155--164, 1985.
[9]
R. Dementiev, J. Kärkkäinen, J. Mehnert, and P. Sanders. Better external memory suffix array construction. In ALENEX/ANALCO, pages 86--97, 2005.
[10]
L. Egghe. The distribution of n-grams. Scientometrics, 47(2):237--252, 2000.
[11]
International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature, 409(6822):860--921, 2001.
[12]
W. Kent, C. W. Sugnet, T. S. Furey, K. Roskin, T. H. Pringle, A. M. Zahler, and D. Haussler. The Human Genome Browser at UCSC. Genome Res, 12(6):996--1006, 2002.
[13]
M.-S. Kim, K.-Y. Whang, J.-G. Lee, and M.-J. Lee. n-Gram/2L: A space and time efficient two-level n-gram inverted index structure. In VLDB, pages 325--336, 2005.
[14]
C. P. Mah and R. J. D'Amore. Complete statistical indexing of text by overlapping word fragments. SIGIR Forum, 17(3):6--16, 1982.
[15]
V. Mäkinen. Trade off between compression and search times in compact suffix array. In ALENEX '01: Revised Papers from the Third International Workshop on Algorithm Engineering and Experimentation, pages 189--201, London, UK, 2001. Springer-Verlag.
[16]
U. Manber and S. Wu. GLIMPSE: a tool to search through entire file systems. In WTEC'94: Proceedings of the USENIX Winter 1994 Technical Conference on USENIX Winter 1994 Technical Conference, pages 4?4, Berkeley, CA, USA, 1994. USENIX Association.
[17]
J. Mayfield and P. McNamee. Single n-gram stemming. In SIGIR, pages 415--416, 2003.
[18]
P. McNamee, J. Mayfield, and C. Piatko. Haircut: a system for multilingual text retrieval in java. J. Comput. Small Coll., 17(3):8--22, 2002.
[19]
A. Moffat and L. Stuiver. Binary interpolative coding for effective index compression. Inf. Retr., 3(1):25--47, 2000.
[20]
Y. Ogawa and T. Matsuda. An efficient document retrieval method using n-gram indexing. Systems and Computers in Japan, 33(2):54--63, 2002.
[21]
V. Siivola and B. Pellom. Growing an n-gram language model. 2005.
[22]
A. Stolcke. Entropy-based pruning of backoff language models. In Proceedings of the DRAPA News Transcription and Understanding Workshop, pages 270--274, 1998.
[23]
H. E. Williams, J. Zobel, and P. Anderson. What's next? index structures for efficient phrase querying. In Australasian Database Conference, pages 141--152, 1999.
[24]
S. Wu and U. Manber. Agrep: A fast approximate pattern-matching tool. In Proc. of the Winter 1992 USENIX Conference, pages 153--162, San Francisco, California, 1991.
[25]
M. Yamamoto and K. W. Church. Using suffix arrays to compute term frequency and document frequency for all substrings in a corpus. Computational Linguistics, 27(1):1--30, 2001.
[26]
J. Zobel, A. Moffat, and K. Ramamohanarao. Inverted files versus signature files for text indexing. ACM Trans. Database Syst., 23(4):453--490, 1998.

Cited By

View all
  • (2013)Efficient processing of substring match queries with inverted variable-length gram indexesInformation Sciences10.1016/j.ins.2013.04.037244(119-141)Online publication date: Sep-2013
  • (2012)Query retrieval enhancement based on Huffman index terms encodingProceedings of the 3rd International Conference on Information and Communication Systems10.1145/2222444.2222450(1-5)Online publication date: 3-Apr-2012
  • (2011)Using the H-Divergence to Prune Probabilistic AutomataProceedings of the 2011 IEEE 23rd International Conference on Tools with Artificial Intelligence10.1109/ICTAI.2011.114(725-731)Online publication date: 7-Nov-2011
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
CIKM '08: Proceedings of the 17th ACM conference on Information and knowledge management
October 2008
1562 pages
ISBN:9781595939913
DOI:10.1145/1458082
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 26 October 2008

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. TinyLex
  2. index
  3. n-gram
  4. phrase searching
  5. pruning
  6. static pruning

Qualifiers

  • Research-article

Conference

CIKM08
CIKM08: Conference on Information and Knowledge Management
October 26 - 30, 2008
California, Napa Valley, USA

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Upcoming Conference

CIKM '25

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)4
  • Downloads (Last 6 weeks)1
Reflects downloads up to 02 Mar 2025

Other Metrics

Citations

Cited By

View all
  • (2013)Efficient processing of substring match queries with inverted variable-length gram indexesInformation Sciences10.1016/j.ins.2013.04.037244(119-141)Online publication date: Sep-2013
  • (2012)Query retrieval enhancement based on Huffman index terms encodingProceedings of the 3rd International Conference on Information and Communication Systems10.1145/2222444.2222450(1-5)Online publication date: 3-Apr-2012
  • (2011)Using the H-Divergence to Prune Probabilistic AutomataProceedings of the 2011 IEEE 23rd International Conference on Tools with Artificial Intelligence10.1109/ICTAI.2011.114(725-731)Online publication date: 7-Nov-2011
  • (2010)Ottoman archives explorerJournal on Computing and Cultural Heritage 10.1145/1658346.16583482:3(1-20)Online publication date: 5-Jan-2010
  • (2010)Efficient processing of substring match queries with inverted q-gram indexes2010 IEEE 26th International Conference on Data Engineering (ICDE 2010)10.1109/ICDE.2010.5447866(721-732)Online publication date: Mar-2010
  • (2009)Revisiting N-Gram Based Models for Retrieval in Degraded Large CollectionsAdvances in Information Retrieval10.1007/978-3-642-00958-7_66(680-684)Online publication date: 2009

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media