More Web Proxy on the site http://driver.im/

research-article

TinyLex: static n-gram index pruning with perfect recall

Author:

Derrick CoetzeeAuthors Info & Claims

CIKM '08: Proceedings of the 17th ACM conference on Information and knowledge management

Pages 409 - 418

https://doi.org/10.1145/1458082.1458138

Published: 26 October 2008 Publication History

Abstract

Inverted indexes using sequences of characters (n-grams) as terms provide an error-resilient and language-independent way to query for arbitrary substrings and perform approximate matching in a text, but present a number of practical problems: they have a very large number of terms, they exhibit pathologically expensive worst-case query times on certain natural inputs, and they cannot cope with very short query strings. In word-based indexes, static index pruning has been successful in reducing index size while maintaining precision, at the expense of recall. Taking advantage of the unique inclusion structure of n-gram terms of different lengths, we show that the lexicon size of an n-gram index can be reduced by 7 to 15 times without any loss of recall, and without any increase in either index size or query time. Because the lexicon is typically stored in main memory, this substantially reduces the memory required for queries. Simultaneously, our construction is also the first overlapping n-gram index to place tunable worst-case bounds on false positives and to permit efficient queries on strings of any length. Using this construction, we also demonstrate the first feasible n-gram index using words rather than characters as units, and its applications to phrase searching.

References

[1]

E. S. Adams and A. C. Meltzer. Trigrams as index elements in full text retrieval: Observations and experimental results. In ACM Conference on Computer Science, pages 433--439, 1993.

Digital Library

[2]

V. N. Anh and A. Moffat. Inverted index compression using word-aligned binary codes. Inf. Retr., 8(1):151--166, 2005.

Digital Library

[3]

R. Arnold and T. Bell. A corpus for the evaluation of lossless compression algorithms. In DCC '97: Proceedings of the Conference on Data Compression, page 201, Washington, DC, USA, 1997. IEEE Computer Society.

Digital Library

[4]

D. Bahle, H. E. Williams, and J. Zobel. Efficient phrase querying with an auxiliary index. In SIGIR '02: Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, pages 215--221, New York, NY, USA, 2002. ACM.

Digital Library

[5]

B. H. Bloom. Space/time trade-offs in hash coding with allowable errors. Commun. ACM, 13(7):422--426, 1970.

Digital Library

[6]

S. Büttcher and C. L. A. Clarke. A document-centric approach to static index pruning in text retrieval systems. In CIKM '06: Proceedings of the 15th ACM international conference on Information and knowledge management, pages 182--189, New York, NY, USA, 2006. ACM.

Digital Library

[7]

D. Carmel, D. Cohen, R. Fagin, E. Farchi, M. Herscovici, Y. S. Maarek, and A. Soffer. Static index pruning for information retrieval systems. In SIGIR '01: Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, pages 43--50, New York, NY, USA, 2001. ACM.

Digital Library

[8]

R. J. D'Amore and C. P. Mah. One-time complete indexing of text: theory and practice. In SIGIR, pages 155--164, 1985.

Digital Library

[9]

R. Dementiev, J. Kärkkäinen, J. Mehnert, and P. Sanders. Better external memory suffix array construction. In ALENEX/ANALCO, pages 86--97, 2005.

[10]

L. Egghe. The distribution of n-grams. Scientometrics, 47(2):237--252, 2000.

[11]

International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature, 409(6822):860--921, 2001.

[12]

W. Kent, C. W. Sugnet, T. S. Furey, K. Roskin, T. H. Pringle, A. M. Zahler, and D. Haussler. The Human Genome Browser at UCSC. Genome Res, 12(6):996--1006, 2002.

[13]

M.-S. Kim, K.-Y. Whang, J.-G. Lee, and M.-J. Lee. n-Gram/2L: A space and time efficient two-level n-gram inverted index structure. In VLDB, pages 325--336, 2005.

Digital Library

[14]

C. P. Mah and R. J. D'Amore. Complete statistical indexing of text by overlapping word fragments. SIGIR Forum, 17(3):6--16, 1982.

Digital Library

[15]

V. Mäkinen. Trade off between compression and search times in compact suffix array. In ALENEX '01: Revised Papers from the Third International Workshop on Algorithm Engineering and Experimentation, pages 189--201, London, UK, 2001. Springer-Verlag.

Digital Library

[16]

U. Manber and S. Wu. GLIMPSE: a tool to search through entire file systems. In WTEC'94: Proceedings of the USENIX Winter 1994 Technical Conference on USENIX Winter 1994 Technical Conference, pages 4?4, Berkeley, CA, USA, 1994. USENIX Association.

Digital Library

[17]

J. Mayfield and P. McNamee. Single n-gram stemming. In SIGIR, pages 415--416, 2003.

Digital Library

[18]

P. McNamee, J. Mayfield, and C. Piatko. Haircut: a system for multilingual text retrieval in java. J. Comput. Small Coll., 17(3):8--22, 2002.

Digital Library

[19]

A. Moffat and L. Stuiver. Binary interpolative coding for effective index compression. Inf. Retr., 3(1):25--47, 2000.

Digital Library

[20]

Y. Ogawa and T. Matsuda. An efficient document retrieval method using n-gram indexing. Systems and Computers in Japan, 33(2):54--63, 2002.

[21]

V. Siivola and B. Pellom. Growing an n-gram language model. 2005.

[22]

A. Stolcke. Entropy-based pruning of backoff language models. In Proceedings of the DRAPA News Transcription and Understanding Workshop, pages 270--274, 1998.

[23]

H. E. Williams, J. Zobel, and P. Anderson. What's next? index structures for efficient phrase querying. In Australasian Database Conference, pages 141--152, 1999.

[24]

S. Wu and U. Manber. Agrep: A fast approximate pattern-matching tool. In Proc. of the Winter 1992 USENIX Conference, pages 153--162, San Francisco, California, 1991.

[25]

M. Yamamoto and K. W. Church. Using suffix arrays to compute term frequency and document frequency for all substrings in a corpus. Computational Linguistics, 27(1):1--30, 2001.

Digital Library

[26]

J. Zobel, A. Moffat, and K. Ramamohanarao. Inverted files versus signature files for text indexing. ACM Trans. Database Syst., 23(4):453--490, 1998.

Digital Library

Cited By

Kim YPark HShim KWoo K(2013)Efficient processing of substring match queries with inverted variable-length gram indexesInformation Sciences10.1016/j.ins.2013.04.037244(119-141)Online publication date: Sep-2013
https://doi.org/10.1016/j.ins.2013.04.037
Al-Jedady AAlsmadi IJarrah OQuwaider M(2012)Query retrieval enhancement based on Huffman index terms encodingProceedings of the 3rd International Conference on Information and Communication Systems10.1145/2222444.2222450(1-5)Online publication date: 3-Apr-2012
https://dl.acm.org/doi/10.1145/2222444.2222450
Bernard MJeudy BPeyrache JSebban MThollard F(2011)Using the H-Divergence to Prune Probabilistic AutomataProceedings of the 2011 IEEE 23rd International Conference on Tools with Artificial Intelligence10.1109/ICTAI.2011.114(725-731)Online publication date: 7-Nov-2011
https://dl.acm.org/doi/10.1109/ICTAI.2011.114
Show More Cited By

Index Terms

TinyLex: static n-gram index pruning with perfect recall
1. Information systems
  1. Data management systems
    1. Data structures
      1. Data layout
        Data compression
  2. Information retrieval
    1. Document representation
    2. Search engine architectures and scalability
      1. Search engine indexing

Recommendations

Inverted indexes for phrases and strings
SIGIR '11: Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval

Inverted indexes are the most fundamental and widely used data structures in information retrieval. For each unique word occurring in a document collection, the inverted index stores a list of the documents in which this word occurs. Compression ...
Multilingual author profiling on Facebook

Proposed a multilingual (Roman Urdu and English) author profiling corpus of Facebook profiles.Manually developed a bilingual dictionary (Roman Urdu to English) of 7749 entries and translated multilingual corpus using it.Applied 64 stylometry and 11 ...
An unsupervised method for identifying loanwords in Korean

This paper presents an unsupervised method for developing a character-based n-gram classifier that identifies loanwords or transliterated foreign words in Korean text. The classifier is trained on an unlabeled corpus using the Expectation Maximization ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

CIKM '08: Proceedings of the 17th ACM conference on Information and knowledge management

October 2008

1562 pages

ISBN:9781595939913

DOI:10.1145/1458082

General Chair:
James G. Shanahan
Church and Duncan Group Inc, USA
,
Program Chairs:
Sihem Amer-Yahia
Yahoo! Research, USA
,
Ioana Manolescu
INRIA, France
,
Yi Zhang
University of California, Santa Cruz, USA
,
David A. Evans
JustSystems Evans Research, USA
,
Alek Kolcz
Microsoft Live Labs, USA
,
Key-Sun Choi
KAIST, Korea
,
Abdur Chowdury
Twitter, USA

Copyright © 2008 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 26 October 2008

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Conference

CIKM08

Sponsor:

CIKM08: Conference on Information and Knowledge Management

October 26 - 30, 2008

California, Napa Valley, USA

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Upcoming Conference

CIKM '25

Sponsor:
sigir
sigir

The 34th ACM International Conference on Information and Knowledge Management

November 10 - 14, 2025

Seoul , Republic of Korea

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

6
Total Citations
View Citations
312
Total Downloads

Downloads (Last 12 months)4
Downloads (Last 6 weeks)1

Reflects downloads up to 02 Mar 2025

Other Metrics

View Author Metrics

Citations

Cited By

Kim YPark HShim KWoo K(2013)Efficient processing of substring match queries with inverted variable-length gram indexesInformation Sciences10.1016/j.ins.2013.04.037244(119-141)Online publication date: Sep-2013
https://doi.org/10.1016/j.ins.2013.04.037
Al-Jedady AAlsmadi IJarrah OQuwaider M(2012)Query retrieval enhancement based on Huffman index terms encodingProceedings of the 3rd International Conference on Information and Communication Systems10.1145/2222444.2222450(1-5)Online publication date: 3-Apr-2012
https://dl.acm.org/doi/10.1145/2222444.2222450
Bernard MJeudy BPeyrache JSebban MThollard F(2011)Using the H-Divergence to Prune Probabilistic AutomataProceedings of the 2011 IEEE 23rd International Conference on Tools with Artificial Intelligence10.1109/ICTAI.2011.114(725-731)Online publication date: 7-Nov-2011
https://dl.acm.org/doi/10.1109/ICTAI.2011.114
Yalniz IAltingovde IGüdükbay UUlusoy Ö(2010)Ottoman archives explorerJournal on Computing and Cultural Heritage 10.1145/1658346.16583482:3(1-20)Online publication date: 5-Jan-2010
https://dl.acm.org/doi/10.1145/1658346.1658348
Kim YWoo KPark HShim K(2010)Efficient processing of substring match queries with inverted q-gram indexes2010 IEEE 26th International Conference on Data Engineering (ICDE 2010)10.1109/ICDE.2010.5447866(721-732)Online publication date: Mar-2010
https://doi.org/10.1109/ICDE.2010.5447866
Parapar JFreire ABarreiro Á(2009)Revisiting N-Gram Based Models for Retrieval in Degraded Large CollectionsAdvances in Information Retrieval10.1007/978-3-642-00958-7_66(680-684)Online publication date: 2009
https://doi.org/10.1007/978-3-642-00958-7_66

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Table of Conten