[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/2187836.2187854acmotherconferencesArticle/Chapter ViewAbstractPublication PagesthewebconfConference Proceedingsconference-collections
research-article

Compressed data structures for annotated web search

Published: 16 April 2012 Publication History

Abstract

Entity relationship search at Web scale depends on adding dozens of entity annotations to each of billions of crawled pages and indexing the annotations at rates comparable to regular text indexing. Even small entity search benchmarks from TREC and INEX suggest that the entity catalog support thousands of entity types and tens to hundreds of millions of entities. The above targets raise many challenges, major ones being the design of highly compressed data structures in RAM for spotting and disambiguating entity mentions, and highly compressed disk-based annotation indices. These data structures cannot be readily built upon standard inverted indices. Here we present a Web scale entity annotator and annotation index. Using a new workload-sensitive compressed multilevel map, we fit statistical disambiguation models for millions of entities within 1.15GB of RAM, and spend about 0.6 core-milliseconds per disambiguation. In contrast, DBPedia Spotlight spends 158 milliseconds, Wikipedia Miner spends 21 milliseconds, and Zemanta spends 9.5 milliseconds. Our annotation indices use ideas from vertical databases to reduce storage by 30%. On 40x8 cores with 40x3 disk spindles, we can annotate and index, in about a day, a billion Web pages with two million entities and 200,000 types from Wikipedia. Index decompression and scan speed are comparable to MG4J.

References

[1]
D. Abadi, S. Madden, and M. Ferreira. Integrating compression and execution in column-oriented database systems. In SIGMOD Conference, pages 671--682. ACM, 2006.
[2]
S. Agrawal, K. Chakrabarti, S. Chaudhuri, V. Ganti, A. C. Konig, and D. Xin. Exploiting web search engines to search structured databases. In WWW Conference, pages 501--510, 2009.
[3]
K. Balog, L. Azzopardi, and M. de Rijke. A language modeling framework for expert finding. Information Processing and Management, 45(1):1--19, 2009.
[4]
P. Boldi and S. Vigna. Compressed perfect embedded skip lists for quick inverted-index lookups. In String Processing and Information Retrieval, volume 3772 of LNCS, pages 25--28. Springer, 2005.
[5]
P. Boldi and S. Vigna. MG4J at TREC 2005. In E. M. Voorhees and L. P. Buckland, editors,TREC, number SP 500-266 in Special Publications. NIST, 2005.
[6]
S. Chakrabarti, K. Puniyani, and S. Das. Optimizing scoring functions and indexes for proximity search in type-annotated corpora. In WWW Conference, pages 717--726, Edinburgh, May 2006.
[7]
S. Chakrabarti, D. Sane, and G. Ramakrishnan. Web-scale entity-relation search architecture (poster). In WWW Conference, pages 21--22, 2011.
[8]
A. Chandel, P. C. Nagesh, and S. Sarawagi. Efficient batch top-k search for dictionary-based entity recognition. In ICDE, page 28, 2006.
[9]
T. Cheng and K. C.-C. Chang. Beyond pages: supporting efficient, scalable entity search with dual-inversion index. In EDBT, pages 15--26. ACM, 2010.
[10]
T. Cheng, X. Yan, and K. C. Chang. EntityRank: Searching entities directly and holistically. In VLDB Conference, pages 387--398, Sept. 2007.
[11]
F. Chierichetti, S. Lattanzi, F. Mari, and A. Panconesi. On placing skips optimally in expectation. In WSDM Conference, pages 15--24, 2008.
[12]
S. Cucerzan. Large-scale named entity disambiguation based on Wikipedia data. In EMNLP Conference, pages 708--716, 2007.
[13]
S. Dill et al. SemTag and Seeker: Bootstrapping the semantic Web via automated semantic annotation. In WWW Conference, pages 178--186, 2003.
[14]
S. Ding, J. Attenberg, and T. Suel. Scalable techniques for document identifier assignment in inverted indexes. In WWW Conference, pages 311--320, 2010.
[15]
R. V. Guha and R. McCool. TAP: A semantic web test-bed.Journal of Web Semantics, 1(1):81--87, 2003.
[16]
X. Han, L. Sun, and J. Zhao. Collective entity linking in Web text: A graph-based method. In SIGIR Conference, pages 765--774, 2011.
[17]
J. Hoffart et al. Robust disambiguation of named entities in text. In EMNLP Conference, pages 782--792, Edinburgh, Scotland, UK, July 2011. SIGDAT.
[18]
T. C. Hu and A. C. Tucker. Optimal computer search trees and variable-length alphabetical codes.SIAM Journal of Applied Mathematics, 21(4):514--532, 1971.
[19]
G. Kasneci, F. M. Suchanek, G. Ifrim, S. Elbassuoni, M. Ramanath, and G. Weikum. NAGA: harvesting, searching and ranking knowledge. In SIGMOD Conference, pages 1285--1288. ACM, 2008.
[20]
S. Kulkarni, A. Singh, G. Ramakrishnan, and S. Chakrabarti. Collective annotation of Wikipedia entities in Web text. In SIGKDD Conference, pages 457--466, 2009.
[21]
X. Li, C. Li, and C. Yu. EntityEngine: Answering entity-relationship queries using shallow semantics. In CIKM, Oct. 2010. (demo).
[22]
C. Martinez and S. Roura. Optimal and nearly optimal static weighted skip lists. Technical Report LSI-95-34-R, Universitat Politecnica de Catalunya, 1995.
[23]
P. N. Mendes, M. Jakob, A. Garcia-Silva, and C. Bizer. DBpedia Spotlight: Shedding light on the Web of documents. In 7th International Conference on Semantic Systems, pages 1--8, 2011.
[24]
R. Mihalcea and A. Csomai. Wikify!: linking documents to encyclopedic knowledge. In CIKM, pages 233--242, 2007.
[25]
D. Milne. An open-source toolkit for mining Wikipedia (wikipedia-miner.sourceforge.net=publications.htm). To be announced, 2009.
[26]
D. Milne and I. H. Witten. Learning to link with Wikipedia. In CIKM, pages 509--518, 2008.
[27]
S. Sarawagi. Information extraction. FnT Databases, 1(3), 2008.
[28]
V. R. Shanks, H. E. Williams, and A. Cannane. Indexing for fast categorisation. In Australasian Computer Science Conference, volume 16 of ACSC26, pages 119--127, Darlinghurst, Australia, 2003.
[29]
F. M. Suchanek, G. Kasneci, and G. Weikum. YAGO: A core of semantic knowledge unifying WordNet and Wikipedia. In WWW Conference, pages 697--706. ACM Press, 2007.
[30]
I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan-Kaufmann, May 1999.
[31]
M. Zukowski, S. Heman, N. Nes, and P. Boncz. Super-scalar RAM-CPU cache compression. In ICDE, pages 59--70, 2006.

Cited By

View all

Index Terms

  1. Compressed data structures for annotated web search

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image ACM Other conferences
    WWW '12: Proceedings of the 21st international conference on World Wide Web
    April 2012
    1078 pages
    ISBN:9781450312295
    DOI:10.1145/2187836
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Sponsors

    • Univ. de Lyon: Universite de Lyon

    In-Cooperation

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 16 April 2012

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. annotation
    2. entity
    3. indexing
    4. search

    Qualifiers

    • Research-article

    Conference

    WWW 2012
    Sponsor:
    • Univ. de Lyon
    WWW 2012: 21st World Wide Web Conference 2012
    April 16 - 20, 2012
    Lyon, France

    Acceptance Rates

    Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)3
    • Downloads (Last 6 weeks)0
    Reflects downloads up to 23 Dec 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2019)Ad hoc retrieval via entity linking and semantic similarityKnowledge and Information Systems10.1007/s10115-018-1190-158:3(551-583)Online publication date: 1-Mar-2019
    • (2018)Knowledge Extraction and Inference from TextThe 41st International ACM SIGIR Conference on Research & Development in Information Retrieval10.1145/3209978.3210190(1399-1402)Online publication date: 27-Jun-2018
    • (2018)Term-Based Models for Entity RankingEntity-Oriented Search10.1007/978-3-319-93935-3_3(57-99)Online publication date: 3-Oct-2018
    • (2018)IntroductionEntity-Oriented Search10.1007/978-3-319-93935-3_1(1-23)Online publication date: 3-Oct-2018
    • (2013)Learning relatedness measures for entity linkingProceedings of the 22nd ACM international conference on Information & Knowledge Management10.1145/2505515.2505711(139-148)Online publication date: 27-Oct-2013
    • (2013)Learning joint query interpretation and response rankingProceedings of the 22nd international conference on World Wide Web10.1145/2488388.2488484(1099-1110)Online publication date: 13-May-2013
    • (2013)Data-based research at IIT BombayACM SIGMOD Record10.1145/2481528.248153642:1(38-43)Online publication date: 1-May-2013
    • (2013)Improved text annotation with Wikipedia entitiesProceedings of the 28th Annual ACM Symposium on Applied Computing10.1145/2480362.2480425(288-295)Online publication date: 18-Mar-2013
    • (2013)Semi-automatic Dictionary Curation for Domain-Specific OntologiesProceedings of the 2013 IEEE 25th International Conference on Tools with Artificial Intelligence10.1109/ICTAI.2013.112(727-734)Online publication date: 4-Nov-2013
    • (2013)Web-scale entity annotation using MapReduce20th Annual International Conference on High Performance Computing10.1109/HiPC.2013.6799137(99-108)Online publication date: Dec-2013

    View Options

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media