More Web Proxy on the site http://driver.im/

research-article

Compressed data structures for annotated web search

Authors:

Soumen Chakrabarti,

Sasidhar Kasturi,

Bharath Balakrishnan,

Ganesh Ramakrishnan,

Rohit SarafAuthors Info & Claims

WWW '12: Proceedings of the 21st international conference on World Wide Web

Pages 121 - 130

https://doi.org/10.1145/2187836.2187854

Published: 16 April 2012 Publication History

Abstract

Entity relationship search at Web scale depends on adding dozens of entity annotations to each of billions of crawled pages and indexing the annotations at rates comparable to regular text indexing. Even small entity search benchmarks from TREC and INEX suggest that the entity catalog support thousands of entity types and tens to hundreds of millions of entities. The above targets raise many challenges, major ones being the design of highly compressed data structures in RAM for spotting and disambiguating entity mentions, and highly compressed disk-based annotation indices. These data structures cannot be readily built upon standard inverted indices. Here we present a Web scale entity annotator and annotation index. Using a new workload-sensitive compressed multilevel map, we fit statistical disambiguation models for millions of entities within 1.15GB of RAM, and spend about 0.6 core-milliseconds per disambiguation. In contrast, DBPedia Spotlight spends 158 milliseconds, Wikipedia Miner spends 21 milliseconds, and Zemanta spends 9.5 milliseconds. Our annotation indices use ideas from vertical databases to reduce storage by 30%. On 40x8 cores with 40x3 disk spindles, we can annotate and index, in about a day, a billion Web pages with two million entities and 200,000 types from Wikipedia. Index decompression and scan speed are comparable to MG4J.

References

[1]

D. Abadi, S. Madden, and M. Ferreira. Integrating compression and execution in column-oriented database systems. In SIGMOD Conference, pages 671--682. ACM, 2006.

Digital Library

[2]

S. Agrawal, K. Chakrabarti, S. Chaudhuri, V. Ganti, A. C. Konig, and D. Xin. Exploiting web search engines to search structured databases. In WWW Conference, pages 501--510, 2009.

Digital Library

[3]

K. Balog, L. Azzopardi, and M. de Rijke. A language modeling framework for expert finding. Information Processing and Management, 45(1):1--19, 2009.

Digital Library

[4]

P. Boldi and S. Vigna. Compressed perfect embedded skip lists for quick inverted-index lookups. In String Processing and Information Retrieval, volume 3772 of LNCS, pages 25--28. Springer, 2005.

Digital Library

[5]

P. Boldi and S. Vigna. MG4J at TREC 2005. In E. M. Voorhees and L. P. Buckland, editors,TREC, number SP 500-266 in Special Publications. NIST, 2005.

[6]

S. Chakrabarti, K. Puniyani, and S. Das. Optimizing scoring functions and indexes for proximity search in type-annotated corpora. In WWW Conference, pages 717--726, Edinburgh, May 2006.

Digital Library

[7]

S. Chakrabarti, D. Sane, and G. Ramakrishnan. Web-scale entity-relation search architecture (poster). In WWW Conference, pages 21--22, 2011.

Digital Library

[8]

A. Chandel, P. C. Nagesh, and S. Sarawagi. Efficient batch top-k search for dictionary-based entity recognition. In ICDE, page 28, 2006.

Digital Library

[9]

T. Cheng and K. C.-C. Chang. Beyond pages: supporting efficient, scalable entity search with dual-inversion index. In EDBT, pages 15--26. ACM, 2010.

Digital Library

[10]

T. Cheng, X. Yan, and K. C. Chang. EntityRank: Searching entities directly and holistically. In VLDB Conference, pages 387--398, Sept. 2007.

Digital Library

[11]

F. Chierichetti, S. Lattanzi, F. Mari, and A. Panconesi. On placing skips optimally in expectation. In WSDM Conference, pages 15--24, 2008.

Digital Library

[12]

S. Cucerzan. Large-scale named entity disambiguation based on Wikipedia data. In EMNLP Conference, pages 708--716, 2007.

[13]

S. Dill et al. SemTag and Seeker: Bootstrapping the semantic Web via automated semantic annotation. In WWW Conference, pages 178--186, 2003.

Digital Library

[14]

S. Ding, J. Attenberg, and T. Suel. Scalable techniques for document identifier assignment in inverted indexes. In WWW Conference, pages 311--320, 2010.

Digital Library

[15]

R. V. Guha and R. McCool. TAP: A semantic web test-bed.Journal of Web Semantics, 1(1):81--87, 2003.

[16]

X. Han, L. Sun, and J. Zhao. Collective entity linking in Web text: A graph-based method. In SIGIR Conference, pages 765--774, 2011.

Digital Library

[17]

J. Hoffart et al. Robust disambiguation of named entities in text. In EMNLP Conference, pages 782--792, Edinburgh, Scotland, UK, July 2011. SIGDAT.

Digital Library

[18]

T. C. Hu and A. C. Tucker. Optimal computer search trees and variable-length alphabetical codes.SIAM Journal of Applied Mathematics, 21(4):514--532, 1971.

[19]

G. Kasneci, F. M. Suchanek, G. Ifrim, S. Elbassuoni, M. Ramanath, and G. Weikum. NAGA: harvesting, searching and ranking knowledge. In SIGMOD Conference, pages 1285--1288. ACM, 2008.

Digital Library

[20]

S. Kulkarni, A. Singh, G. Ramakrishnan, and S. Chakrabarti. Collective annotation of Wikipedia entities in Web text. In SIGKDD Conference, pages 457--466, 2009.

Digital Library

[21]

X. Li, C. Li, and C. Yu. EntityEngine: Answering entity-relationship queries using shallow semantics. In CIKM, Oct. 2010. (demo).

Digital Library

[22]

C. Martinez and S. Roura. Optimal and nearly optimal static weighted skip lists. Technical Report LSI-95-34-R, Universitat Politecnica de Catalunya, 1995.

[23]

P. N. Mendes, M. Jakob, A. Garcia-Silva, and C. Bizer. DBpedia Spotlight: Shedding light on the Web of documents. In 7th International Conference on Semantic Systems, pages 1--8, 2011.

Digital Library

[24]

R. Mihalcea and A. Csomai. Wikify!: linking documents to encyclopedic knowledge. In CIKM, pages 233--242, 2007.

Digital Library

[25]

D. Milne. An open-source toolkit for mining Wikipedia (wikipedia-miner.sourceforge.net=publications.htm). To be announced, 2009.

[26]

D. Milne and I. H. Witten. Learning to link with Wikipedia. In CIKM, pages 509--518, 2008.

Digital Library

[27]

S. Sarawagi. Information extraction. FnT Databases, 1(3), 2008.

Digital Library

[28]

V. R. Shanks, H. E. Williams, and A. Cannane. Indexing for fast categorisation. In Australasian Computer Science Conference, volume 16 of ACSC26, pages 119--127, Darlinghurst, Australia, 2003.

Digital Library

[29]

F. M. Suchanek, G. Kasneci, and G. Weikum. YAGO: A core of semantic knowledge unifying WordNet and Wikipedia. In WWW Conference, pages 697--706. ACM Press, 2007.

Digital Library

[30]

I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan-Kaufmann, May 1999.

Digital Library

[31]

M. Zukowski, S. Heman, N. Nes, and P. Boncz. Super-scalar RAM-CPU cache compression. In ICDE, pages 59--70, 2006.

Digital Library

Cited By

Ensan FDu W(2019)Ad hoc retrieval via entity linking and semantic similarityKnowledge and Information Systems10.1007/s10115-018-1190-158:3(551-583)Online publication date: 1-Mar-2019
https://dl.acm.org/doi/10.1007/s10115-018-1190-1
Chakrabarti SCollins-Thompson KMei QDavison BLiu YYilmaz E(2018)Knowledge Extraction and Inference from TextThe 41st International ACM SIGIR Conference on Research & Development in Information Retrieval10.1145/3209978.3210190(1399-1402)Online publication date: 27-Jun-2018
https://dl.acm.org/doi/10.1145/3209978.3210190
Balog KBalog K(2018)Term-Based Models for Entity RankingEntity-Oriented Search10.1007/978-3-319-93935-3_3(57-99)Online publication date: 3-Oct-2018
https://doi.org/10.1007/978-3-319-93935-3_3
Show More Cited By

Index Terms

Compressed data structures for annotated web search
1. Information systems
  1. Information retrieval
    1. Information retrieval query processing

Recommendations

Online annotation of text streams with structured entities
CIKM '10: Proceedings of the 19th ACM international conference on Information and knowledge management

We propose a framework and algorithm for annotating unbounded text streams with entities of a structured database. The algorithm allows one to correlate unstructured and dirty text streams from sources such as emails, chats and blogs, to entities stored ...
Comparison of Methods to Annotate Named Entity Corpora

The authors compared two methods for annotating a corpus for the named entity (NE) recognition task using non-expert annotators: (i) revising the results of an existing NE recognizer and (ii) manually annotating the NEs completely. The annotation time, ...
Towards annotation of video as part of search
MMM'10: Proceedings of the 16th international conference on Advances in Multimedia Modeling

Search for multimedia is hampered by both the lack of quality annotations and a quantity of annotations. In recent years there has been a growth in multimedia search services that emphasis interactivity between the user and the interface. Some of these ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Other conferences

WWW '12: Proceedings of the 21st international conference on World Wide Web

April 2012

1078 pages

ISBN:9781450312295

DOI:10.1145/2187836

General Chairs:
Alain Mille
Université de Lyon, France
,
Fabien Gandon
INRIA, France
,
Jacques Misselis
HP, France
,
Program Chairs:
Michael Rabinovich
Case Western Reserve University, USA
,
Steffen Staab
University of Koblenz-Landau, Germany

Copyright © 2012 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Univ. de Lyon: Universite de Lyon

In-Cooperation

SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 16 April 2012

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Conference

WWW 2012

Sponsor:

Univ. de Lyon

WWW 2012: 21st World Wide Web Conference 2012

April 16 - 20, 2012

Lyon, France

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

10
Total Citations
View Citations
443
Total Downloads

Downloads (Last 12 months)2
Downloads (Last 6 weeks)0

Reflects downloads up to 05 Mar 2025

Other Metrics

View Author Metrics

Citations

Cited By

Ensan FDu W(2019)Ad hoc retrieval via entity linking and semantic similarityKnowledge and Information Systems10.1007/s10115-018-1190-158:3(551-583)Online publication date: 1-Mar-2019
https://dl.acm.org/doi/10.1007/s10115-018-1190-1
Chakrabarti SCollins-Thompson KMei QDavison BLiu YYilmaz E(2018)Knowledge Extraction and Inference from TextThe 41st International ACM SIGIR Conference on Research & Development in Information Retrieval10.1145/3209978.3210190(1399-1402)Online publication date: 27-Jun-2018
https://dl.acm.org/doi/10.1145/3209978.3210190
Balog KBalog K(2018)Term-Based Models for Entity RankingEntity-Oriented Search10.1007/978-3-319-93935-3_3(57-99)Online publication date: 3-Oct-2018
https://doi.org/10.1007/978-3-319-93935-3_3
Balog KBalog K(2018)IntroductionEntity-Oriented Search10.1007/978-3-319-93935-3_1(1-23)Online publication date: 3-Oct-2018
https://doi.org/10.1007/978-3-319-93935-3_1
Ceccarelli DLucchese COrlando SPerego RTrani SHe QIyengar ANejdl WPei JRastogi R(2013)Learning relatedness measures for entity linkingProceedings of the 22nd ACM international conference on Information & Knowledge Management10.1145/2505515.2505711(139-148)Online publication date: 27-Oct-2013
https://dl.acm.org/doi/10.1145/2505515.2505711
Sawant UChakrabarti SSchwabe DAlmeida VGlaser HBaeza-Yates RMoon S(2013)Learning joint query interpretation and response rankingProceedings of the 22nd international conference on World Wide Web10.1145/2488388.2488484(1099-1110)Online publication date: 13-May-2013
https://dl.acm.org/doi/10.1145/2488388.2488484
Chakrabarti SRamakrishnan GRamamritham KSarawagi SSudarshan S(2013)Data-based research at IIT BombayACM SIGMOD Record10.1145/2481528.248153642:1(38-43)Online publication date: 1-May-2013
https://dl.acm.org/doi/10.1145/2481528.2481536
Makris CPlegas YTheodoridis EShin SMaldonado J(2013)Improved text annotation with Wikipedia entitiesProceedings of the 28th Annual ACM Symposium on Applied Computing10.1145/2480362.2480425(288-295)Online publication date: 18-Mar-2013
https://dl.acm.org/doi/10.1145/2480362.2480425
Kulkarni AGavankar CRamakrishnan GRaghavan S(2013)Semi-automatic Dictionary Curation for Domain-Specific OntologiesProceedings of the 2013 IEEE 25th International Conference on Tools with Artificial Intelligence10.1109/ICTAI.2013.112(727-734)Online publication date: 4-Nov-2013
https://dl.acm.org/doi/10.1109/ICTAI.2013.112
Gupta SChandramouli VChakrabarti S(2013)Web-scale entity annotation using MapReduce20th Annual International Conference on High Performance Computing10.1109/HiPC.2013.6799137(99-108)Online publication date: Dec-2013
https://doi.org/10.1109/HiPC.2013.6799137

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Table of Conten