WOO: a scalable and multi-tenant platform for continuous knowledge base synthesis

Authors:

Ashwin Machanavajjhala,

Mandar Rahurkar,

Aamod Sane

Proceedings of the VLDB Endowment, Volume 6, Issue 11

Pages 1114 - 1125

https://doi.org/10.14778/2536222.2536236

Published: 01 August 2013

Abstract

Search, exploration and social experience on the Web have recently undergone tremendous changes, with search engines, web portals and social networks offering a different perspective on information discovery and consumption. This new perspective is aimed at capturing user intents and providing richer and highly connected experiences. The new battleground revolves around technologies for the ingestion, disambiguation and enrichment of entities from a variety of structured and unstructured data sources; we refer to this process as knowledge base synthesis. This paper presents the design, implementation and production deployment of the Web Of Objects (WOO) system, a Hadoop-based platform tackling such challenges. WOO has been designed and implemented to enable various products in Yahoo! to synthesize knowledge bases (KBs) of entities relevant to their domains. Currently, the implementation of WOO we describe is used by various Yahoo! properties such as Intonow, Yahoo! Local, Yahoo! Events and Yahoo! Search. This paper highlights: (i) challenges that arise in designing, building and operating a platform that handles multi-domain, multi-version, and multi-tenant disambiguation of web-scale knowledge bases (hundreds of millions of entities), (ii) the architecture and technical solutions we devised, and (iii) an evaluation on real-world production datasets.

Index Terms

WOO: a scalable and multi-tenant platform for continuous knowledge base synthesis
1. Information systems
2. Software and its engineering
  1. Software organization and properties
    1. Extra-functional properties
      1. Interoperability

Index terms have been assigned to the content through auto-classification.

Published In

Proceedings of the VLDB Endowment, Volume 6, Issue 11

August 2013

237 pages

ISSN: 2150-8097

Publisher

VLDB Endowment
