
WOO: a scalable and multi-tenant platform for continuous knowledge base synthesis

Published: 01 August 2013

Abstract

Search, exploration and social experience on the Web have recently undergone tremendous changes, with search engines, web portals and social networks offering a different perspective on information discovery and consumption. This new perspective is aimed at capturing user intents and providing richer, highly connected experiences. The new battleground revolves around technologies for the ingestion, disambiguation and enrichment of entities from a variety of structured and unstructured data sources; we refer to this process as knowledge base synthesis. This paper presents the design, implementation and production deployment of the Web Of Objects (WOO) system, a Hadoop-based platform tackling such challenges. WOO has been designed and implemented to enable various products in Yahoo! to synthesize knowledge bases (KBs) of entities relevant to their domains. The implementation of WOO we describe is currently used by several Yahoo! properties, such as Intonow, Yahoo! Local, Yahoo! Events and Yahoo! Search. This paper highlights: (i) challenges that arise in designing, building and operating a platform that handles multi-domain, multi-version, and multi-tenant disambiguation of web-scale knowledge bases (hundreds of millions of entities), (ii) the architecture and technical solutions we devised, and (iii) an evaluation on real-world production datasets.



Published In

Proceedings of the VLDB Endowment, Volume 6, Issue 11
August 2013
237 pages

Publisher

VLDB Endowment

Cited By

  • (2022) Impact of the Characteristics of Multi-source Entity Matching Tasks on the Performance of Active Learning Methods. The Semantic Web, pp. 113-129. DOI: 10.1007/978-3-031-06981-9_7. Online publication date: 29-May-2022
  • (2022) Entity Resolution in the Web of Data. Online publication date: 28-Mar-2022
  • (2021) Graph-Boosted Active Learning for Multi-source Entity Resolution. The Semantic Web – ISWC 2021, pp. 182-199. DOI: 10.1007/978-3-030-88361-4_11. Online publication date: 24-Oct-2021
  • (2020) Layered Graph Embedding for Entity Recommendation using Wikipedia in the Yahoo! Knowledge Graph. Companion Proceedings of the Web Conference 2020, pp. 811-818. DOI: 10.1145/3366424.3383570. Online publication date: 20-Apr-2020
  • (2020) Incremental Multi-source Entity Resolution for Knowledge Graph Completion. The Semantic Web, pp. 393-408. DOI: 10.1007/978-3-030-49461-2_23. Online publication date: 31-May-2020
  • (2018) Ease.ml. Proceedings of the VLDB Endowment 11:5, pp. 607-620. DOI: 10.1145/3187009.3177737. Online publication date: 1-Jan-2018
  • (2018) Ease.ml. Proceedings of the VLDB Endowment 11:5, pp. 607-620. DOI: 10.1145/3177732.3177737. Online publication date: 5-Oct-2018
  • (2018) Incremental Clustering on Linked Data. 2018 IEEE International Conference on Data Mining Workshops (ICDMW), pp. 531-538. DOI: 10.1109/ICDMW.2018.00084. Online publication date: Nov-2018
  • (2017) Management and Analysis of Big Graph Data: Current Systems and Open Challenges. Handbook of Big Data Technologies, pp. 457-505. DOI: 10.1007/978-3-319-49340-4_14. Online publication date: 26-Feb-2017
  • (2015) Learning Entity Types from Query Logs via Graph-Based Modeling. Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pp. 603-612. DOI: 10.1145/2806416.2806498. Online publication date: 17-Oct-2015
