[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/1321440.1321482acmconferencesArticle/Chapter ViewAbstractPublication PagescikmConference Proceedingsconference-collections
research-article

Parallel linkage

Published: 06 November 2007 Publication History

Abstract

We study the parallelization of the (record) linkage problem - i.e., to identify matching records between two collections of records, A and B. One of main idiosyncrasies of the linkage problem, compared to Database join, is the fact that once two records a in A and b in B are matched and merged to c, c needs to be compared to the rest of records in A and B again since it may incur new matching. This re-feeding stage of the linkage problem requires its solution to be iterative, and complicates the problem significantly. Toward this problem, we first discuss three plausible scenarios of inputs - when both collections are clean, only one is clean, and both are dirty. Then, we show that the intricate interplay between match and merge can exploit the characteristics of each scenario to achieve good parallelization. Our parallel algorithms achieve 6.55-7.49 times faster in speedup compared to sequential ones with 8 processors, and 11.15-18.56% improvement in efficiency compared to P-Swoosh.

References

[1]
O. Benjelloun, H. Garcia-Molina, Q. Su, and J. Widom. "Swoosh: A Generic Approach to Entity Resolution". Technical report, Stanford University, 2005.
[2]
O. Benjelloun et al. "D-Swoosh: A Family of Algorithms for Generic, Distributed Entity Resolution". Technical report, Stanford University, 2006.
[3]
M. Bilenko, R. Mooney, W. Cohen, P. Ravikumar, and S. Fienberg. "Adaptive Name-Matching in Information Integration". IEEE Intelligent System, 18(5):16--23, 2003.
[4]
S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani. "Robust and Efficient Fuzzy Match for Online Data Cleaning". In ACM SIGMOD, 2003.
[5]
P. Christen, T. Churches, and M. Hegland. "A Parallel Open Source Data Linkage System". In Springer Lecture Notes in Artificial Intelligence, Sydney, Australia, May 2004.
[6]
R. Cilibrasi and P. M. B. Vitanyi. "The Google Similarity Distance". IEEE Trans. on Knowledge and Data Engineering (TKDE), (3):370--383, 2007.
[7]
X. Dong, A. Y. Halevy, and J. Madhavan. "Reference Reconciliation in Complex Information Spaces". In ACM SIGMOD, 2005.
[8]
I. P. Fellegi and A. B. Sunter. "A Theory for Record Linkage". J. of the American Statistical Society, 64:1183--1210, 1969.
[9]
A. Grama, A. Gupta, G. Karypis, and V. Kumar. "Introduction to Parallel Computing (2nd Edition)". Addison Wesley, 2003.
[10]
L. Gravano, P. G. Ipeirotis, N. Koudas, and D. Srivastava. "Text Joins in an RDBMS for Web Data Integration". In Int'l World Wide Web Conf. (WWW), 2003.
[11]
M. A. Hernandez and S. J. Stolfo. "The Merge/Purge Problem for Large Databases". In ACM SIGMOD, 1995.
[12]
D. V. Kalashnikov, S. Mehrotra, and Z. Chen. "Exploiting Relationships for Domain-independent Data Cleaning". In SIAM Data Mining (SDM) Conf., 2005.
[13]
H. Kawai et al. "P-Swoosh: Parallel Algorithm for Generic Entity Resolution". Technical report, Stanford University, 2006.
[14]
D. Menestrina, O. Benjelloun, and H. Garcia-Molina. "Generic Entity Resolution with Data Confidences". In VLDB CleanDB Workshop, Seoul, Korea, Sep. 2006.
[15]
B.-W. On, N. Koudas, D. Lee, and D. Srivastava. "Group Linkage". In IEEE ICDE, Istanbul, Turkey, Apr. 2007.
[16]
B.-W. On, D. Lee, J. Kang, and P. Mitra. "Comparative Study of Name Disambiguation Problem using a Scalable Blocking-based Framework". In ACM/IEEE Joint Conf. on Digital Libraries (JCDL), Jun. 2005.
[17]
H. Pasula, B. Marthi, B. Milch, S. Russell, and I. Shpitser. "Identity Uncertainty and Citation Matching". In Advances in Neural Information Processing Systems. MIT Press, 2003.
[18]
S. Sarawagi and A. Bhamidipaty. "Interactive Deduplication using Active Learning". In ACM KDD, 2002.
[19]
D. A. Schneider and D. J. DeWitt. "A Performance Evaluation of Four Parallel Join Algorithms in a Shared-Nothing Multiprocessor Environment". In ACM SIGMOD, Portland, OR, May 1989.
[20]
W. Shen, X. Li, and A. Doan. "Constraint-based Entity Matching". In AAAI, 2005.
[21]
W. E. Winkler. "The State of Record Linkage and Current Research Problems". Technical report, US Bureau of the Census, Apr. 1999.

Cited By

View all
  • (2020)Efficient Sequential and Parallel Algorithms for Incremental Record LinkageComputational Advances in Bio and Medical Sciences10.1007/978-3-030-46165-2_3(26-38)Online publication date: 29-Apr-2020
  • (2019)Exploring hybrid parallel systems for probabilistic record linkageThe Journal of Supercomputing10.1007/s11227-018-2328-375:3(1137-1149)Online publication date: 1-Mar-2019
  • (2019)Unsupervised blocking and probabilistic parallelisation for record matching of distributed big dataThe Journal of Supercomputing10.1007/s11227-017-2008-875:2(623-645)Online publication date: 1-Feb-2019
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
CIKM '07: Proceedings of the sixteenth ACM conference on Conference on information and knowledge management
November 2007
1048 pages
ISBN:9781595938039
DOI:10.1145/1321440
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 06 November 2007

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. parallel linkage
  2. record linkage

Qualifiers

  • Research-article

Conference

CIKM07

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Upcoming Conference

CIKM '25

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)5
  • Downloads (Last 6 weeks)1
Reflects downloads up to 12 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2020)Efficient Sequential and Parallel Algorithms for Incremental Record LinkageComputational Advances in Bio and Medical Sciences10.1007/978-3-030-46165-2_3(26-38)Online publication date: 29-Apr-2020
  • (2019)Exploring hybrid parallel systems for probabilistic record linkageThe Journal of Supercomputing10.1007/s11227-018-2328-375:3(1137-1149)Online publication date: 1-Mar-2019
  • (2019)Unsupervised blocking and probabilistic parallelisation for record matching of distributed big dataThe Journal of Supercomputing10.1007/s11227-017-2008-875:2(623-645)Online publication date: 1-Feb-2019
  • (2019)An Overview of Big Data Issues in Privacy-Preserving Record LinkageAlgorithmic Aspects of Cloud Computing10.1007/978-3-030-19759-9_8(118-136)Online publication date: 28-Apr-2019
  • (2018)Leveraging Social Media Signals for Record LinkageProceedings of the 2018 World Wide Web Conference10.1145/3178876.3186018(1195-1204)Online publication date: 10-Apr-2018
  • (2017)Stream-based live entity resolution approach with adaptive duplicate count strategyInternational Journal of Web and Grid Services10.1504/IJWGS.2017.08516713:3(351-373)Online publication date: 1-Jan-2017
  • (2017)Multi-core Meta-blocking for Big Linked DataProceedings of the 13th International Conference on Semantic Systems10.1145/3132218.3132230(33-40)Online publication date: 11-Sep-2017
  • (2017)Parallel meta-blocking for scaling entity resolution over big heterogeneous dataInformation Systems10.1016/j.is.2016.12.00165:C(137-157)Online publication date: 1-Apr-2017
  • (2016)Efficient Record Linkage Algorithms Using Complete Linkage ClusteringPLOS ONE10.1371/journal.pone.015444611:4(e0154446)Online publication date: 28-Apr-2016
  • (2016)Probabilistic parallelisation of blocking non-matched records for big data2016 IEEE International Conference on Big Data (Big Data)10.1109/BigData.2016.7841009(3465-3473)Online publication date: Dec-2016
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media