More Web Proxy on the site http://driver.im/

research-article

Parallel linkage

Authors:

Dongwon LeeAuthors Info & Claims

CIKM '07: Proceedings of the sixteenth ACM conference on Conference on information and knowledge management

Pages 283 - 292

https://doi.org/10.1145/1321440.1321482

Published: 06 November 2007 Publication History

Abstract

We study the parallelization of the (record) linkage problem - i.e., to identify matching records between two collections of records, A and B. One of main idiosyncrasies of the linkage problem, compared to Database join, is the fact that once two records a in A and b in B are matched and merged to c, c needs to be compared to the rest of records in A and B again since it may incur new matching. This re-feeding stage of the linkage problem requires its solution to be iterative, and complicates the problem significantly. Toward this problem, we first discuss three plausible scenarios of inputs - when both collections are clean, only one is clean, and both are dirty. Then, we show that the intricate interplay between match and merge can exploit the characteristics of each scenario to achieve good parallelization. Our parallel algorithms achieve 6.55-7.49 times faster in speedup compared to sequential ones with 8 processors, and 11.15-18.56% improvement in efficiency compared to P-Swoosh.

References

[1]

O. Benjelloun, H. Garcia-Molina, Q. Su, and J. Widom. "Swoosh: A Generic Approach to Entity Resolution". Technical report, Stanford University, 2005.

[2]

O. Benjelloun et al. "D-Swoosh: A Family of Algorithms for Generic, Distributed Entity Resolution". Technical report, Stanford University, 2006.

[3]

M. Bilenko, R. Mooney, W. Cohen, P. Ravikumar, and S. Fienberg. "Adaptive Name-Matching in Information Integration". IEEE Intelligent System, 18(5):16--23, 2003.

Digital Library

[4]

S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani. "Robust and Efficient Fuzzy Match for Online Data Cleaning". In ACM SIGMOD, 2003.

Digital Library

[5]

P. Christen, T. Churches, and M. Hegland. "A Parallel Open Source Data Linkage System". In Springer Lecture Notes in Artificial Intelligence, Sydney, Australia, May 2004.

[6]

R. Cilibrasi and P. M. B. Vitanyi. "The Google Similarity Distance". IEEE Trans. on Knowledge and Data Engineering (TKDE), (3):370--383, 2007.

Digital Library

[7]

X. Dong, A. Y. Halevy, and J. Madhavan. "Reference Reconciliation in Complex Information Spaces". In ACM SIGMOD, 2005.

Digital Library

[8]

I. P. Fellegi and A. B. Sunter. "A Theory for Record Linkage". J. of the American Statistical Society, 64:1183--1210, 1969.

[9]

A. Grama, A. Gupta, G. Karypis, and V. Kumar. "Introduction to Parallel Computing (2nd Edition)". Addison Wesley, 2003.

[10]

L. Gravano, P. G. Ipeirotis, N. Koudas, and D. Srivastava. "Text Joins in an RDBMS for Web Data Integration". In Int'l World Wide Web Conf. (WWW), 2003.

Digital Library

[11]

M. A. Hernandez and S. J. Stolfo. "The Merge/Purge Problem for Large Databases". In ACM SIGMOD, 1995.

Digital Library

[12]

D. V. Kalashnikov, S. Mehrotra, and Z. Chen. "Exploiting Relationships for Domain-independent Data Cleaning". In SIAM Data Mining (SDM) Conf., 2005.

[13]

H. Kawai et al. "P-Swoosh: Parallel Algorithm for Generic Entity Resolution". Technical report, Stanford University, 2006.

[14]

D. Menestrina, O. Benjelloun, and H. Garcia-Molina. "Generic Entity Resolution with Data Confidences". In VLDB CleanDB Workshop, Seoul, Korea, Sep. 2006.

[15]

B.-W. On, N. Koudas, D. Lee, and D. Srivastava. "Group Linkage". In IEEE ICDE, Istanbul, Turkey, Apr. 2007.

[16]

B.-W. On, D. Lee, J. Kang, and P. Mitra. "Comparative Study of Name Disambiguation Problem using a Scalable Blocking-based Framework". In ACM/IEEE Joint Conf. on Digital Libraries (JCDL), Jun. 2005.

Digital Library

[17]

H. Pasula, B. Marthi, B. Milch, S. Russell, and I. Shpitser. "Identity Uncertainty and Citation Matching". In Advances in Neural Information Processing Systems. MIT Press, 2003.

[18]

S. Sarawagi and A. Bhamidipaty. "Interactive Deduplication using Active Learning". In ACM KDD, 2002.

Digital Library

[19]

D. A. Schneider and D. J. DeWitt. "A Performance Evaluation of Four Parallel Join Algorithms in a Shared-Nothing Multiprocessor Environment". In ACM SIGMOD, Portland, OR, May 1989.

Digital Library

[20]

W. Shen, X. Li, and A. Doan. "Constraint-based Entity Matching". In AAAI, 2005.

Digital Library

[21]

W. E. Winkler. "The State of Record Linkage and Current Research Problems". Technical report, US Bureau of the Census, Apr. 1999.

Cited By

Baihan AAmmar RAseltine RBaihan MRajasekaran S(2020)Efficient Sequential and Parallel Algorithms for Incremental Record LinkageComputational Advances in Bio and Medical Sciences10.1007/978-3-030-46165-2_3(26-38)Online publication date: 29-Apr-2020
https://doi.org/10.1007/978-3-030-46165-2_3
Boratto MAlonso PPinto CMelo PBarreto MDenaxas S(2019)Exploring hybrid parallel systems for probabilistic record linkageThe Journal of Supercomputing10.1007/s11227-018-2328-375:3(1137-1149)Online publication date: 1-Mar-2019
https://dl.acm.org/doi/10.1007/s11227-018-2328-3
Dou CCui YSun DWong RAtif MLi GRanjan R(2019)Unsupervised blocking and probabilistic parallelisation for record matching of distributed big dataThe Journal of Supercomputing10.1007/s11227-017-2008-875:2(623-645)Online publication date: 1-Feb-2019
https://dl.acm.org/doi/10.1007/s11227-017-2008-8
Show More Cited By

Index Terms

Parallel linkage

Recommendations

Subsequent patient visit detection in a high volume OPD using record linkage techniques
COMPUTE '10: Proceedings of the Third Annual ACM Bangalore Conference

Record or data linkage techniques are used to link records which represent the same entity (e.g. patient, customer, citation, etc.) in one or more data sets where a unique identifier for each entity is not available in all or any of the data sets to be ...
Leveraging Social Media Signals for Record Linkage
WWW '18: Proceedings of the 2018 World Wide Web Conference

Many data-intensive applications collect (structured) data from a variety of sources. A key task in this process is record linkage, which is the problem of determining the records from these sources that refer to the same real-world entities. ...
Multiple instance learning for group record linkage
PAKDD'12: Proceedings of the 16th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining - Volume Part I

Record linkage is the process of identifying records that refer to the same entities from different data sources. While most research efforts are concerned with linking individual records, new approaches have recently been proposed to link groups of ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

CIKM '07: Proceedings of the sixteenth ACM conference on Conference on information and knowledge management

November 2007

1048 pages

ISBN:9781595938039

DOI:10.1145/1321440

Co-chair:
Alberto H. F. Laender,
Conference Chairs:
André O. Falcão
Universidade de Lisboa, Portugal
,
Øystein Haug Olsen,
General Chair:
Mário J. Silva
(Universidade de Lisboa, Portugal)
,
Program Chairs:
Ricardo Baeza-Yates,
Deborah L. McGuinness,
Bjorn Olstad

Copyright © 2007 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 06 November 2007

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Conference

CIKM07

Sponsor:

CIKM07: Conference on Information and Knowledge Management

November 6 - 10, 2007

Lisbon, Portugal

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Upcoming Conference

CIKM '25

Sponsor:
sigir
sigir

The 34th ACM International Conference on Information and Knowledge Management

November 10 - 14, 2025

Seoul , Republic of Korea

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

26
Total Citations
View Citations
421
Total Downloads

Downloads (Last 12 months)5
Downloads (Last 6 weeks)1

Reflects downloads up to 12 Dec 2024

Other Metrics

View Author Metrics

Citations

Cited By

Baihan AAmmar RAseltine RBaihan MRajasekaran S(2020)Efficient Sequential and Parallel Algorithms for Incremental Record LinkageComputational Advances in Bio and Medical Sciences10.1007/978-3-030-46165-2_3(26-38)Online publication date: 29-Apr-2020
https://doi.org/10.1007/978-3-030-46165-2_3
Boratto MAlonso PPinto CMelo PBarreto MDenaxas S(2019)Exploring hybrid parallel systems for probabilistic record linkageThe Journal of Supercomputing10.1007/s11227-018-2328-375:3(1137-1149)Online publication date: 1-Mar-2019
https://dl.acm.org/doi/10.1007/s11227-018-2328-3
Dou CCui YSun DWong RAtif MLi GRanjan R(2019)Unsupervised blocking and probabilistic parallelisation for record matching of distributed big dataThe Journal of Supercomputing10.1007/s11227-017-2008-875:2(623-645)Online publication date: 1-Feb-2019
https://dl.acm.org/doi/10.1007/s11227-017-2008-8
Vatsalan DKarapiperis DGkoulalas-Divanis A(2019)An Overview of Big Data Issues in Privacy-Preserving Record LinkageAlgorithmic Aspects of Cloud Computing10.1007/978-3-030-19759-9_8(118-136)Online publication date: 28-Apr-2019
https://doi.org/10.1007/978-3-030-19759-9_8
Schneider AMukherjee ADragut EChampin PGandon FMédini LLalmas MIpeirotis P(2018)Leveraging Social Media Signals for Record LinkageProceedings of the 2018 World Wide Web Conference10.1145/3178876.3186018(1195-1204)Online publication date: 10-Apr-2018
https://dl.acm.org/doi/10.1145/3178876.3186018
(2017)Stream-based live entity resolution approach with adaptive duplicate count strategyInternational Journal of Web and Grid Services10.1504/IJWGS.2017.08516713:3(351-373)Online publication date: 1-Jan-2017
https://dl.acm.org/doi/10.1504/IJWGS.2017.085167
Papadakis GBereta KPalpanas TKoubarakis M(2017)Multi-core Meta-blocking for Big Linked DataProceedings of the 13th International Conference on Semantic Systems10.1145/3132218.3132230(33-40)Online publication date: 11-Sep-2017
https://dl.acm.org/doi/10.1145/3132218.3132230
Efthymiou VPapadakis GPapastefanatos GStefanidis KPalpanas T(2017)Parallel meta-blocking for scaling entity resolution over big heterogeneous dataInformation Systems10.1016/j.is.2016.12.00165:C(137-157)Online publication date: 1-Apr-2017
https://dl.acm.org/doi/10.1016/j.is.2016.12.001
Mamun AAseltine RRajasekaran S(2016)Efficient Record Linkage Algorithms Using Complete Linkage ClusteringPLOS ONE10.1371/journal.pone.015444611:4(e0154446)Online publication date: 28-Apr-2016
https://doi.org/10.1371/journal.pone.0154446
Dou CSun DChen YLi GLiu J(2016)Probabilistic parallelisation of blocking non-matched records for big data2016 IEEE International Conference on Big Data (Big Data)10.1109/BigData.2016.7841009(3465-3473)Online publication date: Dec-2016
https://doi.org/10.1109/BigData.2016.7841009
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Table of Contents