article

Frameworks for entity matching: A comparison

Authors:

Erhard Rahm

Data & Knowledge Engineering, Volume 69, Issue 2

Pages 197 - 210

https://doi.org/10.1016/j.datak.2009.10.003

Published: 01 February 2010

Abstract

Entity matching is a crucial and difficult task for data integration. Entity matching frameworks provide several methods, and ways of combining them, to effectively solve different match tasks. In this paper, we comparatively analyze 11 proposed frameworks for entity matching. Our study considers both frameworks that do and frameworks that do not use training data to semi-automatically find an entity matching strategy for a given match task. Moreover, we consider support for blocking and for the combination of different match algorithms. We further study how the different frameworks have been evaluated. The study aims to explore the current state of the art in research prototypes of entity matching frameworks and their evaluations. The proposed criteria should help to identify promising framework approaches and make it possible to categorize and comparatively assess additional entity matching frameworks and their evaluations.


Published In

Data & Knowledge Engineering Volume 69, Issue 2

February, 2010

80 pages

ISSN:0169-023X

Copyright © 2009 Elsevier B.V.

Publisher

Elsevier Science Publishers B. V.

Netherlands
