Frameworks for entity matching: A comparison

Published: 01 February 2010

Abstract

Entity matching is a crucial and difficult task in data integration. Entity matching frameworks provide several matching methods, and ways to combine them, to solve different match tasks effectively. In this paper, we comparatively analyze 11 proposed frameworks for entity matching. Our study covers both frameworks that use training data to semi-automatically find an entity matching strategy for a given match task and frameworks that do not. Moreover, we consider support for blocking and for the combination of different match algorithms. We further study how the different frameworks have been evaluated. The study aims to explore the current state of the art in research prototypes of entity matching frameworks and their evaluations. The proposed criteria should help to identify promising framework approaches and make it possible to categorize and comparatively assess additional entity matching frameworks and their evaluations.

Published In

Data & Knowledge Engineering, Volume 69, Issue 2
February 2010
80 pages

Publisher

Elsevier Science Publishers B. V.

Netherlands

Author Tags

  1. Entity matching
  2. Entity resolution
  3. Match optimization
  4. Matcher combination
  5. Training selection

Qualifiers

  • Article
