[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/2488388.2488415acmotherconferencesArticle/Chapter ViewAbstractPublication PagesthewebconfConference Proceedingsconference-collections
research-article

Optimal hashing schemes for entity matching

Published: 13 May 2013 Publication History

Abstract

In this paper, we consider the problem of devising blocking schemes for entity matching. There is a lot of work on blocking techniques for supporting various kinds of predicates, e.g. exact matches, fuzzy string-similarity matches, and spatial matches. However, given a complex entity matching function in the form of a Boolean expression over several such predicates, we show that it is an important and non-trivial problem to combine the individual blocking techniques into an efficient blocking scheme for the entity matching function, a problem that has not been studied previously.
In this paper, we make fundamental contributions to this problem. We consider an abstraction for modeling complex entity matching functions as well as blocking schemes. We present several results of theoretical and practical interest for the problem. We show that in general, the problem of computing the optimal blocking strategy is NP-hard in the size of the DNF formula describing the matching function. We also present several algorithms for computing the exact optimal strategies (with exponential complexity, but often feasible in practice) as well as fast approximation algorithms. We experimentally demonstrate over commercially used rule-based matching systems over real datasets at Yahoo!, as well as synthetic datasets, that our blocking strategies can be an order of magnitude faster than the baseline methods, and our algorithms can efficiently find good blocking strategies.

References

[1]
Arvind Arasu, Venkatesh Ganti, and Raghav Kaushik. Efficient exact set-similarity joins. In VLDB, pages 918--929, 2006.
[2]
Arvind Arasu, Michaela Götz, and Raghav Kaushik. On active learning of record matching packages. In SIGMOD Conference, pages 783--794, 2010.
[3]
Brian Babcock and Surajit Chaudhuri. Towards a robust query optimizer: a principled and practical approach. In SIGMOD, pages 119--130, 2005.
[4]
Ziv Bar-Yossef, Ravi Kumar, and D. Sivakumar. Sampling algorithms: lower bounds and applications. In STOC, pages 266--275, 2001.
[5]
Indrajit Bhattacharya and Lise Getoor. A latent dirichlet model for unsupervised entity resolution. In SIAM Conference on Data Mining (SDM), 2006.
[6]
Indrajit Bhattacharya and Lise Getoor. Collective entity resolution in relational data. ACM Trans. Knowl. Discov. Data, 1(1), 2007.
[7]
Mikhail Bilenko, Beena Kamath, and Raymond J. Mooney. Adaptive blocking: Learning to scale up record linkage and clustering. In ICDM, 2006.
[8]
Surajit Chaudhuri, Venkatesh Ganti, and Raghav Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, 2006.
[9]
V. Chvatal. A greedy heuristic for the set-covering problem. Mathematics of Operations Research, 4:233--235, Aug 1979.
[10]
William W. Cohen, Pradeep Ravikumar, and Stephen E. Fienberg. A comparison of string distance metrics for name-matching tasks. In IJCAI Workshop on Information Integration on the Web, pages 73--78, 2003.
[11]
Pedro Domingos. Multi-relational record linkage. In In Proceedings of the KDD-2004 Workshop on Multi-Relational Data Mining, pages 31--48, 2004.
[12]
Xin Dong, Alon Halevy, and Jayant Madhavan. Reference reconciliation in complex information spaces. In SIGMOD, pages 85--96, 2005.
[13]
I. P. Fellegi and A. B. Sunter. A theory for record linkage. In Journal of the American Statistical Society, volume 64, pages 1183--1210, 1969.
[14]
Rahul Gupta and Sunita Sarawagi. Creating probabilistic databases from information extraction models. In VLDB, pages 965--976, 2006.
[15]
Rob Hall, Charles Sutton, and Andrew McCallum. Unsupervised deduplication using cross-field dependencies. In KDD, pages 310--317, 2008.
[16]
Junfeng He, Wei Liu, and Shih-Fu Chang. Scalable similarity search with optimized kernel hashing. In KDD, pages 1129--1138, 2010.
[17]
Mauricio A. Hernández and Salvatore J. Stolfo. Real-world data is dirty: Data cleansing and the merge/purge problem. Data Min. Knowl. Discov., 2:9?37, January 1998.
[18]
Tin Kam Ho. A data complexity analysis of comparative advantages of decision forest constructors. Pattern Anal. Appl., pages 102--112, 2002.
[19]
Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the thirtieth annual ACM symposium on Theory of computing, STOC '98, pages 604--613, 1998.
[20]
Edwin H. Jacox and Hanan Samet. Spatial join techniques. ACM Trans. Database Syst., 32, March 2007.
[21]
Matthew A. Jaro. Advances in Record-Linkage Methodology as Applied to Matching the 1985 Census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414--420, 1989.
[22]
Michael Luby and Avi Wigderson. Pairwise Independence and Derandomization. Now Publishers Inc, 2006.
[23]
Andrew McCallum, Kamal Nigam, and Lyle H. Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. In Knowledge Discovery and Data Mining, pages 169--178, 2000.
[24]
Andrew McCallum and Ben Wellner. Conditional models of identity uncertainty with application to noun coreference. In NIPS, 2004.
[25]
Matthew Michelson and Craig A. Knoblock. Learning blocking schemes for record linkage. In AAAI, pages 440--445, 2006.
[26]
Gonzalo Navarro. A guided tour to approximate string matching. ACM Comput. Surv., 33(1):31--88, 2001.
[27]
H. B. Newcombe, J. M. Kennedy, S. J. Axford, and A. P. andJames. Automatic Linkage of Vital Records. Science, 130:954--959, October 1959.
[28]
H. Pasula, B. Marthi, B. Milch, S. Russell, and I. Shpitser. Identity uncertainty and citation matching. In NIPS, 2002.
[29]
P Christen R Baxter and T Churches. A comparison of fast blocking methods for record linkage. In ACM SIGKDD03 Workshop on Data Cleaning, Record Linkage, and Object Consolidation, 2003.
[30]
Anish Das Sarma, Ankur Jain, Ashwin Machanavajjhala, and Philip Bohannon. An automatic blocking mechanism for large-scale de-duplication tasks. In CIKM, 2012.
[31]
Parag Singla and Pedro Domingos. Entity resolution with markov logic. In icdm, pages 572--582, 2006.
[32]
William Winkler. Approximate string comparator search strategies for very large administrative lists. In Technical Report, Statistical Research Division, U.S. Bureau of the Census, 2005.

Cited By

View all
  • (2024)LRER: A Low-Resource Entity Resolution Framework with Hybrid Information2024 International Joint Conference on Neural Networks (IJCNN)10.1109/IJCNN60899.2024.10651166(1-8)Online publication date: 30-Jun-2024
  • (2024)On tuning parameters guiding similarity computations in a data deduplication pipeline for customers recordsInformation Systems10.1016/j.is.2023.102323121(102323)Online publication date: Mar-2024
  • (2024)Renaissance of Fuzzy and Fast Matching Entity with DSHS AlgorithmSN Computer Science10.1007/s42979-024-03093-95:6Online publication date: 29-Jul-2024
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Other conferences
WWW '13: Proceedings of the 22nd international conference on World Wide Web
May 2013
1628 pages
ISBN:9781450320351
DOI:10.1145/2488388

Sponsors

  • NICBR: Nucleo de Informatcao e Coordenacao do Ponto BR
  • CGIBR: Comite Gestor da Internet no Brazil

In-Cooperation

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 13 May 2013

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. blocking
  2. entity matching
  3. hashing

Qualifiers

  • Research-article

Conference

WWW '13
Sponsor:
  • NICBR
  • CGIBR
WWW '13: 22nd International World Wide Web Conference
May 13 - 17, 2013
Rio de Janeiro, Brazil

Acceptance Rates

WWW '13 Paper Acceptance Rate 125 of 831 submissions, 15%;
Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)6
  • Downloads (Last 6 weeks)0
Reflects downloads up to 13 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2024)LRER: A Low-Resource Entity Resolution Framework with Hybrid Information2024 International Joint Conference on Neural Networks (IJCNN)10.1109/IJCNN60899.2024.10651166(1-8)Online publication date: 30-Jun-2024
  • (2024)On tuning parameters guiding similarity computations in a data deduplication pipeline for customers recordsInformation Systems10.1016/j.is.2023.102323121(102323)Online publication date: Mar-2024
  • (2024)Renaissance of Fuzzy and Fast Matching Entity with DSHS AlgorithmSN Computer Science10.1007/s42979-024-03093-95:6Online publication date: 29-Jul-2024
  • (2023)Domain-Generic Pre-Training for Low-Cost Entity Matching via Domain Alignment and Domain Antagonism2023 International Joint Conference on Neural Networks (IJCNN)10.1109/IJCNN54540.2023.10191296(1-7)Online publication date: 18-Jun-2023
  • (2023)Effective entity matching with transformersThe VLDB Journal10.1007/s00778-023-00779-z32:6(1215-1235)Online publication date: 17-Jan-2023
  • (2022)Can Foundation Models Wrangle Your Data?Proceedings of the VLDB Endowment10.14778/3574245.357425816:4(738-746)Online publication date: 1-Dec-2022
  • (2022)Empowering Transformer with Hybrid Matching Knowledge for Entity MatchingDatabase Systems for Advanced Applications10.1007/978-3-031-00129-1_4(52-67)Online publication date: 11-Apr-2022
  • (2020)Deep entity matching with pre-trained language modelsProceedings of the VLDB Endowment10.14778/3421424.342143114:1(50-60)Online publication date: 1-Sep-2020
  • (2018)Automatic Schema-Independent Linked Data Instance Matching SystemInformation Retrieval and Management10.4018/978-1-5225-5191-1.ch065(1446-1469)Online publication date: 2018
  • (2017)Harnessing Semantic Features for Large-Scale Content-Based Hashtag Recommendations on Microblogging PlatformsInternational Journal on Semantic Web and Information Systems10.4018/IJSWIS.201701010513:1(63-81)Online publication date: Jan-2017
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media