Research article. DOI: 10.1145/1401890.1401931

Unsupervised deduplication using cross-field dependencies

Published: 24 August 2008

Abstract

Recent work in deduplication has shown that collective deduplication of different attribute types can improve performance. But although these techniques cluster the attributes collectively, they do not model them collectively. For example, in citations in the research literature, canonical venue strings and title strings are dependent, because venues tend to focus on a few research areas, but this dependence is not modeled by current unsupervised techniques. We call this dependence between fields in a record a cross-field dependence. In this paper, we present an unsupervised generative model for the deduplication problem that explicitly models cross-field dependence. Our model uses a single set of latent variables to control two disparate clustering models: a Dirichlet-multinomial model over titles, and a non-exchangeable string-edit model over venues. We show that modeling cross-field dependence yields a substantial improvement in performance: a 58% reduction in error over a standard Dirichlet process mixture.
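
The idea of one latent variable governing two field models can be illustrated with a small sketch. This is not the paper's model or inference procedure: it is a toy in which each candidate cluster holds title word counts and a canonical venue string, a record's title is scored with a Dirichlet-multinomial predictive probability, and its venue is scored with a simple edit-distance penalty standing in for a string-edit model. All names (`title_log_pred`, `venue_log_lik`, `assign`), the smoothing value `beta`, the penalty `rate`, and the example clusters are assumptions made for illustration.

```python
import math
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def title_log_pred(words, cluster_counts, vocab_size, beta=0.1):
    """Dirichlet-multinomial predictive log-probability of the title words,
    given the word counts already in the cluster (symmetric prior beta)."""
    counts = Counter(cluster_counts)  # copy, so the cluster is not mutated
    total = sum(counts.values())
    log_p = 0.0
    for w in words:
        log_p += math.log((counts[w] + beta) / (total + beta * vocab_size))
        counts[w] += 1
        total += 1
    return log_p


def venue_log_lik(venue, canonical, rate=1.0):
    """Toy stand-in for a string-edit model: an unnormalised log-score
    that drops by `rate` for every edit away from the canonical venue."""
    return -rate * levenshtein(venue.lower(), canonical.lower())


def joint_score(record, cluster, vocab_size):
    """Both fields are scored under the same cluster, so the cluster
    assignment is the shared latent variable coupling title and venue."""
    return (title_log_pred(record["title"], cluster["title_counts"], vocab_size)
            + venue_log_lik(record["venue"], cluster["canonical_venue"]))


def assign(record, clusters, vocab_size):
    """Pick the cluster with the highest joint score (a greedy choice;
    a full model would sample or marginalise instead)."""
    return max(clusters, key=lambda name: joint_score(record, clusters[name], vocab_size))


# Hypothetical clusters and record, invented for this example.
clusters = {
    "kdd": {"canonical_venue": "KDD",
            "title_counts": Counter({"data": 6, "mining": 5, "clustering": 4})},
    "acl": {"canonical_venue": "ACL",
            "title_counts": Counter({"parsing": 5, "coreference": 4, "translation": 3})},
}
record = {"title": ["data", "mining", "clustering"], "venue": "ADL"}

best = assign(record, clusters, vocab_size=6)
```

In this toy, the noisy venue string "ADL" is one edit from "ACL" and two from "KDD", so the venue field alone would favour the wrong cluster; the title words pull the joint assignment back to the data mining cluster. That is the kind of cross-field evidence the abstract describes, shown here only under the simplifying assumptions stated above.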

Published In

KDD '08: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2008
1116 pages
ISBN:9781605581934
DOI:10.1145/1401890
  • General Chair: Ying Li
  • Program Chairs: Bing Liu, Sunita Sarawagi
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data mining
  2. deduplication
  3. dirichlet process mixture
  4. information extraction

Qualifiers

  • Research-article

Conference

KDD08

Acceptance Rates

KDD '08 Paper Acceptance Rate 118 of 593 submissions, 20%;
Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Article Metrics

  • Downloads (last 12 months): 2
  • Downloads (last 6 weeks): 0

Reflects downloads up to 13 Dec 2024.

Cited By

  • (2024) Better entity matching with transformers through ensembles. Knowledge-Based Systems, 293:C. DOI: 10.1016/j.knosys.2024.111678. Online publication date: 7-Jun-2024
  • (2023) A cross-linguistic entity alignment method based on graph convolutional neural network and graph attention network. Computing, 105:10 (2293-2310). DOI: 10.1007/s00607-023-01178-6. Online publication date: 27-May-2023
  • (2022) Coarse-to-Fine Entity Alignment for Chinese Heterogeneous Encyclopedia Knowledge Base. Future Internet, 14:2 (39). DOI: 10.3390/fi14020039. Online publication date: 25-Jan-2022
  • (2022) SiBERT: A Siamese-based BERT network for Chinese medical entities alignment. Methods, 205 (133-139). DOI: 10.1016/j.ymeth.2022.07.003. Online publication date: Sep-2022
  • (2021) Essentials of data deduplication using open-source toolkit. Data Deduplication Approaches (125-151). DOI: 10.1016/B978-0-12-823395-5.00017-3. Online publication date: 2021
  • (2020) A node resistance-based probability model for resolving duplicate named entities. Scientometrics. DOI: 10.1007/s11192-020-03585-4. Online publication date: 13-Jul-2020
  • (2018) An Effective Standardization Method for the Lab Indicators in Regional Medical Health Platform Using N-grams and Stacking. 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) (1602-1609). DOI: 10.1109/BIBM.2018.8621274. Online publication date: Dec-2018
  • (2017) ScLink. Journal of Intelligent Information Systems, 48:3 (519-551). DOI: 10.1007/s10844-016-0426-3. Online publication date: 1-Jun-2017
  • (2015) Holistic entity matching across knowledge graphs. Proceedings of the 2015 IEEE International Conference on Big Data (Big Data) (1585-1590). DOI: 10.1109/BigData.2015.7363924. Online publication date: 29-Oct-2015
  • (2014) A case study for understanding the nature of redundant entities in bibliographic digital libraries. Program, 48:3 (246-271). DOI: 10.1108/PROG-07-2012-0037. Online publication date: Jul-2014
