poster

Probabilistic correlation-based similarity measure of unstructured records

Authors:

Shaoxu Song,

Lei Chen

CIKM '07: Proceedings of the sixteenth ACM conference on Conference on information and knowledge management

Pages 967 - 970

https://doi.org/10.1145/1321440.1321587

Published: 06 November 2007


Abstract

Computing the similarity between unstructured records is a fundamental function in many applications. Approximate string matching and full-text retrieval techniques do not perform at their best when applied directly, because unstructured records are short and therefore carry limited information. In this paper, we propose a novel probabilistic correlation-based similarity measure. Rather than relying only on exact token matches between two records, our similarity evaluation enriches the information in the records by considering the correlations between tokens. We define the probabilistic correlation between tokens as the probability that these tokens appear in the same records. We then compute the weights of tokens and discover the correlations between records based on the probabilistic correlations of their tokens. Finally, we present extensive experimental results to demonstrate the effectiveness of our approach.
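The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract gives no formulas, so everything beyond the definition of token correlation is an assumption made for illustration. The correlation of two tokens is estimated as the fraction of records that contain both. The enrichment step (each vocabulary term takes its maximum correlation with any token in the record) and the cosine comparison are hypothetical choices, as are all function names. Token weighting, which the abstract also mentions, is left out.

```python
import math
from collections import Counter
from itertools import combinations


def tokenize(record):
    """Split a record into a set of lowercase whitespace-delimited tokens."""
    return set(record.lower().split())


def cooccurrence_probabilities(records):
    """Estimate P(a and b appear in the same record) for every token pair."""
    n = len(records)
    pair_counts = Counter()
    for record in records:
        # Sorting makes each pair a canonical (smaller, larger) key.
        for pair in combinations(sorted(tokenize(record)), 2):
            pair_counts[pair] += 1
    return {pair: count / n for pair, count in pair_counts.items()}


def correlation(probs, a, b):
    """Look up the estimated correlation; a token is fully correlated with itself."""
    if a == b:
        return 1.0
    return probs.get((a, b) if a < b else (b, a), 0.0)


def enriched_vector(record, probs, vocabulary):
    """Assumed enrichment: score each term by its best correlation with the record."""
    tokens = tokenize(record)
    return {
        term: max((correlation(probs, term, tok) for tok in tokens), default=0.0)
        for term in vocabulary
    }


def correlation_similarity(r1, r2, probs, vocabulary):
    """Cosine similarity between the enriched vectors of two records."""
    v1 = enriched_vector(r1, probs, vocabulary)
    v2 = enriched_vector(r2, probs, vocabulary)
    dot = sum(v1[t] * v2[t] for t in vocabulary)
    norm = math.sqrt(sum(x * x for x in v1.values())) * math.sqrt(
        sum(x * x for x in v2.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical toy corpus of short, unstructured records.
records = ["sony vaio laptop", "sony vaio notebook", "dell laptop", "apple ipod"]
probs = cooccurrence_probabilities(records)
vocabulary = sorted(set().union(*(tokenize(r) for r in records)))

# These two records share no token, yet their enriched vectors overlap.
print(correlation_similarity("vaio laptop", "sony notebook", probs, vocabulary))
```

In this toy corpus, "vaio laptop" and "sony notebook" have no token in common, so exact matching would score them zero. Because "sony" and "vaio" co-occur in half of the records, the enriched vectors overlap and the similarity is positive, while "vaio laptop" and "apple ipod" remain unrelated.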


Cited By


X. Li, J. Talburt, T. Li, and X. Liu. When Entity Resolution Meets Deep Learning, Is Similarity Measure Necessary? In Advances in Artificial Intelligence and Applied Cognitive Computing, pages 127--140, 2021. Online publication date: 15-Oct-2021. https://doi.org/10.1007/978-3-030-70296-0_10

F. Gao, S. Song, L. Chen, and J. Wang. Efficient Set-Correlation Operator Inside Databases. Journal of Computer Science and Technology, 31(4):683--701, 2016. Online publication date: 8-Jul-2016. https://doi.org/10.1007/s11390-016-1657-z

Index Terms

Probabilistic correlation-based similarity measure of unstructured records
1. Information systems
  1. Information systems applications


Published In

CIKM '07: Proceedings of the sixteenth ACM conference on Conference on information and knowledge management

November 2007

1048 pages

ISBN: 9781595938039

DOI: 10.1145/1321440

Co-chair: Alberto H. F. Laender
Conference Chairs: André O. Falcão (Universidade de Lisboa, Portugal), Øystein Haug Olsen
General Chair: Mário J. Silva (Universidade de Lisboa, Portugal)
Program Chairs: Ricardo Baeza-Yates, Deborah L. McGuinness, Bjorn Olstad

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org

Publisher

Association for Computing Machinery

New York, NY, United States



Qualifiers

Poster

Conference

CIKM07


CIKM07: Conference on Information and Knowledge Management

November 6 - 10, 2007

Lisbon, Portugal

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%


Bibliometrics

Total Citations: 2
Total Downloads: 339
Downloads (Last 12 months): 2
Downloads (Last 6 weeks): 1

Reflects downloads up to 12 Dec 2024
