
A Distance-Based Approach to Entity Reconciliation in Heterogeneous Databases

Published: 01 May 2002


In modern organizations, decision makers must often be able to quickly access information from diverse sources in order to make timely decisions. A critical problem facing many such organizations is the inability to easily reconcile the information contained in heterogeneous data sources. To overcome this limitation, an organization must resolve several types of heterogeneity problems that may exist across different sources. In this paper, we examine one such problem called the entity heterogeneity problem, which arises when the same real-world entity type is represented using different identifiers in different applications. A decision-theoretic model to resolve the problem is proposed. Our model uses a distance measure to express the similarity between two entity instances. We have implemented the model and tested it on real-world data. The results indicate that the model performs quite well in terms of its ability to predict whether two entity instances should be matched or not. The model is shown to be computationally efficient. It also scales well to large relations from the perspective of the accuracy of prediction. Overall, the test results imply that this is certainly a viable approach in practical situations.
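As a rough illustration of the idea that a distance measure can express how similar two entity instances are, the following minimal Python sketch computes a weighted distance between two records and matches them when the distance falls below a threshold. Everything here is an assumption made for illustration and is not taken from the paper: the attribute names, the weights, the threshold of 0.25, and the per-attribute distance (string similarity from the standard library's `difflib.SequenceMatcher`, and a relative difference for numbers). The paper's own decision-theoretic model is not reproduced here.

```python
from difflib import SequenceMatcher


def attribute_distance(a, b):
    """Distance in [0, 1] between two attribute values (illustrative choice)."""
    if a is None or b is None:
        # Assumption: a missing value counts as maximally distant.
        return 1.0
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        # Relative difference for numbers, capped at 1.
        scale = max(abs(a), abs(b))
        return 0.0 if scale == 0 else min(1.0, abs(a - b) / scale)
    # Case-insensitive string similarity turned into a distance.
    ratio = SequenceMatcher(None, str(a).lower(), str(b).lower()).ratio()
    return 1.0 - ratio


def record_distance(r1, r2, weights):
    """Weighted average of per-attribute distances; weights are hypothetical."""
    total = sum(weights.values())
    weighted = sum(
        w * attribute_distance(r1.get(name), r2.get(name))
        for name, w in weights.items()
    )
    return weighted / total


def is_match(r1, r2, weights, threshold=0.25):
    """Declare a match when the records are close enough (threshold is assumed)."""
    return record_distance(r1, r2, weights) <= threshold


if __name__ == "__main__":
    # Hypothetical records from two sources that use different identifiers.
    weights = {"name": 0.7, "city": 0.3}
    customer = {"id": "C-1007", "name": "John Smith", "city": "Boston"}
    client = {"id": 88231, "name": "Jon Smith", "city": "Boston"}
    print(record_distance(customer, client, weights))
    print(is_match(customer, client, weights))
```

The identifiers are deliberately left out of the weights, since in the entity heterogeneity setting they differ across sources and cannot be compared directly. In practice the weights and threshold would have to be chosen from data or domain knowledge rather than fixed by hand.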




Published In

IEEE Transactions on Knowledge and Data Engineering, Volume 14, Issue 3
May 2002
206 pages


IEEE Educational Activities Department

United States


Author Tags

  1. Heterogeneous databases
  2. distance measure
  3. entity heterogeneity
  4. entity matching


  • Research-article


Cited By

  • (2024) LSH SimilarityJoin Pattern in FastFlow. International Journal of Parallel Programming 52:3 (207-230). DOI: 10.1007/s10766-024-00772-1. Online publication date: 1-Jun-2024
  • (2022) A Hybrid Approach to Discover Entity Synonyms. International Journal of Information Retrieval Research 12:3 (1-18). DOI: 10.4018/IJIRR.300296. Online publication date: 26-Aug-2022
  • (2022) Towards a Scalable Set Similarity Join Using MapReduce and LSH. Computational Science – ICCS 2022 (569-583). DOI: 10.1007/978-3-031-08751-6_41. Online publication date: 21-Jun-2022
  • (2018) Set similarity joins on mapreduce. Proceedings of the VLDB Endowment 11:10 (1110-1122). DOI: 10.14778/3231751.3231760. Online publication date: 1-Jun-2018
  • (2015) Efficient identity matching using static pruning q-gram indexing approach. Decision Support Systems 73:C (97-108). DOI: 10.1016/j.dss.2015.02.015. Online publication date: 1-May-2015
  • (2014) State-of-the-art in string similarity search and join. ACM SIGMOD Record 43:1 (64-76). DOI: 10.1145/2627692.2627706. Online publication date: 13-May-2014
  • (2011) Ontology and instance matching. Knowledge-driven multimedia information extraction and ontology evolution (167-195). DOI: 10.5555/2001069.2001076. Online publication date: 1-Jan-2011
  • (2011) Identity matching using personal and social identity features. Information Systems Frontiers 13:1 (101-113). DOI: 10.1007/s10796-010-9270-0. Online publication date: 1-Mar-2011
  • (2010) Preserving privacy whilst integrating data: Applied to criminal justice. Information Polity 15:1,2 (125-138). DOI: 10.5555/1858974.1858983. Online publication date: 1-Apr-2010
  • (2010) A Survey on Uncertainty Management in Data Integration. Journal of Data and Information Quality 2:1 (1-33). DOI: 10.1145/1805286.1805291. Online publication date: 1-Jul-2010
