Dynamic constraints for record matching

Wenfei Fan^1,3,
Hong Gao³,
Xibei Jia²,
Jianzhong Li³ &
…
Shuai Ma²

353 Accesses
69 Citations
Explore all metrics

Abstract

This paper investigates constraints for matching records from unreliable data sources. (a) We introduce a class of matching dependencies (mds) for specifying the semantics of unreliable data. As opposed to static constraints for schema design, mds are developed for record matching, and are defined in terms of similarity predicates and a dynamic semantics. (b) We identify a special case of mds, referred to as relative candidate keys (rcks), to determine what attributes to compare and how to compare them when matching records across possibly different relations. (c) We propose a mechanism for inferring mds, a departure from traditional implication analysis, such that when we cannot match records by comparing attributes that contain errors, we may still find matches by using other, more reliable attributes. Moreover, we develop a sound and complete system for inferring mds. (d) We provide a quadratic-time algorithm for inferring mds and an effective algorithm for deducing a set of high-quality rcks from mds. (e) We experimentally verify that the algorithms help matching tools efficiently identify keys at compile time for matching, blocking or windowing and in addition, that the md-based techniques effectively improve the quality and efficiency of various record matching methods.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic

£29.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price includes VAT (United Kingdom)

Instant access to the full article PDF.

Institutional subscriptions

References

Abiteboul S., Hull R., Vianu V.: Foundations of Databases. Addison-Wesley, Boston (1995)
MATH Google Scholar
Ananthakrishna, R., Chaudhuri, S., Ganti, V.: Eliminating fuzzy duplicates in data warehouses. In: VLDB (2002)
Sarma, J.W.A.D., Ullman, J.: Schema design for uncertain databases. In: Proceedings of the 3rd Alberto Mendelzon Workshop on Foundations of Data Management (2009)
Arasu, A., Chaudhuri, S., Kaushik, R.: Transformation-based framework for record matching. In: ICDE (2008)
Arasu, A., Re, C., Suciu, D.: Large-scale deduplication with constraints using Dedupalog. In: ICDE (2009)
Aumueller, D., Do, H.H., Massmann, S., Rahm, E.: Schema and ontology matching with COMA++. In: SIGMOD (2005)
Batini C., Scannapieco M.: Data Quality: Concepts, Methodologies and Techniques. Springer, Berlin (2006)
MATH Google Scholar
Beeri C., Bernstein P.A.: Computational problems related to the design of normal form relational schemas. ACM Trans. Database Syst. 4(1), 30–59 (1979)
Article Google Scholar
Belohlávek, R., Vychodil, V.: Data tables with similarity relations: functional dependencies, complete rules and non-redundant bases. In: DASFAA, (2006)
Cautis, B., Abiteboul, S., Milo, T.: Reasoning about XML update constraints. In: PODS, (2007)
Chaudhuri, S., Chen, B.-C., Ganti, V., Kaushik, R.: Example-driven design of efficient record matching queries. In: VLDB (2007)
Chaudhuri, S., Sarma, A.D., Ganti, V., Kaushik, R.: Leveraging aggregate constraints for deduplication. In: SIGMOD (2007)
Cohen, W.W., Richman, J.: Learning to match and cluster large high-dimensional data sets for data integration. In: KDD (2002)
Dhamankar, R., Lee, Y., Doan, A., Halevy, A.Y., Domingos, P.: iMAP: discovering complex mappings between database schemas. In: SIGMOD (2004)
Elmagarmid A.K., Ipeirotis P.G., Verykios V.S.: Duplicate record detection: a survey. IEEE Trans. Knowl. Data Eng. 19(1), 1–16 (2007)
Article Google Scholar
Fan, W.: Dependencies revisited for improving data quality. In: PODS (2008)
Fan, W., Geerts, F., Li, J., Xiong, M.: Discovering conditional functional dependencies. IEEE Trans. Knowl. Data Eng. (2010)
Fan, W., Jia, X., Li, J., Ma, S.: Reasoning about record matching rules. In: VLDB (2009)
Fellegi Ivan, Holt D.: A systematic approach to automatic edit and imputation. J. Am. Stat. Assoc. 71(353), 17–35 (1976)
Article Google Scholar
Fellegi I., Sunter A.B.: A theory for record linkage. J. Am. Stat. Assoc. 64(328), 1183–1210 (1969)
Article Google Scholar
Galhardas, H., Florescu, D., Shasha, D., Simon, E., Saita, C.: Declarative data cleaning: language, model and algorithms. In: VLDB (2001)
Guha, S., Koudas, N., Marathe, A., Srivastava, D.: Merging the results of approximate match operations. In: VLDB (2004)
Haas, L., Hernández, M., Ho, H., Popa, L., Roth, Mary: Clio grows up: from research prototype to industrial tool. In: SIGMOD (2005)
Hernndez, M.A., Stolfo, S.J.: The merge/purge problem for large databases. In: SIGMOD (1995)
Hernndez M.A., Stolfo S.J.: Real-world data is dirty: data cleansing and the merge/purge problem. Data Min. Knowl. Discov. 2(1), 9–37 (1998)
Article Google Scholar
Huhtala Y., Kärkkäinen J., Porkka P., Toivonen H.: TANE: an efficient algorithm for discovering functional and approximate dependencies. Comput. J. 42(2), 100–111 (1999)
Article MATH Google Scholar
http://www.sas.com/industry/fsi/fraud/
http://userweb.cs.utexas.edu/users/ml/riddle/data.html
Jaro M.A.: Advances in record-linkage methodology as applied to matching the 1985 census of Tampa Florida. J. Am. Stat. Assoc. 89, 414–420 (1989)
Article Google Scholar
Koudas, N., Saha, A., Srivastava, D., Venkatasubramanian, S.: Metric functional dependencies. In: ICDE (2009)
Lim E.-P., Srivastava J., Prabhakar S., Richardson J.: Entity identification in database integration. Inf. Sci. 89(1–2), 1–38 (1996)
Article Google Scholar
Loshin D.: Master Data Management. Knowledge Integrity, Inc., New York (2009)
Google Scholar
Lucchesi C.L., Osborn S.L.: Candidate keys for relations. J. Comput. Syst. Sci. 17(2), 270–279 (1978)
Article MathSciNet MATH Google Scholar
Maier D.: The Theory of Relational Databases. Computer Science Press, Rockville (1983)
MATH Google Scholar
Rahm, E., Bernstein, P.A.: A survey of approaches to automatic schema matching. VLDB J. (2001)
Sarawagi, S., Bhamidipaty, A.: Interactive deduplication using active learning. In: KDD (2002)
Sauter G., Mathews B., Ostic E.: Information Service Patterns, Part 3: Data Cleansing Pattern. IBM, USA (2007)
Google Scholar
Shen, W., Li, X., Doan, A.: Constraint-based entity matching. In AAAI (2005)
Singla, P., Domingos, P.: Object identification with attribute-mediated dependences. In: PKDD (2005)
Sismanis, Y., Brown, P., Haas, P.J., Reinwald, B.: GORDIAN: efficient and scalable discovery of composite keys. In: VLDB (2006)
Soundex: http://en.wikipedia.org/wiki/Soundex
Verykios V.S., Elmagarmid A.K., Houstis E.: Automating the approximate record-matching process. Inf. Sci. 126(1–4), 83–89 (2002)
Google Scholar
Vianu V.: Dynamic functional dependencies and database aging. J. ACM 34(1), 28–59 (1987)
Article MathSciNet Google Scholar
Weis, M., Naumann, F.: DogmatiX tracks down duplicates in XML. In: SIGMOD (2005)
Weis, M., Naumann, F., Jehle, U., Lufter, J., Schuster, H.: Industry-scale duplicate detection. In: VLDB (2008)
Winkler, W.E.: Methods for record linkage and bayesian networks. Technical report RRS2002/05, U.S. Census Bureau (2002)
Winkler W.E.: Methods for evaluating and creating data quality. Inf. Syst. 29(7), 531–550 (2004)
Article Google Scholar
Yancey, W.: BigMatch: A program for extracting probable matches from a large file. Technical report computing 2007/01, U.S. Census Bureau (2007)

Download references

Author information

Authors and Affiliations

University of Edinburgh, Edinburgh, UK
Wenfei Fan
School of Informatics, University of Edinburgh, Edinburgh, UK
Xibei Jia & Shuai Ma
Harbin Institute of Technology, Harbin, Heilongjiang, China
Wenfei Fan, Hong Gao & Jianzhong Li

Authors

Wenfei Fan
View author publications
You can also search for this author in PubMed Google Scholar
Hong Gao
View author publications
You can also search for this author in PubMed Google Scholar
Xibei Jia
View author publications
You can also search for this author in PubMed Google Scholar
Jianzhong Li
View author publications
You can also search for this author in PubMed Google Scholar
Shuai Ma
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Jianzhong Li.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Fan, W., Gao, H., Jia, X. et al. Dynamic constraints for record matching. The VLDB Journal 20, 495–520 (2011). https://doi.org/10.1007/s00778-010-0206-6

Download citation

Received: 09 January 2010
Revised: 21 August 2010
Accepted: 07 October 2010
Published: 16 November 2010
Issue Date: August 2011
DOI: https://doi.org/10.1007/s00778-010-0206-6

Dynamic constraints for record matching

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Contextual Keys

A systematic approach to reliability assessment in integrated databases

Embedded Cardinality Constraints

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Keywords

Subscribe and save

Buy Now

Navigation

Dynamic constraints for record matching

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Contextual Keys

A systematic approach to reliability assessment in integrated databases

Embedded Cardinality Constraints

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Subscribe and save

Buy Now

Search

Navigation