[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ Skip to main content
Log in

Dynamic constraints for record matching

  • Regular Paper
  • Published:
The VLDB Journal Aims and scope Submit manuscript

Abstract

This paper investigates constraints for matching records from unreliable data sources. (a) We introduce a class of matching dependencies (mds) for specifying the semantics of unreliable data. As opposed to static constraints for schema design, mds are developed for record matching, and are defined in terms of similarity predicates and a dynamic semantics. (b) We identify a special case of mds, referred to as relative candidate keys (rcks), to determine what attributes to compare and how to compare them when matching records across possibly different relations. (c) We propose a mechanism for inferring mds, a departure from traditional implication analysis, such that when we cannot match records by comparing attributes that contain errors, we may still find matches by using other, more reliable attributes. Moreover, we develop a sound and complete system for inferring mds. (d) We provide a quadratic-time algorithm for inferring mds and an effective algorithm for deducing a set of high-quality rcks from mds. (e) We experimentally verify that the algorithms help matching tools efficiently identify keys at compile time for matching, blocking or windowing and in addition, that the md-based techniques effectively improve the quality and efficiency of various record matching methods.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic
£29.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Price includes VAT (United Kingdom)

Instant access to the full article PDF.

Similar content being viewed by others

References

  1. Abiteboul S., Hull R., Vianu V.: Foundations of Databases. Addison-Wesley, Boston (1995)

    MATH  Google Scholar 

  2. Ananthakrishna, R., Chaudhuri, S., Ganti, V.: Eliminating fuzzy duplicates in data warehouses. In: VLDB (2002)

  3. Sarma, J.W.A.D., Ullman, J.: Schema design for uncertain databases. In: Proceedings of the 3rd Alberto Mendelzon Workshop on Foundations of Data Management (2009)

  4. Arasu, A., Chaudhuri, S., Kaushik, R.: Transformation-based framework for record matching. In: ICDE (2008)

  5. Arasu, A., Re, C., Suciu, D.: Large-scale deduplication with constraints using Dedupalog. In: ICDE (2009)

  6. Aumueller, D., Do, H.H., Massmann, S., Rahm, E.: Schema and ontology matching with COMA++. In: SIGMOD (2005)

  7. Batini C., Scannapieco M.: Data Quality: Concepts, Methodologies and Techniques. Springer, Berlin (2006)

    MATH  Google Scholar 

  8. Beeri C., Bernstein P.A.: Computational problems related to the design of normal form relational schemas. ACM Trans. Database Syst. 4(1), 30–59 (1979)

    Article  Google Scholar 

  9. Belohlávek, R., Vychodil, V.: Data tables with similarity relations: functional dependencies, complete rules and non-redundant bases. In: DASFAA, (2006)

  10. Cautis, B., Abiteboul, S., Milo, T.: Reasoning about XML update constraints. In: PODS, (2007)

  11. Chaudhuri, S., Chen, B.-C., Ganti, V., Kaushik, R.: Example-driven design of efficient record matching queries. In: VLDB (2007)

  12. Chaudhuri, S., Sarma, A.D., Ganti, V., Kaushik, R.: Leveraging aggregate constraints for deduplication. In: SIGMOD (2007)

  13. Cohen, W.W., Richman, J.: Learning to match and cluster large high-dimensional data sets for data integration. In: KDD (2002)

  14. Dhamankar, R., Lee, Y., Doan, A., Halevy, A.Y., Domingos, P.: iMAP: discovering complex mappings between database schemas. In: SIGMOD (2004)

  15. Elmagarmid A.K., Ipeirotis P.G., Verykios V.S.: Duplicate record detection: a survey. IEEE Trans. Knowl. Data Eng. 19(1), 1–16 (2007)

    Article  Google Scholar 

  16. Fan, W.: Dependencies revisited for improving data quality. In: PODS (2008)

  17. Fan, W., Geerts, F., Li, J., Xiong, M.: Discovering conditional functional dependencies. IEEE Trans. Knowl. Data Eng. (2010)

  18. Fan, W., Jia, X., Li, J., Ma, S.: Reasoning about record matching rules. In: VLDB (2009)

  19. Fellegi Ivan, Holt D.: A systematic approach to automatic edit and imputation. J. Am. Stat. Assoc. 71(353), 17–35 (1976)

    Article  Google Scholar 

  20. Fellegi I., Sunter A.B.: A theory for record linkage. J. Am. Stat. Assoc. 64(328), 1183–1210 (1969)

    Article  Google Scholar 

  21. Galhardas, H., Florescu, D., Shasha, D., Simon, E., Saita, C.: Declarative data cleaning: language, model and algorithms. In: VLDB (2001)

  22. Guha, S., Koudas, N., Marathe, A., Srivastava, D.: Merging the results of approximate match operations. In: VLDB (2004)

  23. Haas, L., Hernández, M., Ho, H., Popa, L., Roth, Mary: Clio grows up: from research prototype to industrial tool. In: SIGMOD (2005)

  24. Hernndez, M.A., Stolfo, S.J.: The merge/purge problem for large databases. In: SIGMOD (1995)

  25. Hernndez M.A., Stolfo S.J.: Real-world data is dirty: data cleansing and the merge/purge problem. Data Min. Knowl. Discov. 2(1), 9–37 (1998)

    Article  Google Scholar 

  26. Huhtala Y., Kärkkäinen J., Porkka P., Toivonen H.: TANE: an efficient algorithm for discovering functional and approximate dependencies. Comput. J. 42(2), 100–111 (1999)

    Article  MATH  Google Scholar 

  27. http://www.sas.com/industry/fsi/fraud/

  28. http://userweb.cs.utexas.edu/users/ml/riddle/data.html

  29. Jaro M.A.: Advances in record-linkage methodology as applied to matching the 1985 census of Tampa Florida. J. Am. Stat. Assoc. 89, 414–420 (1989)

    Article  Google Scholar 

  30. Koudas, N., Saha, A., Srivastava, D., Venkatasubramanian, S.: Metric functional dependencies. In: ICDE (2009)

  31. Lim E.-P., Srivastava J., Prabhakar S., Richardson J.: Entity identification in database integration. Inf. Sci. 89(1–2), 1–38 (1996)

    Article  Google Scholar 

  32. Loshin D.: Master Data Management. Knowledge Integrity, Inc., New York (2009)

    Google Scholar 

  33. Lucchesi C.L., Osborn S.L.: Candidate keys for relations. J. Comput. Syst. Sci. 17(2), 270–279 (1978)

    Article  MathSciNet  MATH  Google Scholar 

  34. Maier D.: The Theory of Relational Databases. Computer Science Press, Rockville (1983)

    MATH  Google Scholar 

  35. Rahm, E., Bernstein, P.A.: A survey of approaches to automatic schema matching. VLDB J. (2001)

  36. Sarawagi, S., Bhamidipaty, A.: Interactive deduplication using active learning. In: KDD (2002)

  37. Sauter G., Mathews B., Ostic E.: Information Service Patterns, Part 3: Data Cleansing Pattern. IBM, USA (2007)

    Google Scholar 

  38. Shen, W., Li, X., Doan, A.: Constraint-based entity matching. In AAAI (2005)

  39. Singla, P., Domingos, P.: Object identification with attribute-mediated dependences. In: PKDD (2005)

  40. Sismanis, Y., Brown, P., Haas, P.J., Reinwald, B.: GORDIAN: efficient and scalable discovery of composite keys. In: VLDB (2006)

  41. Soundex: http://en.wikipedia.org/wiki/Soundex

  42. Verykios V.S., Elmagarmid A.K., Houstis E.: Automating the approximate record-matching process. Inf. Sci. 126(1–4), 83–89 (2002)

    Google Scholar 

  43. Vianu V.: Dynamic functional dependencies and database aging. J. ACM 34(1), 28–59 (1987)

    Article  MathSciNet  Google Scholar 

  44. Weis, M., Naumann, F.: DogmatiX tracks down duplicates in XML. In: SIGMOD (2005)

  45. Weis, M., Naumann, F., Jehle, U., Lufter, J., Schuster, H.: Industry-scale duplicate detection. In: VLDB (2008)

  46. Winkler, W.E.: Methods for record linkage and bayesian networks. Technical report RRS2002/05, U.S. Census Bureau (2002)

  47. Winkler W.E.: Methods for evaluating and creating data quality. Inf. Syst. 29(7), 531–550 (2004)

    Article  Google Scholar 

  48. Yancey, W.: BigMatch: A program for extracting probable matches from a large file. Technical report computing 2007/01, U.S. Census Bureau (2007)

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Jianzhong Li.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Fan, W., Gao, H., Jia, X. et al. Dynamic constraints for record matching. The VLDB Journal 20, 495–520 (2011). https://doi.org/10.1007/s00778-010-0206-6

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s00778-010-0206-6

Keywords

Navigation