
Conditional functional dependencies for capturing data inconsistencies

Published: 24 June 2008

Abstract

We propose a class of integrity constraints for relational databases, referred to as conditional functional dependencies (CFDs), and study their applications in data cleaning. In contrast to traditional functional dependencies (FDs) that were developed mainly for schema design, CFDs aim at capturing the consistency of data by enforcing bindings of semantically related values. For static analysis of CFDs we investigate the consistency problem, which is to determine whether or not there exists a nonempty database satisfying a given set of CFDs, and the implication problem, which is to decide whether or not a set of CFDs entails another CFD. We show that while any set of traditional FDs is trivially consistent, the consistency problem is NP-complete for CFDs, but it is in PTIME when either the database schema is predefined or no attributes involved in the CFDs have a finite domain. For the implication analysis of CFDs, we provide an inference system analogous to Armstrong's axioms for FDs, and show that the implication problem is coNP-complete for CFDs in contrast to the linear-time complexity for their traditional counterpart. We also present an algorithm for computing a minimal cover of a set of CFDs. Since CFDs allow data bindings, in some cases CFDs may be physically large, complicating the detection of constraint violations. We develop techniques for detecting CFD violations in SQL as well as novel techniques for checking multiple constraints by a single query. We also provide incremental methods for checking CFDs in response to changes to the database. We experimentally verify the effectiveness of our CFD-based methods for inconsistency detection. This work not only yields a constraint theory for CFDs but is also a step toward a practical constraint-based method for improving data quality.
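
To make the idea of detecting CFD violations in SQL concrete, here is a minimal sketch. The abstract does not spell out the syntax of a CFD, so the sketch assumes a CFD is an ordinary FD paired with a pattern of constants and `_` wildcards that restricts which tuples the FD applies to. The `cust` table, its columns, the sample rows, and the helper `cfd_violations` are all hypothetical and chosen only for illustration. It checks one pattern at a time and does not attempt the paper's technique of checking multiple constraints with a single query.

```python
import sqlite3

WILDCARD = "_"  # assumed notation for "any value" in a pattern


def cfd_violations(conn, table, lhs, rhs, pattern):
    """Return (constant_violations, variable_violations) for one CFD pattern.

    lhs, rhs: lists of column names; pattern: dict column -> constant or WILDCARD.
    Identifiers are interpolated into the SQL text, so they must be trusted.
    """
    # A tuple is in scope only if it matches every LHS constant of the pattern.
    lhs_consts = [(a, pattern[a]) for a in lhs if pattern[a] != WILDCARD]
    scope = " AND ".join(f"{a} = ?" for a, _ in lhs_consts) or "1 = 1"
    scope_args = [v for _, v in lhs_consts]

    # Single-tuple violations: an in-scope tuple disagrees with an RHS constant.
    rhs_consts = [(a, pattern[a]) for a in rhs if pattern[a] != WILDCARD]
    const_rows = []
    if rhs_consts:
        mismatch = " OR ".join(f"{a} <> ?" for a, _ in rhs_consts)
        sql = f"SELECT * FROM {table} WHERE {scope} AND ({mismatch})"
        args = scope_args + [v for _, v in rhs_consts]
        const_rows = conn.execute(sql, args).fetchall()

    # Multi-tuple violations: in-scope tuples agree on the LHS but not the RHS.
    lhs_cols = ", ".join(lhs)
    differs = " OR ".join(f"COUNT(DISTINCT {a}) > 1" for a in rhs)
    sql = (f"SELECT {lhs_cols} FROM {table} WHERE {scope} "
           f"GROUP BY {lhs_cols} HAVING {differs}")
    var_rows = conn.execute(sql, scope_args).fetchall()
    return const_rows, var_rows


# Hypothetical customer data: country code, area code, zip, street, city.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cust (cc TEXT, ac TEXT, zip TEXT, street TEXT, city TEXT)")
conn.executemany("INSERT INTO cust VALUES (?, ?, ?, ?, ?)", [
    ("44", "131", "EH4 1DT", "Mayfield", "EDI"),
    ("44", "131", "EH4 1DT", "Crichton", "EDI"),  # same zip, different street
    ("01", "908", "07974", "Mtn Ave", "NYC"),     # conflicts with the city constant
    ("01", "908", "07974", "Main St", "MH"),
])

# Assumed rule 1: when cc = 44, zip determines street.
print(cfd_violations(conn, "cust", ["cc", "zip"], ["street"],
                     {"cc": "44", "zip": WILDCARD, "street": WILDCARD}))

# Assumed rule 2: when cc = 01 and ac = 908, city must be MH.
print(cfd_violations(conn, "cust", ["cc", "ac"], ["city"],
                     {"cc": "01", "ac": "908", "city": "MH"}))
```

The two queries mirror the two ways a tuple set can break such a rule under this assumed reading: a single tuple can contradict a constant in the pattern, and a group of tuples can agree on the left-hand side while disagreeing on the right-hand side. An ordinary FD is the special case where every pattern entry is a wildcard, in which only the second query is needed.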



    Published In

    ACM Transactions on Database Systems  Volume 33, Issue 2
    June 2008
    309 pages
    ISSN:0362-5915
    EISSN:1557-4644
    DOI:10.1145/1366102
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 24 June 2008
    Accepted: 01 December 2007
    Revised: 01 September 2007
    Received: 01 February 2007
    Published in TODS Volume 33, Issue 2

    Author Tags

    1. Data cleaning
    2. SQL
    3. functional dependency

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months): 85
    • Downloads (Last 6 weeks): 12
    Reflects downloads up to 03 Jan 2025

    Citations

    Cited By

    • (2024) Rock: Cleaning Data with both ML and Logic Rules. Proceedings of the VLDB Endowment 17:12 (4373-4376). DOI: 10.14778/3685800.3685878. Online publication date: 8-Nov-2024
    • (2024) Enriching Relations with Additional Attributes for ER. Proceedings of the VLDB Endowment 17:11 (3109-3123). DOI: 10.14778/3681954.3681987. Online publication date: 30-Aug-2024
    • (2024) Automatic Data Repair: Are We Ready to Deploy? Proceedings of the VLDB Endowment 17:10 (2617-2630). DOI: 10.14778/3675034.3675051. Online publication date: 1-Jun-2024
    • (2024) Mixed Covers of Keys and Functional Dependencies for Maintaining the Integrity of Data under Updates. Proceedings of the VLDB Endowment 17:7 (1578-1590). DOI: 10.14778/3654621.3654626. Online publication date: 1-Mar-2024
    • (2024) GIDCL: A Graph-Enhanced Interpretable Data Cleaning Framework with Large Language Models. Proceedings of the ACM on Management of Data 2:6 (1-29). DOI: 10.1145/3698811. Online publication date: 20-Dec-2024
    • (2024) Discovering Top-k Relevant and Diversified Rules. Proceedings of the ACM on Management of Data 2:4 (1-28). DOI: 10.1145/3677131. Online publication date: 30-Sep-2024
    • (2024) IterClean: An Iterative Data Cleaning Framework with Large Language Models. Proceedings of the ACM Turing Award Celebration Conference - China 2024 (100-105). DOI: 10.1145/3674399.3674436. Online publication date: 5-Jul-2024
    • (2024) BUNNI: Learning Repair Actions in Rule-driven Data Cleaning. Journal of Data and Information Quality 16:2 (1-31). DOI: 10.1145/3665930. Online publication date: 25-May-2024
    • (2024) Rock: Cleaning Data by Embedding ML in Logic Rules. Companion of the 2024 International Conference on Management of Data (106-119). DOI: 10.1145/3626246.3653372. Online publication date: 9-Jun-2024
    • (2024) EvoFuzzer: An Evolutionary Fuzzer for Detecting Reentrancy Vulnerability in Smart Contracts. IEEE Transactions on Network Science and Engineering 11:6 (5790-5802). DOI: 10.1109/TNSE.2024.3447025. Online publication date: Nov-2024
