research-article

Discovering data quality rules

Published: 01 August 2008

Abstract

Dirty data is a serious problem for businesses, leading to incorrect decision making, inefficient daily operations, and ultimately wasted time and money. Dirty data often arises when domain constraints and business rules, meant to preserve data consistency and accuracy, are enforced incompletely or not at all in application code.
In this work, we propose a new data-driven tool that can be used within an organization's data quality management process to suggest possible rules, and to identify conformant and non-conformant records. Data quality rules are known to be contextual, so we focus on the discovery of context-dependent rules. Specifically, we search for conditional functional dependencies (CFDs), that is, functional dependencies that hold only over a portion of the data. The output of our tool is a set of functional dependencies together with the context in which they hold (for example, a rule stating that, for CS graduate courses, the course number and term functionally determine the room and instructor). Since the input to our tool will likely be a dirty database, we also search for CFDs that almost hold. We return these rules together with the non-conformant records (as these are potentially dirty records).
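To make the course example concrete, the following is a minimal sketch, not the tool's actual algorithm, of checking one conditional functional dependency against a small table. The column names, the sample rows, and the helper `check_cfd` are all hypothetical. Rows matching the context are grouped by the left-hand side attributes. Within each group, the most frequent right-hand side value is treated as conformant and any other value is reported as non-conformant, which is one simple way to read "a CFD that almost holds".

```python
from collections import Counter, defaultdict

def check_cfd(rows, condition, lhs, rhs):
    """Split row indices into (conformant, non_conformant) for the rule
    'within `condition`, `lhs` functionally determines `rhs`'.
    Rows outside the context are ignored. Hypothetical helper."""
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        if all(row[attr] == value for attr, value in condition.items()):
            groups[tuple(row[a] for a in lhs)].append(i)

    conformant, non_conformant = [], []
    for indices in groups.values():
        # Count the right-hand side values seen for this left-hand side key.
        counts = Counter(tuple(rows[i][a] for a in rhs) for i in indices)
        majority, _ = counts.most_common(1)[0]
        for i in indices:
            value = tuple(rows[i][a] for a in rhs)
            (conformant if value == majority else non_conformant).append(i)
    return conformant, non_conformant

# Made-up course data; row 2 disagrees with rows 0 and 1 on the room.
rows = [
    {"dept": "CS", "level": "grad", "course": "2410", "term": "F07", "room": "BA1170", "instructor": "Lee"},
    {"dept": "CS", "level": "grad", "course": "2410", "term": "F07", "room": "BA1170", "instructor": "Lee"},
    {"dept": "CS", "level": "grad", "course": "2410", "term": "F07", "room": "BA2135", "instructor": "Lee"},
    {"dept": "CS", "level": "grad", "course": "2501", "term": "W08", "room": "BA1200", "instructor": "Kim"},
    {"dept": "MATH", "level": "grad", "course": "2410", "term": "F07", "room": "MP102", "instructor": "Roy"},
]

ok, bad = check_cfd(
    rows,
    condition={"dept": "CS", "level": "grad"},
    lhs=["course", "term"],
    rhs=["room", "instructor"],
)
print(ok, bad)  # [0, 1, 3] [2]
```

The MATH row is outside the context, so it is neither conformant nor non-conformant for this rule. Row 2 is flagged as potentially dirty.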
We present effective algorithms for discovering CFDs and dirty values in a data instance. Our discovery algorithm searches for minimal CFDs among the data values and prunes redundant candidates. No universal objective measures of data quality or data quality rules are known. Hence, to avoid returning an unnecessarily large number of CFDs and to return only those that are most interesting, we evaluate a set of interest metrics and present comparative results using real datasets. We also present an experimental study showing the scalability of our techniques.
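As an illustration of how an interest metric can narrow a set of candidate rules, the sketch below scores a candidate by support and confidence. This is an assumption: the abstract does not name the metrics that were evaluated, and support and confidence are borrowed here from association rule mining as a familiar pair. The function name, sample rows, and thresholds are hypothetical. Support is the fraction of all rows that fall inside the rule's context. Confidence is the largest fraction of those rows that can be kept so that the dependency holds exactly.

```python
from collections import Counter, defaultdict

def support_and_confidence(rows, condition, lhs, rhs):
    """Score the candidate rule 'within `condition`, `lhs` -> `rhs`'.
    Hypothetical helper; returns (support, confidence)."""
    matched = [r for r in rows
               if all(r[attr] == value for attr, value in condition.items())]
    if not matched:
        return 0.0, 0.0
    # For each left-hand side key, count the right-hand side values.
    groups = defaultdict(Counter)
    for r in matched:
        groups[tuple(r[a] for a in lhs)][tuple(r[a] for a in rhs)] += 1
    # Keeping only the most frequent value per key makes the rule hold.
    kept = sum(max(counts.values()) for counts in groups.values())
    return len(matched) / len(rows), kept / len(matched)

# Made-up data: course determines room fairly well in CS, poorly in MATH.
rows = [
    {"dept": "CS", "course": "101", "room": "A"},
    {"dept": "CS", "course": "101", "room": "A"},
    {"dept": "CS", "course": "101", "room": "B"},
    {"dept": "CS", "course": "202", "room": "C"},
    {"dept": "MATH", "course": "101", "room": "D"},
    {"dept": "MATH", "course": "101", "room": "E"},
    {"dept": "MATH", "course": "300", "room": "F"},
    {"dept": "MATH", "course": "300", "room": "G"},
]

scores = {
    dept: support_and_confidence(rows, {"dept": dept}, ["course"], ["room"])
    for dept in ("CS", "MATH")
}
# Keep only candidates above illustrative thresholds.
interesting = [d for d, (sup, conf) in scores.items() if sup >= 0.25 and conf >= 0.7]
print(scores)       # {'CS': (0.5, 0.75), 'MATH': (0.5, 0.5)}
print(interesting)  # ['CS']
```

Both contexts cover half of the table, but only the CS rule is kept, because a quarter of its rows would need repair versus half for MATH.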

Published In

Proceedings of the VLDB Endowment  Volume 1, Issue 1
August 2008
1216 pages

Publisher

VLDB Endowment
