research-article

Discovering data quality rules

Authors:

Renée J. Miller

Proceedings of the VLDB Endowment, Volume 1, Issue 1

Pages 1166 - 1177

https://doi.org/10.14778/1453856.1453980

Published: 01 August 2008

Abstract

Dirty data is a serious problem for businesses, leading to incorrect decision making, inefficient daily operations, and ultimately wasted time and money. Dirty data often arises when domain constraints and business rules, meant to preserve data consistency and accuracy, are enforced incompletely or not at all in application code.

In this work, we propose a new data-driven tool that can be used within an organization's data quality management process to suggest possible rules, and to identify conformant and non-conformant records. Data quality rules are known to be contextual, so we focus on the discovery of context-dependent rules. Specifically, we search for conditional functional dependencies (CFDs), that is, functional dependencies that hold only over a portion of the data. The output of our tool is a set of functional dependencies together with the context in which they hold (for example, a rule stating that, for CS graduate courses, the course number and term functionally determine the room and instructor). Since the input to our tool will likely be a dirty database, we also search for CFDs that almost hold. We return these rules together with the non-conformant records (as these are potentially dirty records).
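To make the notion of a CFD concrete, the following minimal Python sketch checks whether a functional dependency holds within a given context on a toy course table. The function name `check_cfd`, the attribute names, and the sample rows are hypothetical and chosen only to mirror the course example in the abstract; this is an illustration of the definition, not the tool described in the paper.

```python
from collections import defaultdict


def check_cfd(rows, condition, lhs, rhs):
    """Check the dependency lhs -> rhs on the rows that satisfy `condition`.

    `condition` maps attribute names to required constant values (the context).
    Returns (holds, violating_rows). Rows that agree on lhs but disagree on
    rhs violate the dependency; every row of such a group is reported.
    """
    groups = defaultdict(list)
    for row in rows:
        if all(row[attr] == value for attr, value in condition.items()):
            groups[tuple(row[a] for a in lhs)].append(row)

    violations = []
    for members in groups.values():
        rhs_values = {tuple(r[a] for a in rhs) for r in members}
        if len(rhs_values) > 1:
            violations.extend(members)
    return (not violations, violations)


# Hypothetical toy data: the dependency holds for CS graduate courses
# but not for the table as a whole.
courses = [
    {"dept": "CS", "level": "grad", "number": 2525, "term": "F08", "room": "BA1130", "instructor": "Lee"},
    {"dept": "CS", "level": "grad", "number": 2525, "term": "F08", "room": "BA1130", "instructor": "Lee"},
    {"dept": "CS", "level": "grad", "number": 2525, "term": "W09", "room": "BA2135", "instructor": "Kim"},
    {"dept": "CS", "level": "undergrad", "number": 108, "term": "F08", "room": "BA1130", "instructor": "Lee"},
    {"dept": "CS", "level": "undergrad", "number": 108, "term": "F08", "room": "MP102", "instructor": "Ng"},
]

lhs, rhs = ["number", "term"], ["room", "instructor"]
holds_in_context, _ = check_cfd(courses, {"dept": "CS", "level": "grad"}, lhs, rhs)
holds_globally, global_violations = check_cfd(courses, {}, lhs, rhs)

print(holds_in_context)  # True
print(holds_globally)    # False
```

An empty condition reduces the check to an ordinary functional dependency over the whole table, which is why the same function covers both cases.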

We present effective algorithms for discovering CFDs and dirty values in a data instance. Our discovery algorithm searches for minimal CFDs among the data values and prunes redundant candidates. No universal objective measures of data quality or data quality rules are known. Hence, to avoid returning an unnecessarily large number of CFDs and to return only those that are most interesting, we evaluate a set of interest metrics and present comparative results using real datasets. We also present an experimental study showing the scalability of our techniques.
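As a rough illustration of how a candidate rule might be scored and its suspect records surfaced, the sketch below computes two simple measures for a rule that may only almost hold. The abstract does not name the interest metrics that are evaluated, so `support` and `confidence` here are assumptions borrowed from association rule mining, and `score_cfd`, the attribute names, and the sample rows are hypothetical.

```python
from collections import Counter, defaultdict


def score_cfd(rows, condition, lhs, rhs):
    """Score the candidate rule (condition: lhs -> rhs) on `rows`.

    support    = fraction of all rows that match the context
    confidence = fraction of matching rows that agree with the majority
                 rhs value of their lhs group
    suspects   = matching rows that disagree with that majority value
    """
    matching = [r for r in rows if all(r[a] == v for a, v in condition.items())]

    groups = defaultdict(list)
    for r in matching:
        groups[tuple(r[a] for a in lhs)].append(r)

    kept = 0
    suspects = []
    for members in groups.values():
        counts = Counter(tuple(r[a] for a in rhs) for r in members)
        majority, majority_count = counts.most_common(1)[0]
        kept += majority_count
        suspects.extend(r for r in members if tuple(r[a] for a in rhs) != majority)

    support = len(matching) / len(rows) if rows else 0.0
    confidence = kept / len(matching) if matching else 0.0
    return support, confidence, suspects


# Hypothetical toy data: one CS row has a mistyped room ("BA113O").
rooms = [
    {"dept": "CS", "number": 2525, "room": "BA1130"},
    {"dept": "CS", "number": 2525, "room": "BA1130"},
    {"dept": "CS", "number": 2525, "room": "BA1130"},
    {"dept": "CS", "number": 2525, "room": "BA113O"},
    {"dept": "MATH", "number": 2525, "room": "MP102"},
]

support, confidence, suspects = score_cfd(rooms, {"dept": "CS"}, ["number"], ["room"])
print(support)     # 0.8
print(confidence)  # 0.75
print(suspects)    # [{'dept': 'CS', 'number': 2525, 'room': 'BA113O'}]
```

Treating the majority value in each group as correct is a simplifying assumption: a rule with high confidence and a few disagreeing rows is reported along with those rows as potentially dirty, while a rule with low confidence is more likely not a rule at all.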

Index Terms

Discovering data quality rules
1. Information systems

Published In

Proceedings of the VLDB Endowment Volume 1, Issue 1

August 2008

1216 pages

ISSN: 2150-8097

Publisher

VLDB Endowment
