Discovery of genuine functional dependencies from relational data with missing values

Authors:

Laure Berti-Équille,

Hazar Harmouch,

Saravanan Thirumuruganathan

Proceedings of the VLDB Endowment, Volume 11, Issue 8

Pages 880 - 892

https://doi.org/10.14778/3204028.3204032

Published: 01 April 2018

Abstract

Functional dependencies (FDs) play an important role in maintaining data quality. They can be used to enforce data consistency and to guide repairs over a database. In this work, we investigate the problem of missing values and its impact on FD discovery. With existing FD discovery algorithms, some genuine FDs cannot be detected precisely because of missing values, while some non-genuine FDs are discovered even though they are only artifacts of missing values interpreted under a particular NULL semantics. We define a notion of genuineness and propose algorithms to compute the genuineness score of a discovered FD. This score can be used to identify the genuine FDs among the set of all valid dependencies that hold on the data. We evaluate the quality of our method over various real-world and semi-synthetic datasets with extensive experiments. The results show that our method performs well for relatively large FD sets and is able to accurately capture genuine FDs.
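
To make the dependence on NULL semantics concrete, the following is a minimal sketch of a validity check for a single FD under two common interpretations of missing values. It is an illustration only, not the genuineness score or the algorithms proposed in the paper. The function name `fd_holds`, the flag `null_equals_null`, and the toy `rows` table are assumptions introduced here.

```python
from itertools import count


def fd_holds(rows, lhs, rhs, null_equals_null=True):
    """Return True if the FD lhs -> rhs holds on rows (a list of dicts).

    None stands for a missing value (hypothetical encoding).
    null_equals_null=True : all NULLs compare equal to each other.
    null_equals_null=False: every NULL is a distinct value that equals nothing else.
    """
    fresh = count()  # source of unique tokens for "NULL != NULL"

    def norm(value):
        if value is None and not null_equals_null:
            return ("null", next(fresh))
        return ("value", value)

    seen = {}  # LHS value combination -> first RHS value observed
    for row in rows:
        key = tuple(norm(row[a]) for a in lhs)
        val = norm(row[rhs])
        if seen.setdefault(key, val) != val:
            return False  # same LHS values, different RHS value: violation
    return True


# Toy relation (made up for illustration) with missing zip codes.
rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": None, "city": "Boston"},
    {"zip": None, "city": "Chicago"},
]

print(fd_holds(rows, ["zip"], "city", null_equals_null=True))   # False
print(fd_holds(rows, ["zip"], "city", null_equals_null=False))  # True
```

On this toy table, `zip -> city` is rejected when NULLs are treated as equal and accepted when each NULL is treated as distinct, so the set of valid FDs returned by a discovery algorithm can change with the chosen semantics alone.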

Published In

Proceedings of the VLDB Endowment, Volume 11, Issue 8, April 2018, 94 pages

ISSN: 2150-8097

Editors: Jian Pei (Simon Fraser University), Sihem Amer-Yahia (University of Grenoble Alpes, CNRS)

Publisher: VLDB Endowment
