Discovery of genuine functional dependencies from relational data with missing values

Published: 01 April 2018

Abstract

Functional dependencies (FDs) play an important role in maintaining data quality: they can be used to enforce data consistency and to guide repairs over a database. In this work, we investigate missing values and their impact on FD discovery. With existing FD discovery algorithms, missing values can keep some genuine FDs from being detected, while some non-genuine FDs are reported only because the missing values are interpreted under a particular NULL semantics. We define a notion of genuineness and propose algorithms to compute the genuineness score of a discovered FD. The score can be used to identify the genuine FDs among all valid dependencies that hold on the data. We evaluate the quality of our method in extensive experiments over various real-world and semi-synthetic datasets. The results show that our method performs well for relatively large FD sets and accurately captures genuine FDs.
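The effect of NULL semantics on FD validity can be illustrated with a small sketch. It is not the paper's algorithm and does not compute a genuineness score; it only checks whether an FD holds under two commonly used interpretations of missing values, "NULL = NULL" (all NULLs are treated as equal) and "NULL ≠ NULL" (every NULL is treated as a distinct value). The function name `fd_holds`, the toy relation, and its column names are assumptions made for illustration.

```python
from itertools import combinations

NULL = None  # marker for a missing value (illustrative choice)


def fd_holds(rows, lhs, rhs, null_equals_null):
    """Return True if the FD lhs -> rhs holds on rows.

    rows: list of dicts, one per tuple.
    lhs: list of attribute names; rhs: a single attribute name.
    null_equals_null: if True, two NULLs compare as equal;
    if False, a NULL is different from every value, including NULL.
    """
    def same(a, b):
        if a is NULL or b is NULL:
            return null_equals_null and a is NULL and b is NULL
        return a == b

    # An FD is violated by any pair of tuples that agree on every
    # LHS attribute but disagree on the RHS attribute.
    for r, s in combinations(rows, 2):
        if all(same(r[a], s[a]) for a in lhs) and not same(r[rhs], s[rhs]):
            return False
    return True


# Hypothetical toy relation: zip -> city is intended to be genuine,
# while the fax column consists only of missing values.
rows = [
    {"zip": "10115", "city": "Berlin", "fax": NULL},
    {"zip": "10115", "city": "Berlin", "fax": NULL},
    {"zip": NULL, "city": "Paris", "fax": NULL},
    {"zip": NULL, "city": "Rome", "fax": NULL},
]

for null_eq in (True, False):
    print(
        "NULL = NULL" if null_eq else "NULL != NULL",
        "| zip -> city:", fd_holds(rows, ["zip"], "city", null_eq),
        "| city -> fax:", fd_holds(rows, ["city"], "fax", null_eq),
    )
```

On this toy relation, `zip -> city` is rejected under NULL = NULL because the two tuples with a missing zip are treated as agreeing on it, while `city -> fax` is accepted under the same semantics only because every fax value is missing. Under NULL ≠ NULL the two outcomes flip. Neither semantics alone separates genuine from spurious dependencies, which is the gap the genuineness score is meant to address.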

Published In

Proceedings of the VLDB Endowment, Volume 11, Issue 8
April 2018
94 pages
ISSN: 2150-8097

Publisher

VLDB Endowment
