[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
research-article

Contextual Data Cleaning with Ontology Functional Dependencies

Published: 23 May 2022 Publication History

Abstract

Functional Dependencies define attribute relationships based on syntactic equality, and when used in data cleaning, they erroneously label syntactically different but semantically equivalent values as errors. We explore dependency-based data cleaning with Ontology Functional Dependencies (OFDs), which express semantic attribute relationships such as synonyms defined by an ontology. We study the theoretical foundations of OFDs, including sound and complete axioms and a linear-time inference procedure. We then propose an algorithm for discovering OFDs (exact ones and ones that hold with some exceptions) from data that uses the axioms to prune the search space. Toward enabling OFDs as data quality rules in practice, we study the problem of finding minimal repairs to a relation and ontology with respect to a set of OFDs. We demonstrate the effectiveness of our techniques on real datasets and show that OFDs can significantly reduce the number of false positive errors in data cleaning techniques that rely on traditional Functional Dependencies.

References

[1]
Ontobee. n.d. The Drug Ontology. Retrieved April 28, 2022 from http://www.ontobee.org/ontology/DRON.
[2]
Kaggle. n.d. Data Science for Good: Kiva Crowdfunding.Retrieved April 28, 2022 from https://www.kaggle.com/kiva/data-science-for-good-kiva-crowdfunding?select=loan_themes_by_region.csv.
[3]
NIH. 2016. Medical Ontology Research. Retrieved April 28, 2022 from https://mor.nlm.nih.gov.
[4]
Z. Abedjan, P. Schulze, and F. Naumann. 2014. DFD: Efficient functional dependency discovery. In Proceedings of CIKM. 949–958.
[5]
R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. Verkamo. 1996. Fast discovery of association rules. In Advances in Knowledge Discovery and Data Mining, U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy (Eds.). AAAI Press, Menlo Park, CA, 307–328.
[6]
S. Baskaran, A. Keller, F. Chiang, L. Golab, and J. Szlichta. 2017. Efficient discovery of ontology functional dependencies. In Proceedings of CIKM. 1847–1856.
[7]
L. Bertossi, S. Kolahi, and L. Lakshmanan. 2011. Data cleaning and query answering with matching dependencies and matching functions. In Proceedings of ICDT. 268–279.
[8]
G. Beskales, Ihab F. Ilyas, Lukasz Golab, and Artur Galiullin. 2013. On the relative trust between inconsistent data and inaccurate constraints. In Proceedings of ICDE. 541–552.
[9]
T. Bläsius, T. Friedrich, and M. Schirneck. 2022. The complexity of dependency detection and discovery in relational databases. Theor. Comput. Sci. 900 (2022), 79–96.
[10]
P. Bohannon, M. Flaster, W. Fan, and R. Rastogi. 2005. A cost-based model and effective heuristic for repairing constraints by value modification. In Proceedings of SIGMOD. 143–154.
[11]
F. Chiang and D. Gairola. 2018. InfoClean: Protecting sensitive information in data cleaning. ACM J. Data. Inf. Qual. 9, 4 (2018), Article 22.
[12]
F. Chiang and R. J. Miller. 2011. A unified model for data and constraint repair. In Proceedings of ICDE. 446–457.
[13]
X. Chu, I. Ilyas, and P. Papotti. 2013. Discovering denial constraints. Proc. VLDB Endow. 6, 13 (2013), 1498–1509.
[14]
X. Chu, I. Ilyas, and P. Papotti. 2013. Holistic data cleaning: Putting violations into context. In Proceedings of ICDE. 458–469.
[15]
X. Chu, J. Morcos, I. Ilyas, M. Ouzzani, P. Papotti, N. Tang, and Y. Ye. 2015. KATARA: A data cleaning system powered by knowledge bases and crowdsourcing. In Proceedings of SIGMOD. 1247–1261.
[16]
G. Cong, W. Fan, F. Geerts, X. Jia, and S. Ma. 2007. Improving data quality: Consistency and accuracy. In Proceedings of VLDB. 315–326.
[17]
W. Fan, J. Li, S. Ma, N. Tang, and W. Yu. 2011. Interaction between record matching and data repairing. In Proceedings of SIGMOD. 469–480.
[18]
T. Ferguson. 1989. Who solved the secretary problem?Statist. Sci. 4, 3 (1989), 294–296.
[19]
P. A. Flach and I. Savnik. 1999. Database dependency discovery: A machine learning approach. AI Commun. 12, 3 (1999), 139–160.
[20]
M. R. Garey and D. S. Johnson.1979. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman.
[21]
F. Geerts, G. Mecca, P. Papotti, and D. Santoro. 2013. The LLUNATIC data-cleaning framework. Proc. VLDB Endow. 6, 9 (2013), 625–636.
[22]
S. Hao, N. Tang, G. Li, J. He, N. Ta, and J. Feng. 2017. A novel cost-based model for data repairing. IEEE Trans. Knowl. Data Eng. 29, 4 (2017), 727–742.
[23]
O. Hassanzadeh, A. Kementsietsidis, L. Lim, R. J. Miller, and M. Wang. 2009. LinkedCT: A Linked Data Space for Clinical Trials. Technical Report CSRG-596. University of Toronto.
[24]
Y. Huang, M. Milani, and F. Chiang. 2018. PACAS: Privacy-aware, data cleaning-as-a-service. In Proceedings ofBig Data. 1023–1030.
[25]
Y. Huhtala, J. Kinen, P. Porkka, and H. Toivonen. 1998. Efficient discovery of functional and approximate dependencies using partitions. In Proceedings of ICDE. 392–401.
[26]
V. Huynh and P. Papotti. 2018. Towards a benchmark for fact checking with knowledge bases. In Companion Proceedings of the Web Conference 2018. International World Wide Web Conferences Steering Committee, 1595–1598.
[27]
S. Kolahi and L. V. S. Lakshmanan. 2009. On approximating optimum repairs for functional dependency violations. In Proceedings of ICDT. 53–62.
[28]
N. Koudas, A. Saha, D. Srivastava, and S. Venkatasubramanian. 2009. Metric functional dependencies. In Proceedings of ICDE. 1275–1278.
[29]
M. Levene and G. Loizou. 1998. Axiomatisation of functional dependencies in incomplete relations. Theor. Comput. Sci. 206, 1–2 (Oct. 1998), 283–300.
[30]
Y. E. Lien. 1982. On the equivalence of database models. J. ACM 29, 2 (1982), 333–362.
[31]
E. Livshits, B. Kimelfeld, and S. Roy. 2020. Computing optimal repairs for functional dependencies. ACM Trans. Database Syst. 45, 1 (2020), Article 4.
[32]
S. Lopes, J. Petit, and L. Lakhal. 2000. Efficient discovery of functional dependencies and armstrong relations. In Proceedings of EDBT. 350–364.
[33]
B. Lowerre. 1976. The HARPY Speech Recognition System. Ph.D. Dissertation, Carnegie-Mellon University.
[34]
H. Ma, M. Alipourlangouri, Y. Wu, F. Chiang, and J. Pi. 2019. Ontology-based entity matching in attributed graphs. Proc. VLDB Endow. 12, 10 (2019), 1195–1207.
[35]
M. Mahdavi and Z. Abedjan. 2020. BARAN: Effective error correction via a unified context representation and transfer learning. Proc. VLDB Endow. 13, 11 (2020), 1948–1961.
[36]
D. Miao, Z. Cai, J. Li, X. Gao, and X. Liu. 2020. The computation of optimal subset repairs. Proc. VLDB Endow. 13, 12 (2020), 2061–2074.
[37]
B. Motik, I. Horrocks, and U. Sattler. 2007. Adding integrity constraints to OWL. In Proceedings of OWLED.
[38]
W. Ng. 2001. An extension of the relational data model to incorporate ordered domains. ACM Trans. Database Syst. 26, 3 (2001), 344–383.
[39]
N. Novelli and R. Cicchetti. 2001. FUN: An efficient algorithm for mining functional and embedded dependencies. In Proceedings of ICDT. 189–203.
[40]
S. Ortona, V. Meduri, and P. Papotti. 2018. Robust discovery of positive and negative rules in knowledge bases. In Proceedings of ICDE. 1168–1179.
[41]
T. Papenbrock, T. Bergmann, M. Finke, J. Zwiener, and F. Naumann. 2015. Data profiling with Metanome. Proc. VLDB Endow. 8, 12 (2015), 1860–1871.
[42]
T. Papenbrock, S. Kruse, J. Quiané-Ruiz, and F. Naumann. 2015. Divide and conquer-based inclusion dependency discovery. Proc. VLDB Endow. 8, 7 (2015), 774–785.
[43]
N. Prokoshyna, J. Szlichta, F. Chiang, R. Miller, and D. Srivastava. 2015. Combining quantitative and logical data cleaning. Proc. VLDB Endow. 9, 4 (2015), 300–311.
[44]
R. Yossi, T. Carlo, and J. Leonidas. 2000. The earth mover’s distance as a metric for image retrieval. Int. J. Comput. Vis. 2, 40 (2000), 99–121.
[45]
T. Rekatsinas, X. Chu, I. Ilyas, and C. Ré. 2017. HoloClean: Holistic data repairs with probabilistic inference. Proc. VLDB Endow. 10 (2017), 1190–1201.
[46]
P. J. Rousseeuw and C. Croux. 1993. Alternatives to the median absolute deviation. J. Amer. Statist. Assoc. 88, 424 (1993), 1273–1283.
[47]
Y. Rubner, C. Tomasi, and L. J. Guibas. 2000. The earth mover’s distance as a metric for image retrieval. Int. J. Comput. Vision 40, 2 (2000), 99–121.
[48]
J. Szlichta, P. Godfrey, L. Golab, M. Kargar, and D. Srivastava. 2017. Effective and complete discovery of order dependencies via set-based axiomatization. Proc. VLDB Endow. 10, 7 (2017), 721–732.
[49]
J. Szlichta, P. Godfrey, and J. Gryz. 2012. Fundamentals of order dependencies. Proc. VLDB Endow. 5, 11 (2012), 1220–1231.
[50]
U.S. Food and Drug Administration. 2018. National Drug Code Directory. Retrieved April 28, 2022 from https://www.fda.gov.
[51]
C. Wyss, C. Giannella, and E. Robertson. 2001. FastFDs: A heuristic-driven, depth-first algorithm for mining FDs from relations. In Proceedings of DaWaK. 101–110.
[52]
M. Yakout, L. Berti-Équille, and A. Elmagarmid. 2013. Don’t be SCAREd: Use SCalable Automatic REpairing with maximal likelihood and bounded changes. In Proceedings of SIGMOD. 553–564.
[53]
H. Yao and H. Hamilton. 2008. Mining functional dependencies from data. Data Min. Knowl. Discov. 16, 2 (2008), 197–219.
[54]
Z. Zheng, L. Zheng, M. Alipour Langouri, F. Chiang, L. Golab, and J. Szlichta. 2022. Discovery and contextual data cleaning with ontology functional dependencies. arXiv:2105.08105 (2022).

Cited By

View all
  • (2024)Mining Keys for GraphsData & Knowledge Engineering10.1016/j.datak.2023.102274150:COnline publication date: 2-Jul-2024
  • (2023)RTClean: Context-aware Tabular Data Cleaning using Real-time OFDs2023 IEEE International Conference on Pervasive Computing and Communications Workshops and other Affiliated Events (PerCom Workshops)10.1109/PerComWorkshops56833.2023.10150361(546-551)Online publication date: 13-Mar-2023

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Journal of Data and Information Quality
Journal of Data and Information Quality  Volume 14, Issue 3
September 2022
155 pages
ISSN:1936-1955
EISSN:1936-1963
DOI:10.1145/3533272
Issue’s Table of Contents

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 23 May 2022
Online AM: 21 April 2022
Accepted: 01 March 2022
Received: 01 June 2021
Revised: 01 January 2021
Published in JDIQ Volume 14, Issue 3

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Data cleaning
  2. ontology functional dependencies

Qualifiers

  • Research-article
  • Refereed

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)111
  • Downloads (Last 6 weeks)11
Reflects downloads up to 17 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2024)Mining Keys for GraphsData & Knowledge Engineering10.1016/j.datak.2023.102274150:COnline publication date: 2-Jul-2024
  • (2023)RTClean: Context-aware Tabular Data Cleaning using Real-time OFDs2023 IEEE International Conference on Pervasive Computing and Communications Workshops and other Affiliated Events (PerCom Workshops)10.1109/PerComWorkshops56833.2023.10150361(546-551)Online publication date: 13-Mar-2023

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Full Text

View this article in Full Text.

Full Text

HTML Format

View this article in HTML Format.

HTML Format

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media