More Web Proxy on the site http://driver.im/

research-article

Contextual Data Cleaning with Ontology Functional Dependencies

Authors:

Morteza Alipourlangouri,

Jaroslaw Szlichta,

Sridevi BaskaranAuthors Info & Claims

ACM Journal of Data and Information Quality (JDIQ), Volume 14, Issue 3

Article No.: 20, Pages 1 - 26

https://doi.org/10.1145/3524303

Published: 23 May 2022 Publication History

Abstract

Functional Dependencies define attribute relationships based on syntactic equality, and when used in data cleaning, they erroneously label syntactically different but semantically equivalent values as errors. We explore dependency-based data cleaning with Ontology Functional Dependencies (OFDs), which express semantic attribute relationships such as synonyms defined by an ontology. We study the theoretical foundations of OFDs, including sound and complete axioms and a linear-time inference procedure. We then propose an algorithm for discovering OFDs (exact ones and ones that hold with some exceptions) from data that uses the axioms to prune the search space. Toward enabling OFDs as data quality rules in practice, we study the problem of finding minimal repairs to a relation and ontology with respect to a set of OFDs. We demonstrate the effectiveness of our techniques on real datasets and show that OFDs can significantly reduce the number of false positive errors in data cleaning techniques that rely on traditional Functional Dependencies.

References

[1]

Ontobee. n.d. The Drug Ontology. Retrieved April 28, 2022 from http://www.ontobee.org/ontology/DRON.

[2]

Kaggle. n.d. Data Science for Good: Kiva Crowdfunding.Retrieved April 28, 2022 from https://www.kaggle.com/kiva/data-science-for-good-kiva-crowdfunding?select=loan_themes_by_region.csv.

[3]

NIH. 2016. Medical Ontology Research. Retrieved April 28, 2022 from https://mor.nlm.nih.gov.

[4]

Z. Abedjan, P. Schulze, and F. Naumann. 2014. DFD: Efficient functional dependency discovery. In Proceedings of CIKM. 949–958.

Digital Library

[5]

R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. Verkamo. 1996. Fast discovery of association rules. In Advances in Knowledge Discovery and Data Mining, U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy (Eds.). AAAI Press, Menlo Park, CA, 307–328.

[6]

S. Baskaran, A. Keller, F. Chiang, L. Golab, and J. Szlichta. 2017. Efficient discovery of ontology functional dependencies. In Proceedings of CIKM. 1847–1856.

Digital Library

[7]

L. Bertossi, S. Kolahi, and L. Lakshmanan. 2011. Data cleaning and query answering with matching dependencies and matching functions. In Proceedings of ICDT. 268–279.

Digital Library

[8]

G. Beskales, Ihab F. Ilyas, Lukasz Golab, and Artur Galiullin. 2013. On the relative trust between inconsistent data and inaccurate constraints. In Proceedings of ICDE. 541–552.

Digital Library

[9]

T. Bläsius, T. Friedrich, and M. Schirneck. 2022. The complexity of dependency detection and discovery in relational databases. Theor. Comput. Sci. 900 (2022), 79–96.

Digital Library

[10]

P. Bohannon, M. Flaster, W. Fan, and R. Rastogi. 2005. A cost-based model and effective heuristic for repairing constraints by value modification. In Proceedings of SIGMOD. 143–154.

Digital Library

[11]

F. Chiang and D. Gairola. 2018. InfoClean: Protecting sensitive information in data cleaning. ACM J. Data. Inf. Qual. 9, 4 (2018), Article 22.

[12]

F. Chiang and R. J. Miller. 2011. A unified model for data and constraint repair. In Proceedings of ICDE. 446–457.

Digital Library

[13]

X. Chu, I. Ilyas, and P. Papotti. 2013. Discovering denial constraints. Proc. VLDB Endow. 6, 13 (2013), 1498–1509.

Digital Library

[14]

X. Chu, I. Ilyas, and P. Papotti. 2013. Holistic data cleaning: Putting violations into context. In Proceedings of ICDE. 458–469.

Digital Library

[15]

X. Chu, J. Morcos, I. Ilyas, M. Ouzzani, P. Papotti, N. Tang, and Y. Ye. 2015. KATARA: A data cleaning system powered by knowledge bases and crowdsourcing. In Proceedings of SIGMOD. 1247–1261.

Digital Library

[16]

G. Cong, W. Fan, F. Geerts, X. Jia, and S. Ma. 2007. Improving data quality: Consistency and accuracy. In Proceedings of VLDB. 315–326.

[17]

W. Fan, J. Li, S. Ma, N. Tang, and W. Yu. 2011. Interaction between record matching and data repairing. In Proceedings of SIGMOD. 469–480.

Digital Library

[18]

T. Ferguson. 1989. Who solved the secretary problem?Statist. Sci. 4, 3 (1989), 294–296.

[19]

P. A. Flach and I. Savnik. 1999. Database dependency discovery: A machine learning approach. AI Commun. 12, 3 (1999), 139–160.

Digital Library

[20]

M. R. Garey and D. S. Johnson.1979. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman.

[21]

F. Geerts, G. Mecca, P. Papotti, and D. Santoro. 2013. The LLUNATIC data-cleaning framework. Proc. VLDB Endow. 6, 9 (2013), 625–636.

Digital Library

[22]

S. Hao, N. Tang, G. Li, J. He, N. Ta, and J. Feng. 2017. A novel cost-based model for data repairing. IEEE Trans. Knowl. Data Eng. 29, 4 (2017), 727–742.

Digital Library

[23]

O. Hassanzadeh, A. Kementsietsidis, L. Lim, R. J. Miller, and M. Wang. 2009. LinkedCT: A Linked Data Space for Clinical Trials. Technical Report CSRG-596. University of Toronto.

[24]

Y. Huang, M. Milani, and F. Chiang. 2018. PACAS: Privacy-aware, data cleaning-as-a-service. In Proceedings ofBig Data. 1023–1030.

[25]

Y. Huhtala, J. Kinen, P. Porkka, and H. Toivonen. 1998. Efficient discovery of functional and approximate dependencies using partitions. In Proceedings of ICDE. 392–401.

Digital Library

[26]

V. Huynh and P. Papotti. 2018. Towards a benchmark for fact checking with knowledge bases. In Companion Proceedings of the Web Conference 2018. International World Wide Web Conferences Steering Committee, 1595–1598.

Digital Library

[27]

S. Kolahi and L. V. S. Lakshmanan. 2009. On approximating optimum repairs for functional dependency violations. In Proceedings of ICDT. 53–62.

Digital Library

[28]

N. Koudas, A. Saha, D. Srivastava, and S. Venkatasubramanian. 2009. Metric functional dependencies. In Proceedings of ICDE. 1275–1278.

Digital Library

[29]

M. Levene and G. Loizou. 1998. Axiomatisation of functional dependencies in incomplete relations. Theor. Comput. Sci. 206, 1–2 (Oct. 1998), 283–300.

Digital Library

[30]

Y. E. Lien. 1982. On the equivalence of database models. J. ACM 29, 2 (1982), 333–362.

Digital Library

[31]

E. Livshits, B. Kimelfeld, and S. Roy. 2020. Computing optimal repairs for functional dependencies. ACM Trans. Database Syst. 45, 1 (2020), Article 4.

Digital Library

[32]

S. Lopes, J. Petit, and L. Lakhal. 2000. Efficient discovery of functional dependencies and armstrong relations. In Proceedings of EDBT. 350–364.

[33]

B. Lowerre. 1976. The HARPY Speech Recognition System. Ph.D. Dissertation, Carnegie-Mellon University.

[34]

H. Ma, M. Alipourlangouri, Y. Wu, F. Chiang, and J. Pi. 2019. Ontology-based entity matching in attributed graphs. Proc. VLDB Endow. 12, 10 (2019), 1195–1207.

Digital Library

[35]

M. Mahdavi and Z. Abedjan. 2020. BARAN: Effective error correction via a unified context representation and transfer learning. Proc. VLDB Endow. 13, 11 (2020), 1948–1961.

Digital Library

[36]

D. Miao, Z. Cai, J. Li, X. Gao, and X. Liu. 2020. The computation of optimal subset repairs. Proc. VLDB Endow. 13, 12 (2020), 2061–2074.

Digital Library

[37]

B. Motik, I. Horrocks, and U. Sattler. 2007. Adding integrity constraints to OWL. In Proceedings of OWLED.

[38]

W. Ng. 2001. An extension of the relational data model to incorporate ordered domains. ACM Trans. Database Syst. 26, 3 (2001), 344–383.

Digital Library

[39]

N. Novelli and R. Cicchetti. 2001. FUN: An efficient algorithm for mining functional and embedded dependencies. In Proceedings of ICDT. 189–203.

[40]

S. Ortona, V. Meduri, and P. Papotti. 2018. Robust discovery of positive and negative rules in knowledge bases. In Proceedings of ICDE. 1168–1179.

[41]

T. Papenbrock, T. Bergmann, M. Finke, J. Zwiener, and F. Naumann. 2015. Data profiling with Metanome. Proc. VLDB Endow. 8, 12 (2015), 1860–1871.

Digital Library

[42]

T. Papenbrock, S. Kruse, J. Quiané-Ruiz, and F. Naumann. 2015. Divide and conquer-based inclusion dependency discovery. Proc. VLDB Endow. 8, 7 (2015), 774–785.

Digital Library

[43]

N. Prokoshyna, J. Szlichta, F. Chiang, R. Miller, and D. Srivastava. 2015. Combining quantitative and logical data cleaning. Proc. VLDB Endow. 9, 4 (2015), 300–311.

Digital Library

[44]

R. Yossi, T. Carlo, and J. Leonidas. 2000. The earth mover’s distance as a metric for image retrieval. Int. J. Comput. Vis. 2, 40 (2000), 99–121.

[45]

T. Rekatsinas, X. Chu, I. Ilyas, and C. Ré. 2017. HoloClean: Holistic data repairs with probabilistic inference. Proc. VLDB Endow. 10 (2017), 1190–1201.

Digital Library

[46]

P. J. Rousseeuw and C. Croux. 1993. Alternatives to the median absolute deviation. J. Amer. Statist. Assoc. 88, 424 (1993), 1273–1283.

[47]

Y. Rubner, C. Tomasi, and L. J. Guibas. 2000. The earth mover’s distance as a metric for image retrieval. Int. J. Comput. Vision 40, 2 (2000), 99–121.

Digital Library

[48]

J. Szlichta, P. Godfrey, L. Golab, M. Kargar, and D. Srivastava. 2017. Effective and complete discovery of order dependencies via set-based axiomatization. Proc. VLDB Endow. 10, 7 (2017), 721–732.

Digital Library

[49]

J. Szlichta, P. Godfrey, and J. Gryz. 2012. Fundamentals of order dependencies. Proc. VLDB Endow. 5, 11 (2012), 1220–1231.

Digital Library

[50]

U.S. Food and Drug Administration. 2018. National Drug Code Directory. Retrieved April 28, 2022 from https://www.fda.gov.

[51]

C. Wyss, C. Giannella, and E. Robertson. 2001. FastFDs: A heuristic-driven, depth-first algorithm for mining FDs from relations. In Proceedings of DaWaK. 101–110.

[52]

M. Yakout, L. Berti-Équille, and A. Elmagarmid. 2013. Don’t be SCAREd: Use SCalable Automatic REpairing with maximal likelihood and bounded changes. In Proceedings of SIGMOD. 553–564.

Digital Library

[53]

H. Yao and H. Hamilton. 2008. Mining functional dependencies from data. Data Min. Knowl. Discov. 16, 2 (2008), 197–219.

Digital Library

[54]

Z. Zheng, L. Zheng, M. Alipour Langouri, F. Chiang, L. Golab, and J. Szlichta. 2022. Discovery and contextual data cleaning with ontology functional dependencies. arXiv:2105.08105 (2022).

Cited By

Alipourlangouri MChiang F(2024)Mining Keys for GraphsData & Knowledge Engineering10.1016/j.datak.2023.102274150:COnline publication date: 2-Jul-2024
https://dl.acm.org/doi/10.1016/j.datak.2023.102274
Del Gaudio DSchubert TAbdelaal M(2023)RTClean: Context-aware Tabular Data Cleaning using Real-time OFDs2023 IEEE International Conference on Pervasive Computing and Communications Workshops and other Affiliated Events (PerCom Workshops)10.1109/PerComWorkshops56833.2023.10150361(546-551)Online publication date: 13-Mar-2023
https://doi.org/10.1109/PerComWorkshops56833.2023.10150361

Index Terms

Contextual Data Cleaning with Ontology Functional Dependencies
1. Information systems
  1. Data management systems
    1. Database design and models
      1. Data model extensions
        Inconsistent data
      2. Relational database model

Recommendations

Contextual Data Cleaning with Ontology FDs
SIGMOD '21: Proceedings of the 2021 International Conference on Management of Data

Functional Dependencies (FDs) define attribute relationships based on syntactic equality, and, when used in data cleaning, they erroneously label syntactically different but semantically equivalent values as errors. We motivate the need to include ...
Efficient Discovery of Ontology Functional Dependencies
CIKM '17: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management

Functional Dependencies (FDs) define attribute relationships based on syntactic equality, and, when used in data cleaning, they erroneously label syntactically different but semantically equivalent values as errors. We enhance dependency-based data ...
Foundational Challenges in Automated Semantic Web Data and Ontology Cleaning

Applying automated reasoning systems to data cleaning in the Semantic Web raises many foundational challenges regarding cleaning-agent design. The authors discuss these challenges and then argue that we can achieve logic trust in the Semantic Web only ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Journal of Data and Information Quality

Journal of Data and Information Quality Volume 14, Issue 3

September 2022

155 pages

ISSN:1936-1955

EISSN:1936-1963

DOI:10.1145/3533272

Editor:
Tiziana Catarci
Sapienza University of Rome, Rome, Italy

Issue’s Table of Contents

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 23 May 2022

Online AM: 21 April 2022

Accepted: 01 March 2022

Received: 01 June 2021

Revised: 01 January 2021

Published in JDIQ Volume 14, Issue 3

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article
Refereed

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

2
Total Citations
View Citations
436
Total Downloads

Downloads (Last 12 months)89
Downloads (Last 6 weeks)5

Reflects downloads up to 03 Mar 2025

Other Metrics

View Author Metrics

Citations

Cited By

Alipourlangouri MChiang F(2024)Mining Keys for GraphsData & Knowledge Engineering10.1016/j.datak.2023.102274150:COnline publication date: 2-Jul-2024
https://dl.acm.org/doi/10.1016/j.datak.2023.102274
Del Gaudio DSchubert TAbdelaal M(2023)RTClean: Context-aware Tabular Data Cleaning using Real-time OFDs2023 IEEE International Conference on Pervasive Computing and Communications Workshops and other Affiliated Events (PerCom Workshops)10.1109/PerComWorkshops56833.2023.10150361(546-551)Online publication date: 13-Mar-2023
https://doi.org/10.1109/PerComWorkshops56833.2023.10150361

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Full Text

View this article in Full Text.

HTML Format

View this article in HTML Format.

Figures

Tables

Media

View full text|Download PDF

View Issue’s Table of Contents