
Data Quality and Explainable AI

Published: 03 May 2020

Abstract

In this work, we offer some insights and develop some ideas, with few technical details, about the role of explanations in Data Quality in the context of data-based machine learning (ML) models. In this direction there are, as expected, roles for causality and explainable artificial intelligence. The latter area sheds light not only on the models, but also on the data that support model construction. There is also room for defining, identifying, and explaining errors in data, in particular in ML, and for suggesting repair actions. More generally, explanations can be used as a basis for defining dirty data in the context of ML, and for measuring or quantifying it. We think of dirtiness as relative to the ML task at hand, e.g., classification.


    Information

    Published In

    Journal of Data and Information Quality  Volume 12, Issue 2
    Special Issue on Quality Assessment of Knowledge Graphs and On the Horizon
    June 2020
    105 pages
    ISSN:1936-1955
    EISSN:1936-1963
    DOI:10.1145/3397186
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 03 May 2020
    Accepted: 01 March 2020
    Received: 01 March 2020
    Published in JDIQ Volume 12, Issue 2

    Author Tags

    1. Machine learning
    2. bias
    3. causes
    4. fairness

    Qualifiers

    • Research-article
    • Research
    • Refereed
