[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/3514221.3517864acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article
Public Access

DataPrism: Exposing Disconnect between Data and Systems

Published: 11 June 2022 Publication History

Abstract

As data is a central component of many modern systems, the cause of a system malfunction may reside in the data, and, specifically, particular properties of data. E.g., a health-monitoring system that is designed under the assumption that weight is reported in lbs will malfunction when encountering weight reported in kilograms. Like software debugging, which aims to find bugs in the source code or runtime conditions, our goal is to debug data to identify potential sources of disconnect between the assumptions about some data and systems that operate on that data. We propose DataPrism, a framework to identify data properties (profiles) that are the root causes of performance degradation or failure of a data-driven system. Such identification is necessary to repair data and resolve the disconnect between data and systems. Our technique is based on causal reasoning through interventions: when a system malfunctions for a dataset, DataPrism alters the data profiles and observes changes in the system's behavior due to the alteration. Unlike statistical observational analysis that reports mere correlations, DataPrism reports causally verified root causes -- in terms of data profiles -- of the system malfunction. We empirically evaluate DataPrism on seven real-world and several synthetic data-driven systems that fail on certain datasets due to a diverse set of reasons. In all cases, DataPrism identifies the root causes precisely while requiring orders of magnitude fewer interventions than prior techniques.

Supplemental Material

PDF File
Read me
ZIP File
Source Code

References

[1]
Ziawasch Abedjan, Lukasz Golab, and Felix Naumann. 2015. Profiling relational data: a survey. The VLDB Journal 24, 4 (2015), 557--581.
[2]
Ziawasch Abedjan, Lukasz Golab, Felix Naumann, and Thorsten Papenbrock. 2018. Data profiling. Synthesis Lectures on Data Management 10, 4 (2018), 1--154.
[3]
AdaBoost Classifier. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html
[4]
Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual String Embeddings for Sequence Labeling. In COLING 2018, 27th International Conference on Computational Linguistics. 1638--1649.
[5]
Mona Attariyan, Michael Chow, and Jason Flinn. 2012. X-Ray: Automating Root-Cause Diagnosis of Performance Anomalies in Production Software. In Proceedings of USENIX OSDI (Hollywood, CA, USA) (OSDI'12). USENIX Association, USA, 307--320.
[6]
Mona Attariyan and Jason Flinn. 2011. Automating Configuration Troubleshooting with ConfAid. ;login: 36, 1 (2011), 1--14.
[7]
Peter Bailis, Edward Gan, Samuel Madden, Deepak Narayanan, Kexin Rong, and Sahaana Suri. 2017. MacroBase: Prioritizing Attention in Fast Data. In Proceedings of the 2017 ACM International Conference on Management of Data (Chicago, Illinois, USA) (SIGMOD '17). ACM, New York, NY, USA, 541--556.
[8]
Daniel W Barowy, Emery D Berger, and Benjamin Zorn. 2018. ExceLint: Automatically finding spreadsheet formula errors. Proceedings of the ACM on Programming Languages 2, OOPSLA (2018), 1--26.
[9]
Daniel W. Barowy, Dimitar Gochev, and Emery D. Berger. 2014. CheckCell: data debugging for spreadsheets. In OOPSLA. 507--523.
[10]
Rachel KE Bellamy, Kuntal Dey, Michael Hind, Samuel C Hoffman, Stephanie Houde, Kalapriya Kannan, Pranay Lohia, Jacquelyn Martino, Sameep Mehta, Aleksandra Mojsilovic, et al . 2018. AI Fairness 360: An extensible toolkit for detecting, understanding, and mitigating unwanted algorithmic bias. arXiv preprint arXiv:1810.01943 (2018).
[11]
Bias in Amazon Hiring. https://becominghuman.ai/amazons-sexist-ai-recruiting-tool-how-did-it-go-so-wrong-e3d14816d98e
[12]
Mike Brachmann, Carlos Bautista, Sonia Castelo, Su Feng, Juliana Freire, Boris Glavic, Oliver Kennedy, Heiko Müeller, Rémi Rampin, William Spoth, et al. 2019. Data debugging and exploration with vizier. In Proceedings of the 2019 International Conference on Management of Data. 1877--1880.
[13]
Eric Breck, Neoklis Polyzotis, Sudip Roy, Steven Euijong Whang, and Martin Zinkevich. 2019. Data validation for machine learning. In Conference on Systems and Machine Learning (SysML). https://www. sysml. cc/doc/2019/167. pdf.
[14]
Gabriel Cadamuro, Ran Gilad-Bachrach, and Xiaojin Zhu. 2016. Debugging machine learning models. In ICML Workshop on Reliable Machine Learning in the Wild.
[15]
Cardiovascular Disease dataset. https://www.kaggle.com/sulianova/cardiovascular-disease-dataset
[16]
Loredana Caruccio, Vincenzo Deufemia, and Giuseppe Polese. 2016. On the discovery of relaxed functional dependencies. In Proceedings of the 20th International Database Engineering & Applications Symposium. 53--61.
[17]
Giuseppe Casalicchio, Christoph Molnar, and Bernd Bischl. 2018. Visualizing the feature importance for black box models. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 655--670.
[18]
Mike Y. Chen, Emre Kiciman, Eugene Fratkin, Armando Fox, and Eric Brewer. 2002. Pinpoint: Problem Determination in Large, Dynamic Internet Services. In Proceedings of IEEE DSN (DSN'02). IEEE, USA, 595--604.
[19]
Fernando Chirigati, Harish Doraiswamy, Theodoros Damoulas, and Juliana Freire. 2016. Data Polygamy: The Many-Many Relationships Among Urban Spatio- Temporal Data Sets. In Proceedings of ACM SIGMOD (SIGMOD '16). ACM, New York, NY, USA, 1011--1025.
[20]
Xu Chu, Ihab F. Ilyas, and Paolo Papotti. 2013. Discovering Denial Constraints. PVLDB 6, 13 (2013), 1498--1509.
[21]
Xin Luna Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Kevin Murphy, Shaohua Sun, and Wei Zhang. 2014. From Data Fusion to Knowledge Fusion. PVLDB 7, 10 (2014), 881--892.
[22]
Dingzhu Du, Frank K Hwang, and Frank Hwang. 2000. Combinatorial group testing and its applications. Vol. 12. World Scientific.
[23]
Dheeru Dua and Casey Graff. 2017. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml
[24]
Kareem El Gebaly, Parag Agrawal, Lukasz Golab, Flip Korn, and Divesh Srivastava. 2014. Interpretable and Informative Explanations of Outcomes. PVLDB 8, 1 (Sept. 2014), 61--72.
[25]
Wenfei Fan, Floris Geerts, Laks V. S. Lakshmanan, and Ming Xiong. 2009. Discovering Conditional Functional Dependencies. In Proceedings of the 25th International Conference on Data Engineering, ICDE 2009, March 29 2009 - April 2 2009, Shanghai, China. 1231--1234.
[26]
Anna Fariha, Suman Nath, and Alexandra Meliou. 2020. Causality-Guided Adaptive Interventional Debugging. In SIGMOD. 431--446.
[27]
Anna Fariha, Ashish Tiwari, Arjun Radhakrishna, Sumit Gulwani, and Alexandra Meliou. 2021. Conformance Constraint Discovery: Measuring Trust in Data- Driven Systems. In SIGMOD '21: International Conference on Management of Data, Virtual Event, China, June 20--25, 2021. ACM, 499--512.
[28]
Gordon Fraser and Andrea Arcuri. 2013. Whole test suite generation. IEEE Transactions on Software Engineering 39, 2 (2013), 276--291.
[29]
Sainyam Galhotra, Yuriy Brun, and Alexandra Meliou. 2017. Fairness testing: testing software for discrimination. In Proceedings of the 2017 11th Joint meeting on foundations of software engineering. 498--510.
[30]
Sainyam Galhotra, Anna Fariha, Raoni Lourenço, Juliana Freire, Alexandra Meliou, and Divesh Srivastava. 2021. DataPrism: Exposing Disconnect between Data and Systems. Technical Report. https://arxiv.org/abs/2105.06058.
[31]
Sainyam Galhotra, Udayan Khurana, Oktie Hassanzadeh, Kavitha Srinivas, Horst Samulowitz, and Miao Qi. 2019. Automated Feature Enhancement for Predictive Modeling using External Knowledge. In 2019 International Conference on Data Mining Workshops (ICDMW). IEEE, 1094--1097.
[32]
Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna M. Wallach, Hal Daumé III, and Kate Crawford. 2018. Datasheets for Datasets. CoRR abs/1803.09010 (2018). arXiv:1803.09010
[33]
Patrice Godefroid, Michael Y. Levin, and David A. Molnar. 2008. Automated whitebox fuzz testing. In Proceedings of NDSS. 151--166.
[34]
Google Vision Racism. https://algorithmwatch.org/en/story/google-vision-racism/
[35]
Muhammad Ali Gulzar, Siman Wang, and Miryung Kim. 2018. BigSift: Automated Debugging of Big Data Analytics in Data-Intensive Scalable Computing. In Proceedings of ESEC/FSE (Lake Buena Vista, FL, USA) (ESEC/FSE 2018). ACM, New York, NY, USA, 863--866.
[36]
Brent Hailpern and Padmanabhan Santhanam. 2002. Software debugging, testing, and verification. IBM Systems Journal 41, 1 (2002), 4--12.
[37]
Joseph M Hellerstein. 2008. Quantitative Data Cleaning for Large Databases. (2008).
[38]
Thomas A Henzinger, Ranjit Jhala, Rupak Majumdar, and Grégoire Sutre. 2003. Software verification with BLAST. In International SPIN Workshop on Model Checking of Software. Springer, 235--239.
[39]
Christian Holler, Kim Herzig, and Andreas Zeller. 2012. Fuzzing with code fragments. In Proceedings of USENIX Security Symposium. 445--458.
[40]
IBM AIF 360. https://aif360.mybluemix.net/
[41]
Ihab F Ilyas, Volker Markl, Peter Haas, Paul Brown, and Ashraf Aboulnaga. 2004. CORDS: automatic discovery of correlations and soft functional dependencies. In Proceedings of the 2004 ACM SIGMOD international conference on Management of data. 647--658.
[42]
IMDb Dataset. https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
[43]
Md Shahriar Iqbal, Rahul Krishna, Mohammad Ali Javidian, Baishakhi Ray, and Pooyan Jamshidi. [n.d.]. CADET: A Systematic Method For Debugging Miscon- figurations using Counterfactual Reasoning. ([n. d.]).
[44]
Is Amazon same-day delivery service racist? 2016. The Christian Science Monitor. https://www.csmonitor.com/Business/2016/0423/Is-Amazon-same-day-delivery-service-racist
[45]
Brittany Johnson, Yuriy Brun, and Alexandra Meliou. 2020. Causal Testing: Finding Defects' Root Causes. In ICSE.
[46]
Nick Koudas, Avishek Saha, Divesh Srivastava, and Suresh Venkatasubramanian. 2009. Metric functional dependencies. In 2009 IEEE 25th International Conference on Data Engineering. IEEE, 1275--1278.
[47]
Todd Kulesza, Margaret Burnett, Weng-Keen Wong, and Simone Stumpf. 2015. Principles of explanatory debugging to personalize interactive machine learning. In Proceedings of the 20th international conference on intelligent user interfaces. 126--137.
[48]
Amresh Kumar, M Kiran, and BR Prathap. 2013. Verification and validation of mapreduce program model for parallel k-means algorithm on hadoop cluster. In 2013 Fourth International Conference on Computing, Communications and Networking Technologies (ICCCNT). IEEE, 1--8.
[49]
Philipp Langer and Felix Naumann. 2016. Efficient order dependency detection. VLDB J. 25, 2 (2016), 223--241.
[50]
Ben Liblit, Mayur Naik, Alice X Zheng, Alex Aiken, and Michael I Jordan. 2005. Scalable statistical bug isolation. Acm Sigplan Notices 40, 6 (2005), 15--26.
[51]
Haopeng Liu, Shan Lu, Madan Musuvathi, and Suman Nath. 2019. What bugs cause production cloud incidents?. In Proceedings of the Workshop on Hot Topics in Operating Systems, HotOS 2019, Bertinoro, Italy, May 13--15, 2019. 155--162.
[52]
Raoni Lourenço, Juliana Freire, and Dennis E. Shasha. 2020. BugDoc: Algorithms to Debug Computational Processes. In SIGMOD. 463--478.
[53]
Ali Mesbah, Arie Van Deursen, and Danny Roest. 2011. Invariant-based automatic testing of modern web applications. IEEE Transactions on Software Engineering 38, 1 (2011), 35--53.
[54]
Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. 2019. Model Cards for Model Reporting. Proceedings of the Conference on Fairness, Accountability, and Transparency (Jan 2019).
[55]
K?vanç Mu?lu, Yuriy Brun, and Alexandra Meliou. 2013. Data debugging with continuous testing. In Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering. 631--634.
[56]
Thorsten Papenbrock, Jens Ehrlich, Jannik Marten, Tommy Neubert, Jan-Peer Rudolph, Martin Schönberg, Jakob Zwiener, and Felix Naumann. 2015. Functional dependency discovery: An experimental evaluation of seven algorithms. PLDB 8, 10 (2015), 1082--1093.
[57]
Thorsten Papenbrock, Sebastian Kruse, Jorge-Arnulfo Quiané-Ruiz, and Felix Naumann. 2015. Divide & conquer-based inclusion dependency discovery. PLDB 8, 7 (2015), 774--785.
[58]
Python Rexpy package. https://tdda.readthedocs.io/en/v1.0.30/rexpy.html
[59]
Random Forest Classifier. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
[60]
Kaivalya Rawal, Ece Kamar, and Himabindu Lakkaraju. 2020. Can I Still Trust You?: Understanding the Impact of Distribution Shifts on Algorithmic Recourses. arXiv preprint arXiv:2012.11788 (2020).
[61]
Theodoros Rekatsinas, Xu Chu, Ihab F Ilyas, and Christopher Ré. 2017. HoloClean: holistic data repairs with probabilistic inference. PVLDB 10, 11 (2017), 1190--1201.
[62]
El Kindi Rezig, Ashrita Brahmaroutu, Nesime Tatbul, Mourad Ouzzani, Nan Tang, Timothy G. Mattson, Samuel Madden, and Michael Stonebraker. 2020. Debugging Large-Scale Data Science Pipelines using Dagger. PVLDB 13, 12 (2020), 2993--2996.
[63]
El Kindi Rezig, Lei Cao, Giovanni Simonini, Maxime Schoemans, Samuel Madden, Nan Tang, Mourad Ouzzani, and Michael Stonebraker. 2020. Dagger: A Data (not code) Debugger. In CIDR 2020, 10th Conference on Innovative Data Systems Research, Amsterdam, The Netherlands, January 12--15, 2020, Online Proceedings.
[64]
El Kindi Rezig, Mourad Ouzzani, Walid G Aref, Ahmed K Elmagarmid, Ahmed R Mahmood, and Michael Stonebraker. 2021. Horizon: scalable dependency-driven data cleaning. PVLDB 14, 11 (2021), 2546--2554.
[65]
Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why should I trust you?" Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. 1135--1144.
[66]
Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2018. Anchors: High- Precision Model-Agnostic Explanations. In AAAI, Vol. 18. 1527--1535.
[67]
Jeremias Rößler, Gordon Fraser, Andreas Zeller, and Alessandro Orso. 2012. Isolating failure causes through test case generation. In International Symposium on Software Testing and Analysis, ISSTA 2012, Minneapolis, MN, USA, July 15--20, 2012, Mats Per Erik Heimdahl and Zhendong Su (Eds.). ACM, 309--319.
[68]
Babak Salimi, Harsh Parikh, Moe Kayali, Lise Getoor, Sudeepa Roy, and Dan Suciu. 2020. Causal Relational Learning. In Proceedings of the 2020 International Conference on Management of Data, SIGMOD Conference 2020, online conference [Portland, OR, USA], June 14--19, 2020. 241--256.
[69]
Babak Salimi, Luke Rodriguez, Bill Howe, and Dan Suciu. 2019. Interventional Fairness: Causal Database Repair for Algorithmic Fairness. In SIGMOD. 793--810.
[70]
Sebastian Schelter, Tammo Rukat, and Felix Bießmann. 2020. Learning to Validate the Predictions of Black Box Classifiers on Unseen Data. In SIGMOD. 1289--1299.
[71]
Sentiment 140 dataset. https://www.kaggle.com/kazanova/sentiment140
[72]
Shaoxu Song and Lei Chen. 2011. Differential dependencies: Reasoning and discovery. ACM Transactions on Database Systems (TODS) 36, 3 (2011), 1--41.
[73]
Julia Stoyanovich and Bill Howe. 2019. Nutritional Labels for Data and Models. IEEE Data Eng. Bull. 42, 3 (2019), 13--23.
[74]
Paroma Varma, Dan Iter, Christopher De Sa, and Christopher Ré. 2017. Flipper: A systematic approach to debugging training sets. In Proceedings of the 2nd Workshop on Human-in-the-Loop Data Analytics. 1--5.
[75]
Xiaolan Wang, Xin Luna Dong, and Alexandra Meliou. 2015. Data X-Ray: A Diagnostic Tool for Data Errors. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Melbourne, Victoria, Australia, May 31 - June 4, 2015. 1231--1245.
[76]
Weiyuan Wu, Lampros Flokas, Eugene Wu, and Jiannan Wang. 2020. Complaint-driven Training Data Debugging for Query 2.0. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. 1317--1334.
[77]
Jing Nathan Yan, Oliver Schulte, Mohan Zhang, Jiannan Wang, and Reynold Cheng. 2020. SCODED: Statistical Constraint Oriented Data Error Detection. In Proceedings of the 2020 International Conference on Management of Data, SIGMOD Conference 2020, online conference [Portland, OR, USA], June 14--19, 2020. 845--860.
[78]
Andreas Zeller. 1999. Yesterday, My Program Worked. Today, It Does Not. Why?. In Software Engineering - ESEC/FSE'99, 7th European Software Engineering Conference, Held Jointly with the 7th ACM SIGSOFT Symposium on the Foundations of Software Engineering, Toulouse, France, September 1999, Proceedings. 253--267.
[79]
Alice X. Zheng, Michael I. Jordan, Ben Liblit, Mayur Naik, and Alex Aiken. 2006. Statistical Debugging: Simultaneous Identification of Multiple Bugs. In Proceedings of ICML (Pittsburgh, Pennsylvania, USA) (ICML'06). ACM, New York, NY, USA, 1105--1112.

Cited By

View all
  • (2024)Fainder: A Fast and Accurate Index for Distribution-Aware Dataset SearchProceedings of the VLDB Endowment10.14778/3681954.368199917:11(3269-3282)Online publication date: 30-Aug-2024
  • (2024)Counterfactual Explanation at Will, with Zero Privacy LeakageProceedings of the ACM on Management of Data10.1145/36549332:3(1-29)Online publication date: 30-May-2024
  • (2024)Inferring Data Preconditions from Deep Learning Models for Trustworthy Prediction in DeploymentProceedings of the IEEE/ACM 46th International Conference on Software Engineering10.1145/3597503.3623333(1-13)Online publication date: 20-May-2024
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
SIGMOD '22: Proceedings of the 2022 International Conference on Management of Data
June 2022
2597 pages
ISBN:9781450392495
DOI:10.1145/3514221
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 11 June 2022

Permissions

Request permissions for this article.

Check for updates

Badges

Author Tags

  1. causal testing
  2. data profiles
  3. debugging
  4. root-cause identification

Qualifiers

  • Research-article

Funding Sources

Conference

SIGMOD/PODS '22
Sponsor:

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)379
  • Downloads (Last 6 weeks)19
Reflects downloads up to 13 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2024)Fainder: A Fast and Accurate Index for Distribution-Aware Dataset SearchProceedings of the VLDB Endowment10.14778/3681954.368199917:11(3269-3282)Online publication date: 30-Aug-2024
  • (2024)Counterfactual Explanation at Will, with Zero Privacy LeakageProceedings of the ACM on Management of Data10.1145/36549332:3(1-29)Online publication date: 30-May-2024
  • (2024)Inferring Data Preconditions from Deep Learning Models for Trustworthy Prediction in DeploymentProceedings of the IEEE/ACM 46th International Conference on Software Engineering10.1145/3597503.3623333(1-13)Online publication date: 20-May-2024
  • (2023)On Irregularity Localization for Scientific Data Analysis WorkflowsComputational Science – ICCS 202310.1007/978-3-031-35995-8_24(336-351)Online publication date: 3-Jul-2023
  • (2022)Outcome-Preserving Input Reduction for Scientific Data Analysis WorkflowsProceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering10.1145/3551349.3559558(1-5)Online publication date: 10-Oct-2022

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Login options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media