DOI: 10.1145/3522664.3528590
Research article · Open access

Data smells: categories, causes and consequences, and detection of suspicious data in AI-based systems

Published: 17 October 2022

Abstract

High data quality is fundamental for today's AI-based systems. However, although data quality has been an object of research for decades, there is a clear lack of research on potential data quality issues (e.g., ambiguous, extraneous values). These kinds of issues are latent in nature and thus often not obvious. Nevertheless, they can be associated with an increased risk of future problems in AI-based systems (e.g., technical debt, data-induced faults). As a counterpart to code smells in software engineering, we refer to such issues as Data Smells. This article conceptualizes data smells and elaborates on their causes, consequences, detection, and use in the context of AI-based systems. In addition, a catalogue of 36 data smells divided into three categories (i.e., Believability Smells, Understandability Smells, Consistency Smells) is presented. Moreover, the article outlines tool support for detecting data smells and presents the result of an initial smell detection on more than 240 real-world datasets.
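To make the idea of latent, suspicious-but-not-invalid data concrete, the sketch below checks one column of string values for three example smells from the abstract's categories: disguised missing values (a believability smell), extraneous whitespace (an understandability smell), and mixed date formats (a consistency smell). The smell names, placeholder list, and thresholds here are illustrative assumptions for this sketch, not the detection rules of the tool described in the paper.

```python
# Illustrative sketch: flag three example data smells in a string column.
# The placeholder set and the two date patterns are assumptions chosen
# for this example, not the paper's actual detection rules.

import re
from collections import Counter

# Common placeholders that disguise missing values as real data.
SUSPICIOUS_PLACEHOLDERS = {"n/a", "na", "none", "null", "-", "?", "unknown", "999"}

def detect_smells(column):
    """Return a list of (smell_name, evidence) tuples for one column."""
    findings = []

    # 1) Believability: placeholder values masquerading as real data.
    placeholders = [v for v in column if v.strip().lower() in SUSPICIOUS_PLACEHOLDERS]
    if placeholders:
        findings.append(("suspicious placeholder values", sorted(set(placeholders))))

    # 2) Understandability: values that change after stripping whitespace.
    padded = [v for v in column if v != v.strip()]
    if padded:
        findings.append(("extraneous whitespace", padded))

    # 3) Consistency: more than one syntactic pattern among date-like values.
    date_patterns = Counter()
    for v in column:
        if re.fullmatch(r"\d{4}-\d{2}-\d{2}", v.strip()):
            date_patterns["YYYY-MM-DD"] += 1
        elif re.fullmatch(r"\d{2}/\d{2}/\d{4}", v.strip()):
            date_patterns["DD/MM/YYYY"] += 1
    if len(date_patterns) > 1:
        findings.append(("mixed date formats", dict(date_patterns)))

    return findings

sample = ["2020-03-11", "24/03/2020", " 2020-03-25", "n/a"]
for smell, evidence in detect_smells(sample):
    print(smell, "->", evidence)
```

Note that none of the sample values is invalid in isolation; each smell only becomes visible when the column is inspected as a whole, which is what makes such issues latent.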




Published In

CAIN '22: Proceedings of the 1st International Conference on AI Engineering: Software Engineering for AI
May 2022, 254 pages
ISBN: 9781450392754
DOI: 10.1145/3522664

In-Cooperation

  • IEEE TCSC: IEEE Technical Committee on Scalable Computing

Publisher

Association for Computing Machinery, New York, NY, United States


Funding Sources

  • State of Upper Austria
  • Austrian Federal Ministry for Climate Action, Environment, Energy, Mobility, Innovation and Technology
  • Austrian Federal Ministry for Digital and Economic Affairs


Article Metrics

  • Downloads (last 12 months): 606
  • Downloads (last 6 weeks): 54

Reflects downloads up to 14 December 2024.

Cited By

  • (2024) Contract-based Validation of Conceptual Design Bugs for Engineering Complex Machine Learning Software. Proceedings of the ACM/IEEE 27th International Conference on Model Driven Engineering Languages and Systems, 155--161. DOI: 10.1145/3652620.3688201. Online publication date: 22 September 2024.
  • (2024) PLUS: A Semi-automated Pipeline for Fraud Detection in Public Bids. Digital Government: Research and Practice 5, 1, 1--16. DOI: 10.1145/3616396. Online publication date: 12 March 2024.
  • (2024) Data Guards: Challenges and Solutions for Fostering Trust in Data. 2024 IEEE Visualization and Visual Analytics (VIS), 56--60. DOI: 10.1109/VIS55277.2024.00019. Online publication date: 13 October 2024.
  • (2024) A Multivocal Mapping Study of MongoDB Smells. 2024 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), 792--803. DOI: 10.1109/SANER60148.2024.00086. Online publication date: 12 March 2024.
  • (2024) An Identification and Analysis of Emotion Via Number of Visits and PR In Designing ML Models. 2024 4th International Conference on Advance Computing and Innovative Technologies in Engineering (ICACITE), 1252--1256. DOI: 10.1109/ICACITE60783.2024.10617396. Online publication date: 14 May 2024.
  • (2024) Technical debt in AI-enabled systems: On the prevalence, severity, impact, and management strategies for code and architecture. Journal of Systems and Software 216, 112151. DOI: 10.1016/j.jss.2024.112151. Online publication date: October 2024.
  • (2024) Enterprise architecture-based metamodel for machine learning projects and its management. Future Generation Computer Systems 161, 135--145. DOI: 10.1016/j.future.2024.06.062. Online publication date: December 2024.
  • (2024) Insights into commonalities of a sample: A visualization framework to explore unusual subset-dataset relationships. Data & Knowledge Engineering 151, 102299. DOI: 10.1016/j.datak.2024.102299. Online publication date: May 2024.
  • (2023) What Do Users Ask in Open-Source AI Repositories? An Empirical Study of GitHub Issues. 2023 IEEE/ACM 20th International Conference on Mining Software Repositories (MSR), 79--91. DOI: 10.1109/MSR59073.2023.00024. Online publication date: May 2023.
  • (2023) Algorithm Debt: Challenges and Future Paths. 2023 IEEE/ACM 2nd International Conference on AI Engineering -- Software Engineering for AI (CAIN), 90--91. DOI: 10.1109/CAIN58948.2023.00020. Online publication date: May 2023.
