DOI: 10.1145/3522664.3528590
Research article · Open access

Data smells: categories, causes and consequences, and detection of suspicious data in AI-based systems

Published: 17 October 2022

Abstract

High data quality is fundamental for today's AI-based systems. However, although data quality has been an object of research for decades, there is a clear lack of research on potential data quality issues (e.g., ambiguous, extraneous values). These kinds of issues are latent in nature and thus often not obvious. Nevertheless, they can be associated with an increased risk of future problems in AI-based systems (e.g., technical debt, data-induced faults). As a counterpart to code smells in software engineering, we refer to such issues as Data Smells. This article conceptualizes data smells and elaborates on their causes, consequences, detection, and use in the context of AI-based systems. In addition, a catalogue of 36 data smells divided into three categories (i.e., Believability Smells, Understandability Smells, Consistency Smells) is presented. Moreover, the article outlines tool support for detecting data smells and presents the result of an initial smell detection on more than 240 real-world datasets.
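To make the idea of latent, suspicious-but-not-invalid data concrete, the sketch below checks one column of string values for three example smells from the abstract's categories: disguised missing values (a believability smell), extraneous whitespace (an understandability smell), and mixed date formats (a consistency smell). The smell names, placeholder list, and thresholds here are illustrative assumptions for this sketch, not the detection rules of the tool described in the paper.

```python
# Illustrative sketch: flag three example data smells in a string column.
# The placeholder set and the two date patterns are assumptions chosen
# for this example, not the paper's actual detection rules.

import re
from collections import Counter

# Common placeholders that disguise missing values as real data.
SUSPICIOUS_PLACEHOLDERS = {"n/a", "na", "none", "null", "-", "?", "unknown", "999"}

def detect_smells(column):
    """Return a list of (smell_name, evidence) tuples for one column."""
    findings = []

    # 1) Believability: placeholder values masquerading as real data.
    placeholders = [v for v in column if v.strip().lower() in SUSPICIOUS_PLACEHOLDERS]
    if placeholders:
        findings.append(("suspicious placeholder values", sorted(set(placeholders))))

    # 2) Understandability: values that change after stripping whitespace.
    padded = [v for v in column if v != v.strip()]
    if padded:
        findings.append(("extraneous whitespace", padded))

    # 3) Consistency: more than one syntactic pattern among date-like values.
    date_patterns = Counter()
    for v in column:
        if re.fullmatch(r"\d{4}-\d{2}-\d{2}", v.strip()):
            date_patterns["YYYY-MM-DD"] += 1
        elif re.fullmatch(r"\d{2}/\d{2}/\d{4}", v.strip()):
            date_patterns["DD/MM/YYYY"] += 1
    if len(date_patterns) > 1:
        findings.append(("mixed date formats", dict(date_patterns)))

    return findings

sample = ["2020-03-11", "24/03/2020", " 2020-03-25", "n/a"]
for smell, evidence in detect_smells(sample):
    print(smell, "->", evidence)
```

Note that none of the sample values is invalid in isolation; each smell only becomes visible when the column is inspected as a whole, which is what makes such issues latent.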




Published In

CAIN '22: Proceedings of the 1st International Conference on AI Engineering: Software Engineering for AI
May 2022, 254 pages
ISBN: 9781450392754
DOI: 10.1145/3522664

In-Cooperation

  • IEEE TCSC: IEEE Technical Committee on Scalable Computing

Publisher

Association for Computing Machinery, New York, NY, United States


Funding Sources

  • State of Upper Austria
  • Austrian Federal Ministry for Climate Action, Environment, Energy, Mobility, Innovation and Technology
  • Austrian Federal Ministry for Digital and Economic Affairs


Article Metrics

  • Downloads (last 12 months): 606
  • Downloads (last 6 weeks): 54

Reflects downloads up to 14 December 2024.

Cited By

  • (2024) Contract-based Validation of Conceptual Design Bugs for Engineering Complex Machine Learning Software. Proceedings of the ACM/IEEE 27th International Conference on Model Driven Engineering Languages and Systems, 155--161. DOI: 10.1145/3652620.3688201. Online publication date: 22 September 2024.
  • (2024) PLUS: A Semi-automated Pipeline for Fraud Detection in Public Bids. Digital Government: Research and Practice 5, 1, 1--16. DOI: 10.1145/3616396. Online publication date: 12 March 2024.
  • (2024) Data Guards: Challenges and Solutions for Fostering Trust in Data. 2024 IEEE Visualization and Visual Analytics (VIS), 56--60. DOI: 10.1109/VIS55277.2024.00019. Online publication date: 13 October 2024.
  • (2024) A Multivocal Mapping Study of MongoDB Smells. 2024 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), 792--803. DOI: 10.1109/SANER60148.2024.00086. Online publication date: 12 March 2024.
  • (2024) An Identification and Analysis of Emotion Via Number of Visits and PR In Designing ML Models. 2024 4th International Conference on Advance Computing and Innovative Technologies in Engineering (ICACITE), 1252--1256. DOI: 10.1109/ICACITE60783.2024.10617396. Online publication date: 14 May 2024.
  • (2024) Technical debt in AI-enabled systems: On the prevalence, severity, impact, and management strategies for code and architecture. Journal of Systems and Software 216, 112151. DOI: 10.1016/j.jss.2024.112151. Online publication date: October 2024.
  • (2024) Enterprise architecture-based metamodel for machine learning projects and its management. Future Generation Computer Systems 161, 135--145. DOI: 10.1016/j.future.2024.06.062. Online publication date: December 2024.
  • (2024) Insights into commonalities of a sample: A visualization framework to explore unusual subset-dataset relationships. Data & Knowledge Engineering 151, 102299. DOI: 10.1016/j.datak.2024.102299. Online publication date: May 2024.
  • (2023) What Do Users Ask in Open-Source AI Repositories? An Empirical Study of GitHub Issues. 2023 IEEE/ACM 20th International Conference on Mining Software Repositories (MSR), 79--91. DOI: 10.1109/MSR59073.2023.00024. Online publication date: May 2023.
  • (2023) Algorithm Debt: Challenges and Future Paths. 2023 IEEE/ACM 2nd International Conference on AI Engineering -- Software Engineering for AI (CAIN), 90--91. DOI: 10.1109/CAIN58948.2023.00020. Online publication date: May 2023.
