DOI:10.1145/3522664.3528596
research-article
Open access

Exploring ML testing in practice: lessons learned from an interactive rapid review with Axis Communications

Published: 17 October 2022

Abstract

There is growing interest in machine learning (ML) testing in both industry and academia. We believe that industry and academia need to learn together to produce rigorous and relevant knowledge. In this study, we initiate a collaboration between stakeholders from one case company, one research institute, and one university. To establish a common view of the problem domain, we applied an interactive rapid review of the state of the art. Four researchers from Lund University and RISE Research Institutes and four practitioners from Axis Communications reviewed a set of 180 primary studies on ML testing. We developed a taxonomy for the communication around ML testing challenges and results and identified a list of 12 review questions relevant for Axis Communications. The three most important questions (data testing, metrics for assessment, and test generation) were mapped to the literature, and we conducted an in-depth analysis of the 35 primary studies matching the most important question (data testing). A final set of the five best matches was analysed, and we reflect on the criteria for applicability and relevance for industry. The taxonomies are helpful for communication but not final. Furthermore, there was no perfect match to the case company's investigated review question (data testing). However, we extracted relevant approaches from the five studies on a conceptual level to support later context-specific improvements. We found the interactive rapid review approach useful for triggering and aligning communication between the different stakeholders.



Published In

CAIN '22: Proceedings of the 1st International Conference on AI Engineering: Software Engineering for AI
May 2022
254 pages
ISBN:9781450392754
DOI:10.1145/3522664
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.


In-Cooperation

  • IEEE TCSC: IEEE Technical Committee on Scalable Computing

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 October 2022


Author Tags

  1. AI engineering
  2. interactive rapid review
  3. machine learning
  4. taxonomy
  5. testing

Qualifiers

  • Research-article

Funding Sources

  • Kompetensfonden
  • Wallenberg AI, Autonomous Systems and Software Program (WASP)

Conference

CAIN '22

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 197
  • Downloads (Last 6 weeks): 12

Reflects downloads up to 13 Dec 2024

Citations

Cited By

  • (2024) Quality assurance strategies for machine learning applications in big data analytics: an overview. Journal of Big Data 11:1. DOI:10.1186/s40537-024-01028-y. Online publication date: 30-Oct-2024
  • (2024) Experiences from conducting rapid reviews in collaboration with practitioners — Two industrial cases. Information and Software Technology 167:C. DOI:10.1016/j.infsof.2023.107364. Online publication date: 12-Apr-2024
  • (2024) Using rapid reviews to support software engineering practice: a systematic review and a replication study. Empirical Software Engineering 30:1. DOI:10.1007/s10664-024-10545-6. Online publication date: 25-Oct-2024
  • (2023) What Do Users Ask in Open-Source AI Repositories? An Empirical Study of GitHub Issues. 2023 IEEE/ACM 20th International Conference on Mining Software Repositories (MSR), 79-91. DOI:10.1109/MSR59073.2023.00024. Online publication date: May-2023
  • (2023) Ergo, SMIRK is safe: a safety case for a machine learning component in a pedestrian automatic emergency brake system. Software Quality Journal 31:2, 335-403. DOI:10.1007/s11219-022-09613-1. Online publication date: 1-Mar-2023
