Research article, ISSTA Conference Proceedings. DOI: 10.1145/3597926.3598080

Towards More Realistic Evaluation for Neural Test Oracle Generation

Published: 13 July 2023

Abstract

Unit testing has become an essential practice in software development and maintenance. Effective unit tests help guard and improve software quality, but they take substantial time and effort to write and maintain. A unit test consists of a test prefix and a test oracle. Synthesizing test oracles, especially functional oracles, is a well-known and challenging problem. Recent studies have proposed using neural models to generate test oracles, i.e., neural test oracle generation (NTOG), and have reported promising results. However, after a systematic inspection, we find that existing evaluation methods for NTOG rely on several inappropriate settings that can give a misleading picture of how well existing NTOG approaches perform. We summarize these settings as 1) generating test prefixes from bug-fixed program versions, 2) evaluating with an unrealistic metric, and 3) lacking a straightforward baseline. In this paper, we first investigate how these settings affect the evaluation and understanding of NTOG approaches. We find that 1) unrealistically generating test prefixes from bug-fixed program versions inflates the number of bugs found by the state-of-the-art NTOG approach TOGA by 61.8%, 2) FPR (False Positive Rate) is not a realistic evaluation metric, and the Precision of TOGA is only 0.38%, and 3) a straightforward baseline, NoException, which simply expects that no exception is raised, finds 61% of the bugs found by TOGA with twice the Precision. Furthermore, we add a ranking step to existing evaluation methods and propose an evaluation metric named Found@K to better measure the cost-effectiveness of NTOG approaches in terms of bug finding. We propose a novel unsupervised ranking method to instantiate this ranking step, which significantly improves the cost-effectiveness of TOGA. Finally, based on our experimental results and observations, we propose a more realistic evaluation method, TEval+, for NTOG and summarize seven rules of thumb to move NTOG approaches toward practical use.
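
To make the metrics named in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the standard confusion-matrix definitions of Precision and FPR, and it reads Found@K as "the number of distinct bugs revealed by the top K ranked failing tests"; the paper's exact definition may differ. The `FailedTest` record, the helper names, and all numbers are hypothetical and exist only for illustration. The `no_exception_verdict` function sketches the idea behind a NoException-style oracle: a test prefix passes if and only if it raises nothing.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class FailedTest:
    """Hypothetical record for one generated test that failed."""
    test_id: str
    score: float            # ranking score; higher means "inspect first"
    bug_id: Optional[str]   # bug revealed by the failure, None for a false alarm


def precision(tp: int, fp: int) -> float:
    """Share of failing tests that reveal a real bug: TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def false_positive_rate(fp: int, tn: int) -> float:
    """Share of non-bug-revealing tests that fail anyway: FP / (FP + TN)."""
    return fp / (fp + tn) if (fp + tn) else 0.0


def found_at_k(failed: Sequence[FailedTest], k: int) -> int:
    """Assumed reading of Found@K: distinct bugs among the top-k ranked failures."""
    ranked = sorted(failed, key=lambda t: t.score, reverse=True)
    return len({t.bug_id for t in ranked[:k] if t.bug_id is not None})


def no_exception_verdict(test_prefix: Callable[[], object]) -> bool:
    """NoException-style oracle sketch: pass iff the prefix raises nothing."""
    try:
        test_prefix()
    except Exception:
        return False
    return True


if __name__ == "__main__":
    # Made-up counts: false alarms are a small share of all passing-worthy
    # tests (low FPR) yet dominate the failures a developer must inspect.
    print(precision(tp=10, fp=990))                # 10 of 1000 failures are real
    print(false_positive_rate(fp=990, tn=99010))   # 990 of 100000 negatives fail

    failures = [
        FailedTest("t1", 0.9, None),
        FailedTest("t2", 0.8, "B1"),
        FailedTest("t3", 0.7, "B1"),
        FailedTest("t4", 0.6, None),
        FailedTest("t5", 0.5, "B2"),
    ]
    for k in (1, 2, 5):
        print(k, found_at_k(failures, k))
```

With these made-up counts, FPR looks small because true negatives dominate its denominator, while Precision shows that only a small share of the failures a developer would inspect point to a real bug. Found@K, under the assumed reading, rewards rankings that place bug-revealing failures early.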

Published In

ISSTA 2023: Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis
July 2023
1554 pages
ISBN:9798400702211
DOI:10.1145/3597926
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Neural Network
  2. Realistic Evaluation
  3. Test Oracle Generation

Qualifiers

  • Research-article

Conference

ISSTA '23

Acceptance Rates

Overall Acceptance Rate 58 of 213 submissions, 27%

Article Metrics

  • Downloads (last 12 months): 122
  • Downloads (last 6 weeks): 13

Reflects downloads up to 01 Jan 2025.

Cited By

  • (2024) Automatically Recommend Code Updates: Are We There Yet? ACM Transactions on Software Engineering and Methodology 33:8 (1-27). DOI: 10.1145/3678167. Online publication date: 16-Jul-2024
  • (2024) An Empirical Study on Focal Methods in Deep-Learning-Based Approaches for Assertion Generation. Proceedings of the ACM on Software Engineering 1:FSE (1750-1771). DOI: 10.1145/3660785. Online publication date: 12-Jul-2024
  • (2024) Practitioners’ Expectations on Automated Test Generation. Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis (1618-1630). DOI: 10.1145/3650212.3680386. Online publication date: 11-Sep-2024
  • (2024) Method-Level Test-to-Code Traceability Link Construction by Semantic Correlation Learning. IEEE Transactions on Software Engineering 50:10 (2656-2676). DOI: 10.1109/TSE.2024.3449917. Online publication date: Oct-2024
  • (2024) Assessing Evaluation Metrics for Neural Test Oracle Generation. IEEE Transactions on Software Engineering 50:9 (2337-2349). DOI: 10.1109/TSE.2024.3433463. Online publication date: 25-Jul-2024
  • (2024) Deep learning-based software engineering: progress, challenges, and opportunities. Science China Information Sciences 68:1. DOI: 10.1007/s11432-023-4127-5. Online publication date: 24-Dec-2024
  • (2023) An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation. IEEE Transactions on Software Engineering 50:1 (85-105). DOI: 10.1109/TSE.2023.3334955. Online publication date: 28-Nov-2023
