DOI: 10.1145/3691620.3695017
Research Article
Open Access

AI-driven Java Performance Testing: Balancing Result Quality with Testing Time

Published: 27 October 2024

Abstract

Performance testing aims at uncovering efficiency issues of software systems. In order to be both effective and practical, the design of a performance test must achieve a reasonable trade-off between result quality and testing time. This becomes particularly challenging in the Java context, where the software undergoes a warm-up phase of execution due to just-in-time compilation. During this phase, performance measurements are subject to severe fluctuations, which may adversely affect the quality of performance test results. Both practitioners and researchers have proposed approaches to mitigate this issue. Practitioners typically rely on a fixed number of iterated executions to warm up the software before starting to collect performance measurements (state-of-practice). Researchers have developed techniques that can dynamically stop warm-up iterations at runtime (state-of-the-art). However, these approaches often provide suboptimal estimates of the warm-up phase, resulting in either insufficient or excessive warm-up iterations, which may degrade result quality or increase testing time. There is still a lack of consensus on how to properly address this problem. Here, we propose and study an AI-based framework to dynamically halt warm-up iterations at runtime. Specifically, our framework leverages recent advances in AI for Time Series Classification (TSC) to predict the end of the warm-up phase during test execution. We conduct experiments by training three different TSC models on half a million measurement segments obtained from JMH microbenchmark executions. We find that our framework significantly improves the accuracy of the warm-up estimates provided by state-of-practice and state-of-the-art methods. This higher estimation accuracy results in a net improvement in either result quality or testing time for up to +35.3% of the microbenchmarks. Our study highlights that integrating AI to dynamically estimate the end of the warm-up phase can enhance the cost-effectiveness of Java performance testing.
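
For illustration, the "state-of-practice" mentioned above corresponds to fixing the number of warm-up iterations up front in the benchmark harness configuration. The following JMH microbenchmark is a minimal sketch of that practice; the benchmark class, workload, and the specific iteration counts are hypothetical and not taken from the paper.

    import java.util.Arrays;
    import java.util.concurrent.ThreadLocalRandom;
    import java.util.concurrent.TimeUnit;

    import org.openjdk.jmh.annotations.*;

    // Hypothetical JMH microbenchmark. The fixed @Warmup setting is the
    // "state-of-practice": too few warm-up iterations let JIT-induced
    // fluctuations leak into the measurements (lower result quality),
    // too many waste testing time.
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    @Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)       // fixed, chosen up front
    @Measurement(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS) // intended steady-state measurements
    @Fork(3)
    @State(Scope.Benchmark)
    public class SortBenchmark {

        private int[] data;

        @Setup
        public void setUp() {
            data = ThreadLocalRandom.current().ints(10_000).toArray();
        }

        @Benchmark
        public int[] sortCopy() {
            int[] copy = data.clone();
            Arrays.sort(copy);
            return copy;
        }
    }

The framework proposed in the paper instead decides at runtime when warm-up has ended. The sketch below conveys the general idea only: a trained time series classifier inspects the most recent segment of measurements and halts warm-up once it predicts steady state. The WarmupClassifier interface, window size, and safety cap are illustrative assumptions, not the paper's actual implementation.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.DoubleSupplier;

    // Hypothetical sketch of dynamic warm-up halting driven by a TSC model.
    interface WarmupClassifier {
        // Returns true if the given segment of measurements looks like steady state.
        boolean isSteadyState(double[] segment);
    }

    class DynamicWarmupRunner {
        private static final int WINDOW = 50;       // measurements per classified segment (assumed)
        private static final int MAX_WARMUP = 3000; // safety cap on warm-up iterations (assumed)

        static List<Double> run(WarmupClassifier model, DoubleSupplier oneIteration) {
            List<Double> warmup = new ArrayList<>();
            // Warm-up phase: keep iterating until the classifier flags steady state.
            while (warmup.size() < MAX_WARMUP) {
                warmup.add(oneIteration.getAsDouble());
                if (warmup.size() % WINDOW == 0) {
                    double[] segment = warmup.subList(warmup.size() - WINDOW, warmup.size())
                                             .stream().mapToDouble(Double::doubleValue).toArray();
                    if (model.isSteadyState(segment)) {
                        break; // halt warm-up early instead of running a fixed count
                    }
                }
            }
            // Measurement phase: only post-warm-up iterations contribute to the result.
            List<Double> measurements = new ArrayList<>();
            for (int i = 0; i < 100; i++) {
                measurements.add(oneIteration.getAsDouble());
            }
            return measurements;
        }
    }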

Published In

ASE '24: Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering
October 2024
2587 pages
ISBN: 9798400712487
DOI: 10.1145/3691620
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 October 2024

Author Tags

  1. microbenchmarking
  2. JMH
  3. Java
  4. time series classification

Qualifiers

  • Research-article

Funding Sources

  • Ministero dell'Università e della Ricerca
  • European Union - NextGenerationEU

Conference

ASE '24

Acceptance Rates

Overall Acceptance Rate 82 of 337 submissions, 24%

Article Metrics

  • Total Citations: 0
  • Total Downloads: 97
  • Downloads (last 12 months): 97
  • Downloads (last 6 weeks): 97

Reflects downloads up to 11 Dec 2024
