A comparative study of software defect binomial classification prediction models based on machine learning

  • Research
  • Software Quality Journal

Abstract

As information technology advances, software applications are becoming increasingly critical. However, the growing size and complexity of software can introduce serious flaws that lead to significant financial losses. Software Defect Prediction (SDP) addresses this problem by detecting and resolving defects early in the development process, helping to ensure high software quality. As a result, SDP has become a major research focus for academics worldwide. This study compares machine learning-based SDP models and determines whether traditional machine learning algorithms affect SDP outcomes. Unlike previous studies, which sought a single best prediction model for all datasets, this paper constructs the best-performing SDP model separately for each dataset. Using the publicly available ESEM2016 dataset, 13 machine learning classification algorithms are employed to predict software defects. Their performance is assessed with four evaluation indicators: Accuracy, AUC (Area Under the Curve), F-measure, and Running Time (RT). Because the dataset suffers from a serious class imbalance problem, 10 sampling methods are combined with the 13 algorithms to explore how sampling techniques affect the performance of traditional machine learning classification models. Finally, a comprehensive evaluation identifies the best combination of sampling technique and classification model, which forms the final dominant model for SDP.
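To make the evaluation indicators concrete, the following is a minimal sketch in plain Python of how Accuracy, F-measure, AUC, and running time might be computed, together with one simple sampling technique. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy labels, and the choice of random oversampling are all hypothetical, and the abstract does not say which 10 sampling methods the study uses.

```python
import random
import time
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f_measure(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive (defective) class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc(y_true, scores, positive=1):
    """AUC as the chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest.

    Assumption: chosen only for illustration; not necessarily one of the paper's methods.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def timed(fn, *args):
    """Return fn(*args) together with the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


# Hypothetical toy data: 9 clean modules (label 0) and 1 defective module (label 1).
y_true = [0] * 9 + [1]
always_clean = [0] * 10  # a trivial model that never predicts a defect

print("accuracy:", accuracy(y_true, always_clean))
print("f-measure:", f_measure(y_true, always_clean))

(balanced_rows, balanced_labels), seconds = timed(random_oversample, list(range(10)), y_true)
print("class counts after oversampling:", Counter(balanced_labels))
print("running time (s):", seconds)
```

On this toy data the trivial model reaches an accuracy of 0.9 while its F-measure is 0, which illustrates why accuracy alone is a weak indicator on class-imbalanced defect data and why AUC and F-measure are reported alongside it.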


Data availability

The dataset used in this research is openly accessible via: https://github.com/tjshippey/ESEM2016.


Acknowledgements

The authors gratefully acknowledge financial support from the National Natural Science Foundation of China, the Department of Science and Technology of Henan Province, and the Zhengzhou University of Light Industry.

Funding

This work was financially supported by the National Natural Science Foundation of China (61906175), the Doctoral Research Fund of Zhengzhou University of Light Industry (2020BSJJ067), the Science and Technology Project of Henan Province (222102210096, 232102210014, 242102210033, 242102211050), and the Henan Province Higher Education Teaching Reform Research and Practice Project (2021SJGLX292).

Author information

Contributions

Hongwei Tao: conceptualization, methodology, writing original draft, validation, supervision. Xiaoxu Niu: data curation, methodology, writing original draft, software, investigation, validation. Lang Xu: data curation, investigation. Lianyou Fu: writing–review and editing. Qiaoling Cao: writing–review and editing. Haoran Chen: writing–review and editing. Songtao Shang: writing–review and editing. Yang Xian: writing–review and editing.

Corresponding author

Correspondence to Hongwei Tao.

Ethics declarations

Competing interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Tao, H., Niu, X., Xu, L. et al. A comparative study of software defect binomial classification prediction models based on machine learning. Software Qual J 32, 1203–1237 (2024). https://doi.org/10.1007/s11219-024-09683-3

  • DOI: https://doi.org/10.1007/s11219-024-09683-3
