Software fault prediction with imbalanced datasets using SMOTE-Tomek sampling technique and Genetic Algorithm models

Multimedia Tools and Applications

Abstract

Over the years, machine learning (ML) techniques for forecasting software faults have been discussed extensively. Choosing a suitable ML technique for fault prediction modelling is challenging because prediction performance varies across techniques and software systems. An evaluation of previously presented software fault prediction (SFP) approaches revealed that single machine learning-based models did not deliver the best accuracy and F1-score in any context, emphasizing the need to combine machine learning models with additional techniques such as sampling and feature selection. To address this issue, we present and discuss a method for predicting software faults that relies on choosing the most suitable machine learning and deep learning techniques from a pool of accurate and competitive learning techniques to construct a fault prediction model. The presented approach chooses the best features using the Mutual Information feature selection technique. The class imbalance (CI) problem is addressed with a hybrid sampling technique (SMOTE-Tomek). Finally, a Genetic Algorithm-based machine learning model (GA-DT) and a deep learning model (ANN-DT) are developed to predict faults in software. For empirical evaluation, three versions of the Eclipse dataset (2.0, 2.1, and 3.0) are considered. Precision, recall, accuracy, and F1-score are the performance metrics used to evaluate the effectiveness of the proposed approach. The results demonstrate that the proposed approach (GA-DT and ANN-DT) effectively predicts software faults, with ANN-DT providing the best accuracy for all three versions of the Eclipse dataset.
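The four metrics named in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation code: the `evaluate` helper, the `Scores` container, and the toy labels are hypothetical, and reporting a ratio with a zero denominator as 0.0 is an assumed convention. The example shows why accuracy alone can mislead on an imbalanced dataset, which is presumably why F1-score is reported alongside it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scores:
    accuracy: float
    precision: float
    recall: float
    f1: float


def evaluate(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for a binary task.

    `positive` marks the faulty class. Ratios with a zero denominator
    are reported as 0.0 (an assumed convention, not taken from the paper).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return Scores(accuracy, precision, recall, f1)


# Hypothetical toy labels: 10 modules, 2 of them faulty (class 1).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
always_clean = [0] * 10                      # predicts "no fault" everywhere
mixed = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]      # one hit, one miss, one false alarm

baseline = evaluate(y_true, always_clean)
model = evaluate(y_true, mixed)
print(baseline)
print(model)
```

Both predictors reach the same accuracy on these toy labels, yet the one that never flags a fault has zero recall and zero F1-score, while the second recovers half of the faulty modules.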

Data availability statement

Data sharing is not applicable to this article, as no datasets were generated or analyzed during the current study. We have used the publicly available Eclipse version datasets (Eclipse 2.0, Eclipse 2.1, and Eclipse 3.0).

Author information

Corresponding author

Correspondence to Mansi Gupta.

Ethics declarations

Conflicts of interest

The authors declare no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Gupta, M., Rajnish, K. & Bhattacharjee, V. Software fault prediction with imbalanced datasets using SMOTE-Tomek sampling technique and Genetic Algorithm models. Multimed Tools Appl 83, 47627–47648 (2024). https://doi.org/10.1007/s11042-023-16788-7

  • DOI: https://doi.org/10.1007/s11042-023-16788-7
