Abstract
Over the years, there has been considerable discussion of machine learning (ML) techniques for forecasting software faults. Choosing a suitable ML technique for fault prediction modelling is challenging because the prediction performance of these techniques varies across software systems. An evaluation of previously presented software fault prediction (SFP) approaches revealed that no single machine learning model delivered the best accuracy and F1-score in every context, which emphasizes the need to combine machine learning models with additional techniques such as sampling and feature selection. To address this issue, we present and discuss a method for predicting software faults that relies on choosing the most suitable machine learning and deep learning techniques from a pool of accurate and competitive learning techniques in order to construct a fault prediction model. The presented approach chooses the best features using the Mutual Information feature selection technique. The issue of class imbalance (CI) is addressed using a hybrid sampling technique (SMOTE-Tomek). Finally, a Genetic Algorithm-based machine learning model (GA-DT) and a deep learning model (ANN-DT) are developed to predict faults in software. For empirical evaluation, the Eclipse version dataset (2.0, 2.1, and 3.0) is considered. Precision, recall, accuracy, and F1-score are the performance metrics used to evaluate the effectiveness of the proposed approach. The results demonstrate that the proposed approach (GA-DT and ANN-DT) effectively predicts software faults, with ANN-DT providing the best accuracy on all three versions of the Eclipse dataset.
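The four evaluation metrics named in the abstract can be illustrated with a minimal sketch. The code below is not the authors' implementation: the `evaluate` function, the `Metrics` container, and the toy labels are hypothetical, chosen only to show why accuracy alone can mislead on an imbalanced fault dataset while precision, recall, and F1-score do not.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    accuracy: float
    precision: float
    recall: float
    f1: float


def evaluate(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for binary labels.

    `positive` marks the faulty class. Ratios with a zero denominator
    are reported as 0.0 (an assumption made for this sketch).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return Metrics(accuracy, precision, recall, f1)


# Toy imbalanced data (hypothetical): 2 faulty modules out of 10.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# A model that labels every module as clean.
all_clean = evaluate(y_true, [0] * 10)

# A model that finds one of the two faults and raises one false alarm.
one_hit = evaluate(y_true, [1, 0, 1, 0, 0, 0, 0, 0, 0, 0])
```

Both toy models reach the same accuracy, but the first one never detects a fault, so its recall and F1-score are zero. This is the kind of gap that motivates reporting all four metrics when the faulty class is rare.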
Data availability statement
Data sharing is not applicable to this article, as no datasets were generated or analyzed during the current study. We have used the publicly available Eclipse version dataset (Eclipse 2.0, Eclipse 2.1, and Eclipse 3.0).
Ethics declarations
Conflicts of interest
The authors declare no conflict of interest.
About this article
Cite this article
Gupta, M., Rajnish, K. & Bhattacharjee, V. Software fault prediction with imbalanced datasets using SMOTE-Tomek sampling technique and Genetic Algorithm models. Multimed Tools Appl 83, 47627–47648 (2024). https://doi.org/10.1007/s11042-023-16788-7
DOI: https://doi.org/10.1007/s11042-023-16788-7