EkmEx - an extended framework for labeling an unlabeled fault dataset

Muhammad Rizwan¹,
Aamer Nadeem¹,
Sohail Sarwar ORCID: orcid.org/0000-0001-7565-439X²,
Muddesar Iqbal²,
Muhammad Safyan³ &
…
Zia Ul Qayyum⁴

Abstract

Software fault prediction (SFP) is a quality assurance process that identifies whether software modules are fault-prone (FP) or not-fault-prone (NFP), thereby reducing the cost and time spent on testing. Supervised machine learning techniques can identify FP modules, but they require fault information from previous versions of the software product. Such information, accumulated over the software life cycle, may be neither readily available nor reliable. When no fault information exists, clustering combined with expert opinion is currently a prudent choice for labeling modules. However, this technique does not fully address important aspects such as the selection of experts, conflicts between expert opinions, and the diverse expertise of domain experts. In this paper, we propose a comprehensive framework named EkmEx that extends conventional fault prediction approaches and provides a mathematical foundation for these previously unaddressed aspects. EkmEx guides the selection of experts, furnishes an objective solution for resolving conflicting verdicts, and manages the diversity in expertise among domain experts. We performed expert-assisted module labeling with EkmEx and with conventional clustering on seven public NASA datasets. The empirical results show significant potential of the proposed framework in identifying FP modules across all seven datasets.
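
To make the general idea of expert-assisted cluster labeling concrete, the following is a minimal sketch in Python. It is not the EkmEx formulation, which the abstract does not detail. It assumes plain k-means over two module metrics (lines of code and cyclomatic complexity), treats each expert as a hypothetical rule that inspects a cluster centroid, and resolves conflicting verdicts with a simple weighted vote. The metric values, thresholds, expert names, and weights are all illustrative assumptions.

```python
from math import dist

def kmeans(points, k, iters=50):
    """Plain Lloyd's k-means; deterministic init from the first k points."""
    centroids = [tuple(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assign every module to its nearest centroid.
        assign = [min(range(k), key=lambda c: dist(p, centroids[c]))
                  for p in points]
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                # New centroid is the per-metric mean of the cluster members.
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, assign

def weighted_verdict(votes, weights):
    """Resolve conflicting expert verdicts ('FP' or 'NFP') by weighted vote."""
    score = sum(weights[e] if v == "FP" else -weights[e]
                for e, v in votes.items())
    return "FP" if score > 0 else "NFP"

def label_modules(points, k, experts, weights):
    """Cluster the modules, have experts judge each centroid, label members."""
    centroids, assign = kmeans(points, k)
    cluster_labels = [
        weighted_verdict({name: rule(c) for name, rule in experts.items()},
                         weights)
        for c in centroids
    ]
    return [cluster_labels[a] for a in assign]

# Hypothetical modules described by (lines of code, cyclomatic complexity).
MODULES = [(10, 1), (200, 25), (12, 2), (220, 30), (15, 1), (180, 20)]

# Hypothetical experts, each applying a personal threshold to a centroid.
EXPERTS = {
    "expert_a": lambda c: "FP" if c[1] > 10 else "NFP",   # complexity-focused
    "expert_b": lambda c: "FP" if c[0] > 100 else "NFP",  # size-focused
    "expert_c": lambda c: "FP" if c[0] > 300 else "NFP",  # stricter on size
}

# Hypothetical weights standing in for differing levels of expertise.
WEIGHTS = {"expert_a": 0.5, "expert_b": 0.3, "expert_c": 0.2}

if __name__ == "__main__":
    print(label_modules(MODULES, 2, EXPERTS, WEIGHTS))
```

In this toy run the three experts disagree about the cluster of large, complex modules; the weighted vote is one simple way to turn that disagreement into a single label, and the weights are where differences in expertise would enter.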

Author information

Authors and Affiliations

Department of Computer Science, Capital University of Science and Technology, Islamabad, Pakistan
Muhammad Rizwan & Aamer Nadeem
Department of Computer Science, London South Bank University, London, England
Sohail Sarwar & Muddesar Iqbal
Department of Computer Science, GC University, Lahore, Pakistan
Muhammad Safyan
Department of Computer Science, Allama Iqbal Open University, Islamabad, Pakistan
Zia Ul Qayyum

Corresponding author

Correspondence to Sohail Sarwar.

About this article

Cite this article

Rizwan, M., Nadeem, A., Sarwar, S. et al. EkmEx - an extended framework for labeling an unlabeled fault dataset. Multimed Tools Appl 81, 12141–12156 (2022). https://doi.org/10.1007/s11042-021-11441-7

Received: 30 July 2020
Revised: 10 August 2021
Accepted: 17 August 2021
Published: 08 January 2022
Issue Date: April 2022
DOI: https://doi.org/10.1007/s11042-021-11441-7
