Abstract
Using software metrics to quantify source code, various approaches have been proposed for locating defect-prone source code units within software projects. Most of these approaches rely on supervised learning algorithms, which require labeled data to adjust their parameters during the learning phase. However, such labeled training data is usually not available. Unsupervised algorithms do not require training data and can therefore help to overcome this limitation.
In this work, we evaluate the effectiveness of unsupervised learning, by means of cluster-based algorithms and outlier mining algorithms, for the task of defect prediction, i.e., locating defect-prone source code units. We investigate the effect of various class balancing and feature compression techniques as preprocessing steps and show how sliding windows can be used to capture time series of source code metrics. As representatives of outlier mining techniques, we evaluate the Isolation Forest and the Local Outlier Factor. Our experiments on three publicly available datasets, containing a total of 11 software projects, indicate that considering time series can improve on static examinations by up to 3%. The results further show that supervised algorithms can outperform unsupervised approaches on all projects. Among the unsupervised approaches, the Isolation Forest achieves the best accuracy on 10 out of 11 projects.
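To make the setup concrete, the following is a minimal Python sketch, assuming scikit-learn and synthetic metric values rather than the paper's actual datasets and configuration: it stacks per-file metric time series into sliding-window feature vectors and flags defect-prone candidates with the Isolation Forest and the Local Outlier Factor. The window length, contamination rate, and neighbor count are illustrative assumptions.

```python
# Minimal sketch of outlier-mining-based defect prediction.
# All data and parameter choices here are illustrative assumptions,
# not the configuration evaluated in the paper.
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(seed=42)

# Toy input: static source code metrics (e.g., size, complexity, coupling)
# for 200 files across 4 consecutive releases -> (releases, files, metrics).
n_releases, n_files, n_metrics = 4, 200, 3
metrics = rng.normal(size=(n_releases, n_files, n_metrics))

def sliding_windows(series, window=2):
    """Stack the metrics of `window` consecutive releases per file,
    turning each file's metric time series into one flat feature vector."""
    n_files = series.shape[1]
    return [
        series[start:start + window].transpose(1, 0, 2).reshape(n_files, -1)
        for start in range(series.shape[0] - window + 1)
    ]

# Use the most recent window as the feature matrix (files x features).
X = sliding_windows(metrics, window=2)[-1]

# Isolation Forest: files isolated by few random splits are flagged.
iso_pred = IsolationForest(contamination=0.1, random_state=42).fit_predict(X)

# Local Outlier Factor: files in locally sparse regions are flagged.
lof_pred = LocalOutlierFactor(n_neighbors=20, contamination=0.1).fit_predict(X)

# fit_predict returns -1 for outliers, i.e., defect-prone candidates.
print("IF flags:", int((iso_pred == -1).sum()),
      "LOF flags:", int((lof_pred == -1).sum()))
```

In an evaluation, the flagged files would be compared against known defect labels, which is how accuracy figures such as those reported above can be obtained.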
Notes
For completeness, we also evaluated the option of applying no balancing or no feature compression technique. Those results are, as expected, weaker (cf. auxiliary material); a sketch of both preprocessing steps follows below.
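As an illustration of the two preprocessing families this note refers to, the sketch below composes one balancing technique (SMOTE, via imbalanced-learn) with one feature compression technique (PCA). Both concrete choices and the parameter values are assumptions for illustration, not necessarily the variants evaluated in the paper.

```python
# Minimal sketch of class balancing and feature compression as
# preprocessing steps; the concrete techniques (SMOTE, PCA) and the
# parameter values are illustrative assumptions.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.decomposition import PCA

rng = np.random.default_rng(seed=0)

# Toy metrics matrix with an imbalanced defect label (10% defective files).
X = rng.normal(size=(300, 20))
y = np.zeros(300, dtype=int)
y[:30] = 1

# Balancing: synthetically oversample the defective minority class.
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X, y)

# Feature compression: project the correlated metrics onto fewer dimensions.
X_comp = PCA(n_components=5).fit_transform(X_bal)

print("shape:", X.shape, "->", X_comp.shape,
      "| class counts after SMOTE:", np.bincount(y_bal))
```

Skipping either step corresponds to the "no balancing" and "no feature compression" configurations mentioned in this note.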
Acknowledgement
We thank the anonymous reviewers for their valuable feedback. This work was partially funded by the German Federal Ministry of Education and Research (BMBF) through grants 01IS20088B (“KnowhowAnalyzer”) and 01IS22062 (“AI research group FFS-AI”).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Cech, T., Atzberger, D., Scheibel, W., Misra, S., Döllner, J. (2023). Outlier Mining Techniques for Software Defect Prediction. In: Mendez, D., Winkler, D., Kross, J., Biffl, S., Bergsmann, J. (eds.) Software Quality: Higher Software Quality through Zero Waste Development. SWQD 2023. Lecture Notes in Business Information Processing, vol. 472. Springer, Cham. https://doi.org/10.1007/978-3-031-31488-9_3