Abstract
Timely detection of high-risk program modules in high-assurance software is critical for avoiding the severe consequences of operational failures. While software risk can originate from external sources, such as management or outsourcing, software quality is adversely affected when internal software risks are realized, such as improper practice of standard software processes or the lack of a defined software quality infrastructure. Practitioners employ various techniques to identify and rectify high-risk or low-quality program modules. The effectiveness of detecting such modules depends on the software measurements used, making feature selection an important step in software quality prediction. We use a wrapper-based feature ranking technique to select the optimal set of software metrics for building defect prediction models. We also address the adverse effects of class imbalance (very few low-quality modules compared to high-quality modules), a practical problem observed in high-assurance systems. Applying a data sampling technique followed by feature selection is a relatively novel contribution of our work. We present a comprehensive investigation of the impact of data sampling followed by attribute selection on defect predictors built with imbalanced data. The case study data are obtained from several real-world high-assurance software projects. The key results are that attribute selection is more efficient when applied after data sampling, and that defect prediction performance generally improves after applying data sampling and feature selection.
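The pipeline described above (data sampling to counter class imbalance, followed by wrapper-based feature ranking over the software metrics) can be illustrated with a short sketch. The snippet below is a minimal, hypothetical illustration rather than the authors' exact procedure: it assumes random undersampling of the not-fault-prone class, a naive Bayes learner inside the wrapper, cross-validated AUC as the ranking score, and synthetic data standing in for real software measurements.

# Minimal sketch, assuming: random undersampling, a GaussianNB learner,
# cross-validated AUC for wrapper-based feature ranking, synthetic data.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

def undersample(X, y):
    """Randomly discard majority-class (not-fault-prone) modules until classes balance."""
    pos = np.flatnonzero(y == 1)          # fault-prone (minority) modules
    neg = np.flatnonzero(y == 0)          # not-fault-prone (majority) modules
    keep_neg = rng.choice(neg, size=len(pos), replace=False)
    idx = np.concatenate([pos, keep_neg])
    return X[idx], y[idx]

def wrapper_rank(X, y, k=5):
    """Rank each software metric by the cross-validated AUC of a learner using it alone."""
    scores = [
        cross_val_score(GaussianNB(), X[:, [j]], y, cv=5, scoring="roc_auc").mean()
        for j in range(X.shape[1])
    ]
    order = np.argsort(scores)[::-1]      # best-scoring metrics first
    return order[:k]                      # indices of the top-k metrics

# Usage with synthetic, imbalanced data in place of real software metrics:
X = rng.normal(size=(500, 10))
y = (rng.random(500) < 0.1).astype(int)   # roughly 10% fault-prone modules
X_bal, y_bal = undersample(X, y)
selected = wrapper_rank(X_bal, y_bal, k=5)
print("selected metric indices:", selected)

Ranking features on the sampled data, rather than the original imbalanced data, mirrors the ordering studied in the paper (sampling first, then attribute selection); the specific learner and the top-k cutoff here are illustrative choices.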
Notes
Positive and negative refer to fault-prone and not-fault-prone modules, respectively.
Cite this article
Gao, K., Khoshgoftaar, T.M. & Seliya, N. Predicting high-risk program modules by selecting the right software measurements. Software Qual J 20, 3–42 (2012). https://doi.org/10.1007/s11219-011-9132-0