DOI: 10.1145/2810146.2810150
Short paper

What is the Impact of Imbalance on Software Defect Prediction Performance?

Published: 21 October 2015

Abstract

Software defect prediction performance varies over a large range. Menzies et al. suggested there is a ceiling effect of 80% recall [8]. Most of the datasets used are highly imbalanced. This paper asks: what is the empirical effect of using datasets with varying levels of imbalance on predictive performance? We use data synthesised by a previous meta-analysis of 600 fault prediction models and their results. Four model evaluation measures (the Matthews Correlation Coefficient (MCC), F-measure, precision and recall) are compared to the corresponding data imbalance ratio. When the data are imbalanced, the predictive performance of software defect prediction studies is low. As the data become more balanced, the predictive performance of prediction models increases from an average MCC of 0.15 until the minority class makes up 20% of the instances in the dataset, where the MCC reaches an average value of about 0.34. As the proportion of the minority class increases above 20%, the predictive performance does not significantly increase; using datasets in which more than 20% of the instances are defective has no significant impact on predictive performance measured by MCC. We conclude that comparisons of defect prediction results should take the imbalance of the data into account.
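The four evaluation measures compared in the abstract can all be computed from a binary confusion matrix. The sketch below (plain Python; the counts are made up for a hypothetical classifier on a 10%-defective test set, not taken from the paper's data) illustrates why MCC reflects performance on both classes, while recall alone does not.

```python
import math

def confusion_metrics(tp, fp, tn, fn):
    """Precision, recall, F-measure and MCC from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if (precision + recall) else 0.0)
    # MCC denominator is zero when any row/column of the matrix is empty;
    # by convention the score is then 0.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall,
            "f_measure": f_measure, "mcc": mcc}

# Illustrative (invented) counts: 100 defective and 900 non-defective
# modules, so the minority class makes up 10% of the instances.
m = confusion_metrics(tp=60, fp=90, tn=810, fn=40)
print({k: round(v, 3) for k, v in m.items()})
```

Here recall is 0.6, which sounds respectable, yet MCC stays modest because the 90 false positives are penalised relative to the small minority class; this asymmetry is one reason the paper compares measures against the imbalance ratio rather than relying on recall alone.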

References

[1]
G. Batista, D. Silva, and R. Prati. An experimental design to evaluate class imbalance treatment methods. In 11th International Conference on Machine Learning and Applications (ICMLA), volume 2, pages 95--101. IEEE, 2012.
[2]
R. Blagus and L. Lusa. Evaluation of SMOTE for high-dimensional class-imbalanced microarray data. In 11th International Conference on Machine Learning and Applications (ICMLA), volume 2, pages 89--94. IEEE, 2012.
[3]
N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321--357, 2002.
[4]
D. P. H. Gray. Software defect prediction using static code metrics: Formulating a methodology. PhD thesis, University of Hertfordshire, 2013.
[5]
T. Hall, S. Beecham, D. Bowes, D. Gray, and S. Counsell. A systematic literature review on fault prediction performance in software engineering. IEEE Transactions on Software Engineering, 38(6):1276--1304, Nov. 2012.
[6]
P. C. R. Lane, D. Clarke, and P. Hender. On developing robust models for favourability analysis: Model choice, feature sets and imbalanced data. Decision Support Systems, 53(4):712--718, 2012.
[7]
M. Levinson. Let's stop wasting $78 billion a year. CIO, 15 October, pages 78--83, 2001.
[8]
T. Menzies, B. Turhan, A. Bener, G. Gay, B. Cukic, and Y. Jiang. Implications of ceiling effects in defect predictors. In Proceedings of the 4th International Workshop on Predictor Models in Software Engineering, pages 47--54. ACM, 2008.
[9]
D. Rodriguez, I. Herraiz, R. Harrison, J. Dolado, and J. C. Riquelme. Preliminary comparison of techniques for dealing with imbalance in software defect prediction. In Proceedings of the 18th International Conference on Evaluation and Assessment in Software Engineering, EASE '14, pages 43:1--43:10, New York, NY, USA, 2014. ACM.
[10]
P. Runeson and A. Andrews. Detection or isolation of defects? An experimental comparison of unit testing and code inspection. In 14th International Symposium on Software Reliability Engineering (ISSRE 2003), pages 3--13. IEEE, 2003.
[11]
M. Shepperd, D. Bowes, and T. Hall. Researcher bias: The use of machine learning in software defect prediction. IEEE Transactions on Software Engineering, 40(6):603--616, June 2014.
[12]
J. Van Hulse, T. M. Khoshgoftaar, and A. Napolitano. Experimental perspectives on learning from imbalanced data. In Z. Ghahramani, editor, Proceedings of the 24th International Conference on Machine Learning, pages 935--942. ACM, 2007.



Published In

PROMISE '15: Proceedings of the 11th International Conference on Predictive Models and Data Analytics in Software Engineering
October 2015
63 pages
ISBN:9781450337151
DOI:10.1145/2810146

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Data Imbalance
  2. Defect Prediction
  3. Machine Learning

Qualifiers

  • Short-paper
  • Research
  • Refereed limited


Acceptance Rates

PROMISE '15 paper acceptance rate: 8 of 16 submissions, 50%.
Overall acceptance rate: 98 of 213 submissions, 46%.


Cited By

  • (2025) The effect of data complexity on classifier performance. Empirical Software Engineering, 30(1). DOI: 10.1007/s10664-024-10554-5. Online publication date: 1-Feb-2025.
  • (2024) A Software Defect Prediction Method That Simultaneously Addresses Class Overlap and Noise Issues after Oversampling. Electronics, 13(20), 3976. DOI: 10.3390/electronics13203976. Online publication date: 10-Oct-2024.
  • (2024) Software Defect Prediction Approach Based on a Diversity Ensemble Combined With Neural Network. IEEE Transactions on Reliability, 73(3), 1487--1501. DOI: 10.1109/TR.2024.3356515. Online publication date: Sep-2024.
  • (2024) Ensemble Learning Applications in Software Fault Prediction. Proceedings of International Joint Conference on Advances in Computational Intelligence, 533--543. DOI: 10.1007/978-981-97-0180-3_41. Online publication date: 2-Apr-2024.
  • (2023) Bayesian Meta-Analysis of Software Defect Prediction With Machine Learning. IEEE Transactions on Industrial Cyber-Physical Systems, 1, 147--156. DOI: 10.1109/TICPS.2023.3306723. Online publication date: 2023.
  • (2023) Early Diabetes prediction with optimal feature selection using ML based Prediction Framework. 2023 4th International Conference on Signal Processing and Communication (ICSPC), 391--395. DOI: 10.1109/ICSPC57692.2023.10125956. Online publication date: 23-Mar-2023.
  • (2023) Just-in-time defect prediction for mobile applications: using shallow or deep learning? Software Quality Journal, 31(4), 1281--1302. DOI: 10.1007/s11219-023-09629-1. Online publication date: 9-Jun-2023.
  • (2023) Outlier Mining Techniques for Software Defect Prediction. Software Quality: Higher Software Quality through Zero Waste Development, 41--60. DOI: 10.1007/978-3-031-31488-9_3. Online publication date: 13-May-2023.
  • (2022) A Survey of Different Approaches for the Class Imbalance Problem in Software Defect Prediction. International Journal of Software Science and Computational Intelligence, 14(1), 1--26. DOI: 10.4018/IJSSCI.301268. Online publication date: 3-Jun-2022.
  • (2022) Eliminating the high false-positive rate in defect prediction through BayesNet with adjustable weight. Expert Systems, 39(6). DOI: 10.1111/exsy.12977. Online publication date: 4-Mar-2022.
