
Data Complexity: A New Perspective for Analyzing the Difficulty of Defect Prediction Tasks

Published: 27 June 2024

Abstract

Defect prediction is crucial for software quality assurance and has been extensively researched over recent decades. However, prior studies rarely focus on data complexity in defect prediction tasks, and fewer still seek to understand the difficulty of these tasks from the perspective of data complexity. In this article, we conduct an empirical study that estimates the hardness of over 33,000 instances, employing a set of measures to characterize both the inherent difficulty of individual instances and the characteristics of defect datasets. Our findings indicate that: (1) instance hardness in both classes displays a right-skewed distribution, with the defective class exhibiting a more scattered distribution; (2) class overlap is the primary factor influencing instance hardness and can be characterized through feature, structural, and instance-level overlap; (3) no universal preprocessing technique is applicable to all datasets, and preprocessing may not consistently reduce data complexity; fortunately, dataset complexity measures can help identify suitable techniques for specific datasets; (4) integrating data complexity information into the learning process can enhance an algorithm's learning capacity. In summary, this empirical study highlights the crucial role of data complexity in defect prediction tasks and provides a novel perspective for advancing research in defect prediction techniques.
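
The findings above rest on two kinds of quantities: per-instance hardness scores and dataset-level complexity measures. As a rough illustration only (not the authors' exact measurement protocol), the Python sketch below estimates instance hardness as one minus the average out-of-fold probability that a small classifier pool assigns to each instance's true label, in the spirit of the instance-hardness literature (Smith et al.), and computes one classical overlap measure, the maximum Fisher's discriminant ratio (F1), from Ho and Basu's complexity-measure family. The synthetic dataset, the classifier pool, and all parameter choices are illustrative assumptions.

```python
# A minimal sketch, assuming scikit-learn and synthetic data; not the
# authors' exact protocol. Instance hardness here is 1 - p(true label),
# averaged over a pool of cross-validated classifiers.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in imbalanced data; a real study would load defect datasets instead.
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.8, 0.2], random_state=0)

pool = [
    make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    RandomForestClassifier(n_estimators=100, random_state=0),
    GaussianNB(),
]

# For each classifier, get out-of-fold class probabilities, then keep the
# probability it assigns to each instance's true class.
per_clf = []
for clf in pool:
    proba = cross_val_predict(clf, X, y, cv=5, method="predict_proba")
    per_clf.append(proba[np.arange(len(y)), y])

hardness = 1.0 - np.mean(per_clf, axis=0)  # in [0, 1]; higher = harder

# Inspect the per-class distribution (the article reports right-skewed
# distributions, more scattered for the defective class).
for label in (0, 1):
    h = hardness[y == label]
    print(f"class {label}: mean={h.mean():.3f}, median={np.median(h):.3f}, "
          f"90th pct={np.percentile(h, 90):.3f}")

# A simple dataset-level overlap measure in the spirit of Ho & Basu's
# maximum Fisher's discriminant ratio (F1): per-feature separation of the
# class means relative to within-class variance (larger = less overlap).
mu0, mu1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
v0, v1 = X[y == 0].var(axis=0), X[y == 1].var(axis=0)
f1 = ((mu0 - mu1) ** 2 / (v0 + v1 + 1e-12)).max()
print(f"max Fisher's discriminant ratio (F1): {f1:.3f}")
```

On real defect datasets one would substitute the study's datasets and tuned learners; open-source libraries such as problexity implement a fuller suite of dataset complexity measures.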


Cited By

  • Adjusted Trust Score: A Novel Approach for Estimating the Trustworthiness of Software Defect Prediction Models. IEEE Transactions on Reliability 73, 4 (2024), 1877–1891. https://doi.org/10.1109/TR.2024.3393734
  • The effect of data complexity on classifier performance. Empirical Software Engineering 30, 1 (2024). https://doi.org/10.1007/s10664-024-10554-5


Published In

ACM Transactions on Software Engineering and Methodology, Volume 33, Issue 6 (July 2024), 951 pages.
EISSN: 1557-7392
DOI: 10.1145/3613693
Editor: Mauro Pezzé

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 27 June 2024
Online AM: 26 February 2024
Accepted: 15 February 2024
Revised: 01 January 2024
Received: 05 May 2023
Published in TOSEM Volume 33, Issue 6

Author Tags

  1. Defect prediction
  2. machine learning
  3. data complexity
  4. instance hardness

Qualifiers

  • Research-article

Funding Sources

  • National Natural Science Foundation of China
