Abstract
Choosing the appropriate missing data (MD) imputation technique for a given software development effort estimation (SDEE) technique is not a trivial task. In fact, the impact of MD imputation on the estimation output depends on the dataset and the SDEE technique used, and there is no best imputation technique in all contexts. Thus, an attractive solution is to use more than one imputation technique and combine their results to obtain a final imputation outcome. This concept is called ensemble imputation and can significantly improve the effort estimation accuracy. This study proposes and constructs 11 heterogeneous ensemble imputation techniques, whose members are two, three, or four of the following single imputation techniques: K-nearest neighbors, expectation maximization, support vector regression (SVR), and decision trees (DTs). The effects of single/ensemble imputation techniques on SDEE performance were evaluated over six SDEE datasets: COCOMO81, ISBSG, Desharnais, China, Kemerer, and Miyazaki. Five SDEE performance measures were used: standardized accuracy (SA), predictor at 25% (Pred(0.25)), mean balanced relative error (MBRE), mean inverted balanced relative error (MIBRE), and logarithmic standard deviation (LSD). Moreover, we used: (1) the Scott-Knott (SK) statistical test to cluster and compare the results, and (2) the Borda count method to rank the SDEE techniques belonging to the best SK cluster.
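As an illustration of three of the performance measures named above, the following minimal sketch computes Pred(0.25), MBRE, and MIBRE using the definitions common in the SDEE literature. The abstract does not give the formulas, so these definitions, the function names, and the sample values are assumptions made for illustration only.

```python
def pred(actual, estimated, level=0.25):
    """Fraction of projects whose relative error |a - e| / a is at most `level`."""
    hits = sum(abs(a - e) / a <= level for a, e in zip(actual, estimated))
    return hits / len(actual)


def mbre(actual, estimated):
    """Mean balanced relative error: |a - e| divided by the smaller of a and e."""
    return sum(abs(a - e) / min(a, e) for a, e in zip(actual, estimated)) / len(actual)


def mibre(actual, estimated):
    """Mean inverted balanced relative error: |a - e| divided by the larger of a and e."""
    return sum(abs(a - e) / max(a, e) for a, e in zip(actual, estimated)) / len(actual)


# Hypothetical effort values (e.g. person-hours), not taken from any dataset in the study.
actual_effort = [100.0, 200.0, 400.0]
estimated_effort = [120.0, 150.0, 400.0]

print(pred(actual_effort, estimated_effort))
print(mbre(actual_effort, estimated_effort))
print(mibre(actual_effort, estimated_effort))
```

Higher Pred(0.25) is better, while lower MBRE and MIBRE are better. All three assume strictly positive actual and estimated efforts.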
The results showed that ensemble imputers significantly improved the performance of SDEE techniques compared to single imputation techniques. We also found that adding one or more imputers to the ensemble imputers generally led to a significant improvement in the SDEE performance. When the performance improvement is not significant, it is better to use the ensemble imputer with the minimum number of members because it is less complex. For ensemble imputers, the results suggest that no particular ensemble imputer gave the best results in all contexts. Overall, SVR imputation was the best imputation technique used to construct ensemble imputers for the SDEE. For the SDEE techniques, the best results were obtained by the DTs and SVR variants using ensemble imputation.
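The idea of ensemble imputation described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the two member imputers are deliberately simple stand-ins for the KNN, EM, SVR, and DT imputers used in the study, and combining members by averaging is an assumption, since the abstract does not state the combination rule. All names and data are hypothetical.

```python
from statistics import mean


def mean_impute(rows, r, c):
    """Stand-in member: mean of the observed values in column c."""
    return mean(row[c] for row in rows if row[c] is not None)


def nearest_neighbor_impute(rows, r, c):
    """Stand-in member: copy column c from the closest row that has it observed."""
    target = rows[r]

    def distance(other):
        # Mean squared difference over the columns observed in both rows.
        shared = [(t - o) ** 2 for t, o in zip(target, other)
                  if t is not None and o is not None]
        return sum(shared) / len(shared) if shared else float("inf")

    donors = [row for i, row in enumerate(rows) if i != r and row[c] is not None]
    return min(donors, key=distance)[c]


def ensemble_impute(rows, imputers):
    """Fill each missing cell (None) with the average of the members' imputations.

    Averaging is an assumed combination rule chosen for this sketch.
    """
    completed = [list(row) for row in rows]
    for r, row in enumerate(rows):
        for c, value in enumerate(row):
            if value is None:
                completed[r][c] = mean(imp(rows, r, c) for imp in imputers)
    return completed


# Hypothetical project data: [size, effort], with one missing effort value.
projects = [[1.0, 10.0], [2.0, 20.0], [3.0, None], [10.0, 100.0]]
print(ensemble_impute(projects, [mean_impute, nearest_neighbor_impute]))
```

Adding a member to the ensemble only means appending another function with the same `(rows, r, c)` signature to the list, which mirrors the two-, three-, and four-member ensembles compared in the study.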
Additional information
Communicated by: Sousuke Amasaki, Xin Xia, Shane McIntosh
Special Issue on Predictive Models and Data Analytics in Software Engineering (PROMISE) 2021.
Appendix A
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Abnane, I., Idri, A., Chlioui, I. et al. Evaluating ensemble imputation in software effort estimation. Empir Software Eng 28, 56 (2023). https://doi.org/10.1007/s10664-022-10260-0
DOI: https://doi.org/10.1007/s10664-022-10260-0