Abstract
Although very important in software engineering, establishing traceability links between software artifacts is extremely tedious and error-prone, and it requires significant effort. Even when approaches for automated traceability recovery exist, they provide the requirements analyst with a ranked list of candidate links, usually very long, that must be manually inspected. In this paper we introduce an approach called Estimation of the Number of Remaining Links (ENRL), which aims at estimating, via Machine Learning (ML) classifiers, the number of remaining positive links in a ranked list of candidate traceability links produced by a recovery approach based on Natural Language Processing (NLP) techniques. We have evaluated the accuracy of the ENRL approach by considering several ML classifiers and NLP techniques on three datasets from industry and academia, covering traceability links among different kinds of software artifacts, including requirements, use cases, design documents, source code, and test cases. Results from our study indicate that: (i) specific estimation models are able to provide accurate estimates of the number of remaining positive links; (ii) the estimation accuracy depends on the choice of the NLP technique; and (iii) univariate estimation models outperform multivariate ones.
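To make the idea of estimating the remaining positive links concrete, the following is a minimal sketch and not the ENRL implementation evaluated in the paper. It assumes a single predictor (the similarity score of a candidate link, hence a univariate model), a simple threshold classifier in place of the ML classifiers studied, and invented data; the names fit_threshold, estimate_remaining_links, training_links, and uninspected_scores are hypothetical.

```python
from typing import List, Tuple


def fit_threshold(training_links: List[Tuple[float, bool]]) -> float:
    """Pick the similarity cut-off that misclassifies the fewest labeled links.

    Each training item is (similarity score, is a true link). This is a
    one-feature decision stump, used here only as a stand-in classifier.
    """
    if not training_links:
        raise ValueError("at least one labeled link is required")
    candidates = sorted({score for score, _ in training_links})
    best_threshold, best_errors = candidates[0], len(training_links) + 1
    for threshold in candidates:
        # A link is predicted positive when its score reaches the threshold.
        errors = sum((score >= threshold) != is_link
                     for score, is_link in training_links)
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def estimate_remaining_links(training_links: List[Tuple[float, bool]],
                             uninspected_scores: List[float]) -> int:
    """Estimate how many uninspected candidate links are positive."""
    threshold = fit_threshold(training_links)
    return sum(score >= threshold for score in uninspected_scores)


# Invented example data: links already classified by an analyst (assumption).
training_links = [(0.91, True), (0.84, True), (0.77, False),
                  (0.70, True), (0.52, False), (0.40, False)]
# Similarity scores of candidate links that have not been inspected yet.
uninspected_scores = [0.88, 0.73, 0.69, 0.35, 0.20]

estimate = estimate_remaining_links(training_links, uninspected_scores)
print(f"Estimated remaining positive links: {estimate}")
```

With an estimate of this kind, an analyst could decide whether further inspection of the ranked list is likely to be worthwhile; how accurate such estimates are in practice is the question the study addresses.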
Notes
For LSA, the integer after the technique name, e.g., LSA 100, indicates the number of LSA concepts.
This is a common problem in IR-based traceability link recovery (De Lucia et al. 2011).
References
Abadi A, Nisenson M, Simionovici Y (2008) A traceability technique for specifications. In: The 16th IEEE international conference on program comprehension, ICPC 2008, Amsterdam, The Netherlands, June 10–13, 2008. IEEE CS, pp 103–112
Altman NS (1992) An introduction to kernel and nearest-neighbor nonparametric regression. Am Stat 46(3):175–185
Antoniol G, Canfora G, Casazza G, De Lucia A (2000) Identifying the starting impact set of a maintenance request: a case study. In: European conference on software maintenance and reengineering, CSMR, pp 227–230
Antoniol G, Canfora G, Casazza G, De Lucia A, Merlo E (2002) Recovering traceability links between code and documentation. IEEE Trans Softw Eng 28(10):970–983
Asuncion HU, Asuncion AU, Taylor RN (2010) Software traceability with topic modeling. In: Proceedings of the 32nd ACM/IEEE international conference on software engineering - volume 1, ICSE 2010, Cape Town, South Africa, 1–8 May 2010, pp 95–104
Athanasiadis I (2007) The fuzzy lattice reasoning (FLR) classifier for mining environmental data. In: Kaburlasos V, Ritter G (eds) Computational intelligence based on lattice theory, studies in computational intelligence, vol 67. Springer, Berlin, Heidelberg, pp 175–193. doi:10.1007/978-3-540-72687-6_9
Baeza-Yates R, Ribeiro-Neto B (1999) Modern information retrieval. Addison-Wesley
Bai CG, Cai KY, Hu QP, Ng SH (2008) On the trend of remaining software defect estimation. IEEE Trans Syst Man Cybern Part A Syst Humans 38(5):1129–1142. doi:10.1109/TSMCA.2008.2001071
Baker RD (1995) Modern permutation test software. In: Edgington E (ed) Randomization tests. Marcel Dekker
Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022. doi:10.1162/jmlr.2003.3.4-5.993
Borg M, Runeson P, Ardö A (2014) Recovering from a decade: a systematic mapping of information retrieval approaches to software traceability. Empir Softw Eng 19(6):1565–1616. doi:10.1007/s10664-013-9255-y
Breiman L (1996) Bagging predictors. Mach Learn 24(2):123–140
Briand LC, Emam KE, Freimut BG, Laitenberger O (2000) A comprehensive evaluation of capture-recapture models for estimating software defect content. IEEE Trans Softw Eng 26(6):518–540
Briand LC, Falessi D, Nejati S, Sabetzadeh M, Yue T (2014) Traceability and SysML design slices to support safety inspections: a controlled experiment. ACM Trans Softw Eng Methodol 23(1):9:1–9:43. doi:10.1145/2559978
Cai K (1998) On estimating the number of defects remaining in software. J Syst Softw 40(2):93–114. doi:10.1016/S0164-1212(97)00003-4
Capobianco G, De Lucia A, Oliveto R, Panichella A, Panichella S (2009) On the role of the nouns in IR-based traceability recovery. In: The 17th IEEE international conference on program comprehension, ICPC 2009, Vancouver, British Columbia, Canada, May 17–19, 2009. IEEE CS, pp 148–157
Chen T, Sahinoglu M, von Mayrhauser A, Hajjar A, Anderson C (1999) How much testing is enough? Applying stopping rules to behavioral model testing. In: 4th IEEE international symposium on high-assurance systems engineering, 1999. Proceedings, pp 249–256. doi:10.1109/HASE.1999.809500
Cleland-Huang J, Settimi R, Duan C, Zou X (2005) Utilizing supporting evidence to improve dynamic requirements traceability. In: 13th IEEE international conference on requirements engineering (RE 2005), 29 August - 2 September 2005, Paris, France. IEEE CS, pp 135–144
Colwell DJ, Gillett JR (1982) 66.49 Spearman versus Kendall. Math Gaz 66(438):307–309
Cover TM, Thomas JA (1991) Elements of information theory. Wiley-Interscience
Cuddeback D, Dekhtyar A, Huffman Hayes J, Holden J, Kong W-K (2011) Towards overcoming human analyst fallibility in the requirements tracing process. In: Proceedings of the 33rd international conference on software engineering, ICSE 2011, Waikiki, Honolulu, HI, USA, May 21–28, 2011. ACM, pp 860–863
Czauderna A, Cleland-Huang J, Cinar M, Berenbach B (2012) Just-in-time traceability for mechatronics systems. In: IEEE second workshop on requirements engineering for systems, services and systems-of-systems (RES4), 2012, pp 1–9. doi:10.1109/RES4.2012.6347691
Dag JN, Regnell B, Carlshamre P, Andersson M, Karlsson J (2002) A feasibility study of automated natural language requirements analysis in market-driven development. Requir Eng 7(1):20–33
De Lucia A, Oliveto R, Sgueglia P (2006) Incremental approach and user feedbacks: a silver bullet for traceability recovery. In: 22nd IEEE international conference on software maintenance (ICSM 2006), 24–27 September 2006, Philadelphia, Pennsylvania, USA. IEEE Computer Society, pp 299–309
De Lucia A, Fasano F, Oliveto R, Tortora G (2007) Recovering traceability links in software artifact management systems using information retrieval methods. ACM Trans Softw Eng Methodol 16(4)
De Lucia A, Oliveto R, Tortora G (2009) Assessing IR-based traceability recovery tools through controlled experiments. Empir Softw Eng 14(1):57–92
De Lucia A, Di Penta M, Oliveto R, Panichella A, Panichella S (2011) Improving IR-based traceability recovery using smoothing filters. In: The 19th IEEE international conference on program comprehension, ICPC 2011, Kingston, ON, Canada, June 22–24, 2011. IEEE Computer Society, pp 21–30
Deerwester S, Dumais ST, Furnas GW, Landauer TK, Harshman R (1990) Indexing by latent semantic analysis. J Am Soc Inf Sci 41(6):391–407
Dekhtyar A, Dekhtyar O, Holden J, Hayes JH, Cuddeback D, Kong W-K (2011) On human analyst performance in assisted requirements tracing: statistical analysis. In: RE 2011, 19th IEEE international requirements engineering conference, Trento, Italy, August 29 2011–September 2, 2011. IEEE, pp 111–120
Duan C, Cleland-Huang J (2007) Clustering support for automated tracing. In: 22nd IEEE/ACM international conference on automated software engineering (ASE 2007), November 5–9, 2007, Atlanta, Georgia, USA. ACM, pp 244–253
Falessi D, Reichel A (2015) Towards an open-source tool for measuring and visualizing the interest of technical debt. In: IEEE 7th international workshop on managing technical debt (MTD), 2015, pp 1–8. doi:10.1109/MTD.2015.7332618
Falessi D, Briand LC, Cantone G (2009) The impact of automated support for linking equivalent requirements based on similarity measures. Tech. rep., Simula Research Laboratory Technical Report 2009-08
Falessi D, Cantone G, Canfora G (2011) Empirical principles and an industrial case study in retrieving equivalent requirements via natural language processing techniques. IEEE Trans Softw Eng 39(1):18–44
Falessi D, Shaw MA, Mullen K (2014) Achieving and maintaining CMMI maturity level 5 in a small organization. IEEE Softw 31(5):80–86. doi:10.1109/MS.2014.17
Fellbaum C (1998) WordNet: an electronic lexical database. The MIT Press
Foss T, Stensrud E, Kitchenham B, Myrtveit I (2003) A simulation study of the model evaluation criterion MMRE. IEEE Trans Softw Eng 29(11):985–995
Freund Y, Mason L (1999) The alternating decision tree learning algorithm. In: Machine learning: proceedings of the sixteenth international conference. Morgan Kaufmann, pp 124–133
Friedman J, Hastie T, Tibshirani R (2000) Additive logistic regression: a statistical view of boosting. Ann Stat 28(2):337–407
Gethers M, Oliveto R, Poshyvanyk D, De Lucia A (2011) On integrating orthogonal information retrieval methods to improve traceability recovery. In: IEEE 27th international conference on software maintenance, ICSM 2011, Williamsburg, VA, USA, September 25–30, 2011. IEEE, pp 133–142
Hayes JH, Dekhtyar A, Sundaram SK (2006) Advancing candidate link generation for requirements tracing: the study of methods. IEEE Trans Softw Eng 32(1):4–19
Huffman Hayes J, Dekhtyar A, Osborne J (2003) Improving requirements tracing via information retrieval. In: 11th IEEE international conference on requirements engineering (RE 2003), 8–12 September 2003, Monterey Bay, CA, USA. IEEE CS, p 138
Kim S, Zhang H, Wu R, Gong L (2011) Dealing with noise in defect prediction. In: Proceedings of the 33rd international conference on software engineering, ICSE ’11, ACM, New York, NY, USA, pp 481–490. doi:10.1145/1985793.1985859
Krishnan S, Strasburg C, Lutz RR, Goseva-Popstojanova K, Dorman KS (2013) Predicting failure-proneness in an evolving software product line. Inf Softw Technol 55(8):1479–1495. doi:10.1016/j.infsof.2012.11.008
Lindvall M, Sandahl K (1996) Practical implications of traceability. Softw Pract Exper 26(10):1161–1180. doi:10.1002/(SICI)1097-024X(199610)26:10<1161::AID-SPE58>3.3.CO;2-O
Lohar S, Amornborvornwong S, Zisman A, Cleland-Huang J (2013) Improving trace accuracy through data-driven configuration and composition of tracing features. In: Joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on the foundations of software engineering, ESEC/FSE’13, Saint Petersburg, Russian Federation, August 18–26, 2013. ACM, pp 378–388
Lormans M, van Deursen A (2006) Can LSI help reconstructing requirements traceability in design and test? In: 10th European conference on software maintenance and reengineering (CSMR 2006), 22–24 March 2006, Bari, Italy. IEEE Computer Society, pp 47–56
Lormans M, van Deursen A, Groß H (2008) An industrial case study in reconstructing requirements views. Empir Softw Eng 13(6):727–760. doi:10.1007/s10664-008-9078-4
Malaiya YK, Denton J (1998) Estimating the number of residual defects [in software]. In: High-assurance systems engineering symposium, 1998. Proceedings. Third IEEE international, pp 98–105. doi:10.1109/HASE.1998.731600
Marcus A, Maletic JI (2003) Recovering documentation-to-source-code traceability links using latent semantic indexing. In: Proceedings of the 25th international conference on software engineering, May 3-10, 2003, Portland, Oregon, USA. IEEE CS, pp 125–137
Mirakhorli M, Cleland-Huang J (2011) Tracing architectural concerns in high assurance systems (NIER track). In: Proceedings of the 33rd international conference on software engineering, ICSE ’11, ACM, New York, NY, USA, pp 908–911. doi:10.1145/1985793.1985942
Mirakhorli M, Shin Y, Cleland-Huang J, Cinar M (2012) A tactic-centric approach for automating traceability of quality concerns. In: 2012 34th international conference on software engineering (ICSE), pp 639–649. doi:10.1109/ICSE.2012.6227153
Myers JL, Well AD (2003) Research design and statistical analysis. Lawrence Erlbaum Associates, New Jersey
Nagappan N, Ball T, Zeller A (2006) Mining metrics to predict component failures. In: Proceedings of the 28th international conference on software engineering, ICSE ’06, ACM, New York, NY, USA, pp 452–461. doi:10.1145/1134285.1134349
Nam J, Kim S (2015) CLAMI: defect prediction on unlabeled datasets. In: Proceedings of the 30th IEEE/ACM international conference on automated software engineering (ASE 2015)
Okutan A, Yildiz OT (2014) Software defect prediction using Bayesian networks. Empir Softw Eng 19(1):154–181. doi:10.1007/s10664-012-9218-8
Otis D, Burnham K, White G, Anderson D (1978) Statistical inference from capture data on closed animal populations. Wildl Monogr 62(135)
Panichella A, Dit B, Oliveto R, Di Penta M, Poshyvanyk D, De Lucia A (2013) How to effectively use topic models for software engineering tasks? An approach based on genetic algorithms. In: 35th international conference on software engineering, ICSE ’13, San Francisco, CA, USA, May 18–26, 2013. IEEE/ACM, pp 522–531
Petersson H, Thelin T, Runeson P, Wohlin C (2004) Capture-recapture in software inspections after 10 years research–theory, evaluation and application. J Syst Softw 72(2):249–264
Porter MF (1980) An algorithm for suffix stripping. Program 14(3):130–137
Rahman F, Posnett D, Herraiz I, Devanbu P (2013) Sample size vs. bias in defect prediction. In: Proceedings of the 2013 9th joint meeting on foundations of software engineering, ESEC/FSE 2013, ACM, New York, NY, USA, pp 147–157. doi:10.1145/2491411.2491418
Russell SJ, Norvig P (2003) Artificial intelligence: a modern approach, 2nd edn. Pearson Education
Settimi R, Cleland-Huang J, Khadra OB, Mody J, Lukasik W, DePalma C (2004) Supporting software evolution through dynamically retrieving traces to UML artifacts. In: 7th international workshop on principles of software evolution (IWPSE 2004), 6–7 September 2004, Kyoto, Japan. IEEE Computer Society, pp 49–54
Stone M (1974) Cross-validatory choice and assessment of statistical predictions (with discussion). J R Stat Soc Ser B 36:111–147
Witten IH, Frank E, Hall MA (2011) Data mining: practical machine learning tools and techniques, 3rd edn. Morgan Kaufmann Publishers Inc., San Francisco
Wohlin C, Runeson P, Höst M, Ohlsson MC, Regnell B, Wesslén A (2000) Experimentation in software engineering: an introduction. Kluwer Academic Publishers, Norwell
Yadla S, Huffman Hayes J, Dekhtyar A (2005) Tracing requirements to defect reports: an application of information retrieval techniques. ISSE 1(2):116–124
Zou X, Settimi R, Cleland-Huang J (2007) Term-based enhancement factors for improving automated requirement trace retrieval. In: ACM international symposium on grand challenges of traceability
Zou X, Settimi R, Cleland-Huang J (2010) Improving automated requirements trace retrieval: a study of term-based enhancement methods. Empir Softw Eng 15(2):119–146
Additional information
Communicated by: Patrick Mäder, Rocco Oliveto and Andrian Marcus
About this article
Cite this article
Falessi, D., Di Penta, M., Canfora, G. et al. Estimating the number of remaining links in traceability recovery. Empir Software Eng 22, 996–1027 (2017). https://doi.org/10.1007/s10664-016-9460-6
DOI: https://doi.org/10.1007/s10664-016-9460-6