DOI: 10.1145/1868328.1868348
Research article

Sensitivity of results to different data quality meta-data criteria in the sample selection of projects from the ISBSG dataset

Published: 12 September 2010

Abstract

Background: Most prediction models, e.g., effort estimation models, require preprocessing of the data. Some datasets, such as ISBSG, contain data quality meta-data which can be used to filter low-quality cases out of the analysis. However, researchers have not yet reached agreement on these data quality selection criteria.
Aims: This paper aims to analyze the influence of data quality meta-data criteria on the number of selected projects, which in turn can affect the models obtained. To this end, a case study was conducted to gain a more complete understanding of what future research might need to focus on.
Method: First, the data quality meta-data selection criteria used in previous works that propose prediction models based on the ISBSG dataset were reviewed. Considerable attention was paid to two data quality meta-data variables in ISBSG dataset Release 11: Data Quality Rating and Unadjusted Function Point Rating. Secondly, this paper considers data from 830 projects drawn from the ISBSG dataset after a preliminary screening, which mainly yields a subset of projects with comparable definitions of size and effort. The data quality meta-data criteria were then applied in order to assess their influence.
Results: Overall, data selection criteria, even apart from data quality meta-data concerns, involve a substantial reduction in sample size: of 5052 projects, only 830 are actually considered. If the maximum quality rating is then required for both data quality meta-data variables, only 262 projects remain for analysis. However, since the initial data preparation already addresses the problem of missingness for a given purpose, the data quality criteria do not appear to be decisive for the analysis results, although some variability was observed.
Conclusions: Although this analysis is supported by a single case study, it is hoped that it contributes to a better understanding of the subject. Indeed, the results suggest that in studies where the project selection criteria are not applied very strictly, these data quality criteria must be taken into account carefully.




    Published In

    PROMISE '10: Proceedings of the 6th International Conference on Predictive Models in Software Engineering
    September 2010
    195 pages
    ISBN:9781450304047
    DOI:10.1145/1868328
    General Chair: Tim Menzies
    Program Chair: Gunes Koru
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. data quality meta-data
    2. datasets
    3. effort
    4. empirical research
    5. functional size
    6. prediction models
    7. software projects

    Qualifiers

    • Research-article

    Conference

    PROMISE '10

    Acceptance Rates

    PROMISE '10 Paper Acceptance Rate: 19 of 53 submissions, 36%
    Overall Acceptance Rate: 98 of 213 submissions, 46%


    Cited By

    • (2023) The Impact of Data Quality on Software Testing Effort Prediction. Electronics 12:7, article 1656. DOI: 10.3390/electronics12071656. Online publication date: 31-Mar-2023.
    • (2017) Exploration of development projects of renewable energy applications in the ISBSG dataset: Empirical study. 2017 2nd International Conference on the Applications of Information Technology in Developing Renewable Energy Processes & Systems (IT-DREPS), 1-6. DOI: 10.1109/IT-DREPS.2017.8277808. Online publication date: Dec-2017.
    • (2017) Investigating the use of moving windows to improve software effort prediction. Empirical Software Engineering 22:2, 716-767. DOI: 10.1007/s10664-016-9446-4. Online publication date: 1-Apr-2017.
    • (2016) The usage of ISBSG data fields in software effort estimation. Journal of Systems and Software 113:C, 188-215. DOI: 10.1016/j.jss.2015.11.040. Online publication date: 1-Mar-2016.
    • (2015) Integrating non-parametric models with linear components for producing software cost estimations. Journal of Systems and Software 99:C, 120-134. DOI: 10.1016/j.jss.2014.09.025. Online publication date: 1-Jan-2015.
    • (2013) A Taxonomy of Data Quality Challenges in Empirical Software Engineering. Proceedings of the 2013 22nd Australian Conference on Software Engineering, 97-106. DOI: 10.1109/ASWEC.2013.21. Online publication date: 4-Jun-2013.
    • (2012) Discretization methods for NBC in effort estimation. Proceedings of the ACM-IEEE international symposium on Empirical software engineering and measurement, 103-106. DOI: 10.1145/2372251.2372268. Online publication date: 19-Sep-2012.
    • (2012) Software Effort Estimation Using NBC and SWR. Proceedings of the 2012 Joint Conference of the 22nd International Workshop on Software Measurement and the 2012 Seventh International Conference on Software Process and Product Measurement, 132-136. DOI: 10.1109/IWSM-MENSURA.2012.28. Online publication date: 17-Oct-2012.
