DOI: 10.1145/3196398.3196446

Automatic classification of software artifacts in open-source applications

Published: 28 May 2018

Abstract

With the increasing popularity of open-source software development, the number of software artifacts that provide insight into how people build software has grown tremendously. Researchers are always looking for large-scale, representative software artifacts with which to validate novel and existing techniques systematically and without bias. For example, in the domain of software requirements traceability, researchers often use software applications with multiple types of artifacts, such as requirements, system elements, verifications, or tasks, to develop and evaluate their traceability analysis techniques. However, manually identifying rich software artifacts is very labor-intensive. In this work, we first conduct a large-scale study to identify which types of software artifacts are produced by a wide variety of open-source projects at different levels of granularity. We then propose an automated approach based on machine learning techniques to identify various types of software artifacts. Through a set of experiments, we report and compare the performance of these algorithms when applied to software artifacts.
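
To make the idea of automatically identifying artifact types concrete, the following is a minimal sketch of text-based artifact classification. It is not the approach evaluated in the paper: the abstract does not name the algorithms or features used, so the multinomial naive Bayes classifier, the tokenizer, the labels (`requirement`, `source`, `test`), and the tiny training set below are all hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


class NaiveBayesArtifactClassifier:
    """Multinomial naive Bayes with add-one (Laplace) smoothing.

    Illustrative only; an assumption, not the paper's method.
    """

    def fit(self, samples):
        # samples: iterable of (text, label) pairs
        self.label_counts = Counter()
        self.token_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in samples:
            self.label_counts[label] += 1
            for tok in tokenize(text):
                self.token_counts[label][tok] += 1
                self.vocab.add(tok)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            # log prior plus the sum of smoothed log likelihoods
            score = math.log(n / total)
            denom = sum(self.token_counts[label].values()) + len(self.vocab)
            for tok in tokens:
                score += math.log((self.token_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training data: a file path followed by a few content words.
TRAINING = [
    ("docs/requirements/login_spec.md user shall authenticate", "requirement"),
    ("docs/requirements/export_spec.md system shall export report", "requirement"),
    ("src/auth/login.py def authenticate return", "source"),
    ("src/report/export.py class def return import", "source"),
    ("tests/test_login.py assert def test", "test"),
    ("tests/test_export.py assert def test mock", "test"),
]

clf = NaiveBayesArtifactClassifier().fit(TRAINING)
print(clf.predict("tests/test_report.py assert def test"))
print(clf.predict("docs/requirements/search_spec.md user shall search"))
```

In this sketch, path tokens and content words are treated as one bag of words. A realistic setup would need far more labeled data, and an evaluation on held-out artifacts of the kind the abstract describes.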


      Published In

      MSR '18: Proceedings of the 15th International Conference on Mining Software Repositories
      May 2018
      627 pages
      ISBN: 9781450357166
      DOI: 10.1145/3196398
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. machine learning
      2. open-source software
      3. software artifacts

      Qualifiers

      • Research-article

      Conference

      ICSE '18

      Cited By

      • (2024) Challenges and Solutions of Free and Open Source Software Documentation: A Systematic Mapping Study. Anais do XXXVIII Simpósio Brasileiro de Engenharia de Software (SBES 2024), 114-125. DOI: 10.5753/sbes.2024.3307. Online publication date: 30-Sep-2024.
      • (2024) Beyond Manual Modeling: Automating GUI Model Generation Using Design Documents. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, 91-103. DOI: 10.1145/3691620.3695032. Online publication date: 27-Oct-2024.
      • (2024) Triage Software Update Impact via Release Notes Classification. Procedia Computer Science 238, 618-622. DOI: 10.1016/j.procs.2024.06.069. Online publication date: 2024.
      • (2024) A survey on machine learning techniques applied to source code. Journal of Systems and Software 209:C. DOI: 10.1016/j.jss.2023.111934. Online publication date: 14-Mar-2024.
      • (2023) Synergy of Patent and Open-Source-Driven Sustainable Climate Governance under Green AI: A Case Study of TinyML. Sustainability 15:18, 13779. DOI: 10.3390/su151813779. Online publication date: 15-Sep-2023.
      • (2023) Do CONTRIBUTING Files Provide Information about OSS Newcomers’ Onboarding Barriers? Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 16-28. DOI: 10.1145/3611643.3616288. Online publication date: 30-Nov-2023.
      • (2023) Integrating human values in software development using a human values dashboard. Empirical Software Engineering 28:3. DOI: 10.1007/s10664-023-10305-y. Online publication date: 18-Apr-2023.
      • (2022) Open Source Software Development Challenges. Research Anthology on Agile Software, Software Development, and Testing, 2134-2164. DOI: 10.4018/978-1-6684-3702-5.ch102. Online publication date: 2022.
      • (2022) Towards a classification of sustainable software development process using manifold machine learning techniques. Journal of Intelligent & Fuzzy Systems: Applications in Engineering and Technology 42:6, 6183-6194. DOI: 10.3233/JIFS-212600. Online publication date: 1-Jan-2022.
      • (2022) Unified abstract syntax tree representation learning for cross-language program classification. Proceedings of the 30th IEEE/ACM International Conference on Program Comprehension, 390-400. DOI: 10.1145/3524610.3527915. Online publication date: 16-May-2022.