Automatic classification of software artifacts in open-source applications

Authors: Sarah Fakhoury, Michael Christensen, Venera Arnaoudova, Mehdi Mirakhorli

MSR '18: Proceedings of the 15th International Conference on Mining Software Repositories

Pages 414 - 425

https://doi.org/10.1145/3196398.3196446

Published: 28 May 2018

Abstract

With the increasing popularity of open-source software development, there is a tremendous growth of software artifacts that provide insight into how people build software. Researchers are always looking for large-scale and representative software artifacts to produce systematic and unbiased validation of novel and existing techniques. For example, in the domain of software requirements traceability, researchers often use software applications with multiple types of artifacts, such as requirements, system elements, verifications, or tasks to develop and evaluate their traceability analysis techniques. However, the manual identification of rich software artifacts is very labor-intensive. In this work, we first conduct a large-scale study to identify which types of software artifacts are produced by a wide variety of open-source projects at different levels of granularity. Then we propose an automated approach based on Machine Learning techniques to identify various types of software artifacts. Through a set of experiments, we report and compare the performance of these algorithms when applied to software artifacts.

Index Terms

Automatic classification of software artifacts in open-source applications
1. Software and its engineering
  1. Software notations and tools
    1. Software libraries and repositories
  2. Software organization and properties

Information & Contributors

Information

Published In

MSR '18: Proceedings of the 15th International Conference on Mining Software Repositories

May 2018

627 pages

ISBN:9781450357166

DOI:10.1145/3196398

General Chair: Andy Zaidman, Delft University of Technology, Netherlands

Program Chairs: Yasutaka Kamei, Kyushu University, Japan; Emily Hill, Drew University

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGSOFT: ACM Special Interest Group on Software Engineering
IEEE-CS: Computer Society

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

ICSE '18

Sponsor:

SIGSOFT
IEEE-CS

ICSE '18: 40th International Conference on Software Engineering

May 28 - 29, 2018

Gothenburg, Sweden

Bibliometrics & Citations

Article Metrics

Total Citations: 19
Total Downloads: 416
Downloads (last 12 months): 42
Downloads (last 6 weeks): 6

Reflects downloads up to 03 Mar 2025

Citations

Cited By

Pinho G, Caçula A, Costa L, Wiese I, Araújo A (2024). Challenges and Solutions of Free and Open Source Software Documentation: A Systematic Mapping Study. Anais do XXXVIII Simpósio Brasileiro de Engenharia de Software (SBES 2024), 114-125. Online publication date: 30-Sep-2024
https://doi.org/10.5753/sbes.2024.3307
Cao S, Chen R, Pan M, Yang W, Li X, Filkov V, Ray B, Zhou M (2024). Beyond Manual Modeling: Automating GUI Model Generation Using Design Documents. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, 91-103. Online publication date: 27-Oct-2024
https://dl.acm.org/doi/10.1145/3691620.3695032
Berhe S, Kan V, Khan O, Pader N, Farooqui A, Maynard M, Khomh F (2024). Triage Software Update Impact via Release Notes Classification. Procedia Computer Science 238, 618-622. Online publication date: 2024
https://doi.org/10.1016/j.procs.2024.06.069
Sharma T, Kechagia M, Georgiou S, Tiwari R, Vats I, Moazen H, Sarro F (2024). A survey on machine learning techniques applied to source code. Journal of Systems and Software 209:C. Online publication date: 14-Mar-2024
https://dl.acm.org/doi/10.1016/j.jss.2023.111934
Li T, Luo J, Liang K, Yi C, Ma L (2023). Synergy of Patent and Open-Source-Driven Sustainable Climate Governance under Green AI: A Case Study of TinyML. Sustainability 15:18, 13779. Online publication date: 15-Sep-2023
https://doi.org/10.3390/su151813779
Fronchetti F, Shepherd D, Wiese I, Treude C, Gerosa M, Steinmacher I, Chandra S, Blincoe K, Tonella P (2023). Do CONTRIBUTING Files Provide Information about OSS Newcomers’ Onboarding Barriers? Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 16-28. Online publication date: 30-Nov-2023
https://dl.acm.org/doi/10.1145/3611643.3616288
Nurwidyantoro A, Shahin M, Chaudron M, Hussain W, Perera H, Shams R, Whittle J (2023). Integrating human values in software development using a human values dashboard. Empirical Software Engineering 28:3. Online publication date: 18-Apr-2023
https://dl.acm.org/doi/10.1007/s10664-023-10305-y
Seker A, Diri B, Arslan H, Amasyalı M (2022). Open Source Software Development Challenges. Research Anthology on Agile Software, Software Development, and Testing, 2134-2164. Online publication date: 2022
https://doi.org/10.4018/978-1-6684-3702-5.ch102
Hamdi M (2022). Towards a classification of sustainable software development process using manifold machine learning techniques. Journal of Intelligent & Fuzzy Systems: Applications in Engineering and Technology 42:6, 6183-6194. Online publication date: 1-Jan-2022
https://dl.acm.org/doi/10.3233/JIFS-212600
Wang K, Yan M, Zhang H, Hu H, Rastogi A, Tufano R, Bavota G, Arnaoudova V, Haiduc S (2022). Unified abstract syntax tree representation learning for cross-language program classification. Proceedings of the 30th IEEE/ACM International Conference on Program Comprehension, 390-400. Online publication date: 16-May-2022
https://dl.acm.org/doi/10.1145/3524610.3527915
