On using machine learning to automatically classify software applications into domain categories

Mario Linares-Vásquez¹,
Collin McMillan²,
Denys Poshyvanyk¹ &
Mark Grechanik³

Abstract

Software repositories hold applications that are often categorized to improve the effectiveness of various maintenance tasks. Properly categorized applications allow stakeholders to identify requirements related to their applications and predict maintenance problems in software projects. Manual categorization is expensive, tedious, and laborious, which is why automatic categorization approaches are gaining widespread importance. Unfortunately, for various legal and organizational reasons, the applications’ source code is often not available, which makes it difficult to categorize these applications automatically. In this paper, we propose a novel approach that uses Application Programming Interface (API) calls from third-party libraries to automatically categorize the software applications that make these calls. Our approach is general: because API calls can be extracted from both source code and bytecode, different categorization algorithms can be applied to repositories that contain both source code and bytecode of applications. We compare our approach to a state-of-the-art approach that uses machine learning algorithms for software categorization, and conduct experiments on two large Java repositories: an open-source repository containing 3,286 projects and a closed-source repository with 745 applications, where the source code was not available. Our contribution is twofold: we propose a new approach that makes it possible to categorize software projects without any source code using a small number of API calls as attributes, and we carry out a comprehensive empirical evaluation of automatic categorization approaches.
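The idea of using API calls as attributes can be illustrated with a minimal sketch. It is not the implementation evaluated in the paper: the package names, the category labels, the toy training set, and the helper functions `api_features`, `jaccard`, and `classify` are all hypothetical, and a simple nearest-neighbour vote stands in for the machine learning algorithms that the study compares. Each application is reduced to a binary vector that records which API packages it uses, so the same representation can be built from source code or from bytecode.

```python
from collections import Counter


def api_features(api_calls, vocabulary):
    """Binary attribute vector: 1 if the application uses the API package, else 0."""
    used = set(api_calls)
    return [1 if api in used else 0 for api in vocabulary]


def jaccard(a, b):
    """Jaccard similarity between two binary vectors of equal length."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0


def classify(query, labeled, k=3):
    """Majority vote among the k labeled applications most similar to the query."""
    ranked = sorted(labeled, key=lambda item: jaccard(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical attribute set and training data, for illustration only.
vocabulary = ["java.sql", "javax.swing", "java.net", "javax.sound.sampled"]
training = [
    (api_features(["java.sql", "java.net"], vocabulary), "Database"),
    (api_features(["java.sql"], vocabulary), "Database"),
    (api_features(["javax.swing", "javax.sound.sampled"], vocabulary), "Multimedia"),
    (api_features(["javax.sound.sampled"], vocabulary), "Multimedia"),
]

# An application whose source code is unavailable is described only by its API usage.
query = api_features(["java.sql", "javax.swing"], vocabulary)
predicted = classify(query, training, k=3)
print(predicted)
```

In this toy setting the query shares `java.sql` with both "Database" examples, so two of its three nearest neighbours carry that label.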

Notes

Accenture policy 69 states that source code constitutes confidential information because it is information or material, not generally available to the public, that is generated, collected or used by the Company and that relates to its business, research and development activities, clients, or employees.
http://www.cs.wm.edu/semeru/catml_ese/ (verified on 05/07/2012)
http://sharejar.com/ (verified on 05/07/2012)
http://sourceforge.net/ (verified on 05/07/2012)
http://www.ibiblio.org/ (verified on 05/07/2012)
http://www.oracle.com/technetwork/java/javase/downloads/ (verified on 05/07/2012)
http://jclassinfo.sourceforge.net/ (verified on 05/07/2012)
http://pmd.sourceforge.net/ (verified on 05/07/2012)
http://weka.sourceforge.net/ (verified on 05/07/2012)
http://www.csie.ntu.edu.tw/~cjlin/libsvm/ (verified on 05/07/2012)
This research question was added once we analyzed the results of the study and found that SVM was the top-performing ML algorithm across a range of parameters and settings.
By “consistency” we mean that all algorithms have the best performance for the same attribute set.
We did not use typical effect size estimators such as Cohen’s d, because we do not assume that the samples are drawn from normal distributions.
For the k parameter in the IBK algorithm we used odd values to avoid tied votes.
In Fig. 2a, b, and c, the values of SVM above the first quartile are higher than the third quartile of the other algorithms. In Fig. 2d, the values of SVM below the third quartile are lower than the first quartile of the other algorithms.
The Q_critical and Q_observed values are provided in our online appendix.
We do not show any comparison between DT, NB, JR, IP, and IBK because those algorithms performed less well than SVM.
http://www.oracle.com/technetwork/java/javame/javamobile/documentation/index.html (verified on 05/07/2012)
0.671 (Average PREC), 0.671 (Average REC), 0.671 (Average F-measure), 0.024 (Average FPR)
http://sourceforge.net/apps/trac/sourceforge/wiki/Software%20Map%20and%20Trove (verified on 05/07/2012)
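The averaged measures reported in the notes above (PREC, REC, F-measure, FPR) can be computed per category and then averaged. The sketch below uses the standard definitions of these measures. The unweighted (macro) averaging over categories is an assumption, since the notes do not state how the averages were formed, and the function names and toy labels are hypothetical.

```python
def per_category_metrics(actual, predicted, category):
    """Precision, recall, F-measure and false positive rate for one category."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == category and p == category)
    fp = sum(1 for a, p in pairs if a != category and p == category)
    fn = sum(1 for a, p in pairs if a == category and p != category)
    tn = len(pairs) - tp - fp - fn
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f_measure = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return prec, rec, f_measure, fpr


def macro_average(actual, predicted):
    """Unweighted mean of each measure over all categories (assumed averaging)."""
    categories = sorted(set(actual))
    rows = [per_category_metrics(actual, predicted, c) for c in categories]
    return tuple(sum(column) / len(rows) for column in zip(*rows))


# Hypothetical labels for four applications, for illustration only.
actual = ["Games", "Games", "Database", "Database"]
predicted = ["Games", "Database", "Database", "Database"]

avg_prec, avg_rec, avg_f, avg_fpr = macro_average(actual, predicted)
print(avg_prec, avg_rec, avg_f, avg_fpr)
```

A low average FPR alongside moderate precision and recall, as in the reported values, is expected when there are many categories: most applications are true negatives for any single category.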

Acknowledgements

We are grateful to anonymous EMSE and ICSM’11 reviewers for their relevant detailed comments and suggestions, which helped us in significantly improving the initial version of this paper. We also would like to thank Marty White from the College of William and Mary for his useful suggestions that we implemented in this paper. This work is supported in part by NSF CCF-1016868, CCF-0916260, CCF-0916139, CCF-1017633, CCF-1218129 and Accenture. Any opinions, findings and conclusions expressed herein are the authors’ and do not necessarily reflect those of the sponsors.

Author information

Authors and Affiliations

The College of William and Mary, Williamsburg, VA, 23185, USA
Mario Linares-Vásquez & Denys Poshyvanyk
University of Notre Dame, Notre Dame, IN, 46556, USA
Collin McMillan
University of Illinois at Chicago, Chicago, IL, 60612, USA
Mark Grechanik

Corresponding author

Correspondence to Denys Poshyvanyk.

Additional information

Editor: Paolo Tonella

About this article

Cite this article

Linares-Vásquez, M., McMillan, C., Poshyvanyk, D. et al. On using machine learning to automatically classify software applications into domain categories. Empir Software Eng 19, 582–618 (2014). https://doi.org/10.1007/s10664-012-9230-z

Published: 10 October 2012
Issue Date: June 2014
DOI: https://doi.org/10.1007/s10664-012-9230-z
