research-article

Mining business topics in source code using latent dirichlet allocation

Authors: Girish Maskeri, Santonu Sarkar, Kenneth Heafield

ISEC '08: Proceedings of the 1st India software engineering conference

Pages 113 - 120

https://doi.org/10.1145/1342211.1342234

Published: 19 February 2008

Abstract

One of the difficulties in maintaining a large software system is the absence of documented business domain topics and of a correlation between these domain topics and the source code. Without such a correlation, people with no prior knowledge of the application find it hard to comprehend the functionality of the system. Latent Dirichlet Allocation (LDA), a statistical model, has emerged as a popular technique for discovering topics in large corpora of text documents, but its applicability to extracting business domain topics from source code has not been explored so far. This paper investigates LDA in the context of comprehending large software systems and proposes a human-assisted approach based on LDA for extracting domain topics from source code. The method has been applied to a number of open source and proprietary systems. Preliminary results indicate that LDA is able to identify some of the domain topics and is a satisfactory starting point for further manual refinement of topics.
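To make the kind of topic model named in the abstract more concrete, the following is a minimal sketch of LDA fitted by collapsed Gibbs sampling in plain Python. The abstract does not say which inference procedure or parameter values the authors used, so the sampler, the function names (`lda_gibbs`, `top_words`), the hyperparameters `alpha` and `beta`, and the toy token lists are all assumptions made purely for illustration.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Fit LDA by collapsed Gibbs sampling (illustrative sketch only).

    docs is a list of token lists, e.g. one list of words per source file.
    Returns (doc_topic, topic_word, topic_total) count structures.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    doc_topic = [[0] * num_topics for _ in docs]         # tokens of doc d in topic k
    topic_word = [Counter() for _ in range(num_topics)]  # count of word w in topic k
    topic_total = [0] * num_topics                       # tokens assigned to topic k
    assign = []                                          # topic of every token

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        topics = []
        for w in doc:
            k = rng.randrange(num_topics)
            topics.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(topics)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1

                # Unnormalised conditional probability of each topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]

                # Record the newly sampled assignment.
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    return doc_topic, topic_word, topic_total


def top_words(topic_word, topic, n=5):
    """Return up to n most frequent words currently assigned to a topic."""
    return [w for w, c in topic_word[topic].most_common(n) if c > 0]


# Hypothetical token lists, standing in for words extracted from three files.
example_docs = [
    ["account", "balance", "loan", "interest"],
    ["render", "pixel", "canvas", "pixel"],
    ["loan", "interest", "account", "customer"],
]
_, example_topic_word, _ = lda_gibbs(example_docs, num_topics=2, iters=50)
for topic in range(2):
    print(topic, top_words(example_topic_word, topic))
```

The word lists printed for each topic are what a human would then inspect and label; on a corpus this small they are not meaningful, and the number of topics has to be chosen by hand.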



Index Terms

Mining business topics in source code using latent dirichlet allocation
1. Applied computing
  1. Document management and text processing
2. Software and its engineering
  1. Software creation and management
    1. Software post-development issues
      1. Software reverse engineering


Reviews

Reviewer: Andrew Brooks

Techniques for automatic topic extraction from text are well established, but can they be applied to source code to identify undocumented domain topics to ease the job of software maintenance? Seeking an answer to this question, the authors applied the latent Dirichlet allocation (LDA) statistical model to the source code of several systems. Their approach is, as yet, not fully automatic. They describe how human intervention is required to iteratively adjust the values of several input parameters, to establish stop-word lists to filter out programming words, and to properly label the topics output from LDA. Only a few example results are provided showing how LDA can identify topics. Attempts to automatically derive the optimal number of topics to be extracted were not without problems: a maximum likelihood method is reported as suggesting six times as many topics as human experts felt were reasonable. This exploratory paper does not present detailed results from systematic experimentation. The reader, for example, is not informed of the fraction of identified topics that could be considered useful in software maintenance. The reader is informed of the results of a very limited sensitivity analysis that considered just two values for the number of topics to be extracted (see Table 5). What this exploratory paper does provide, however, are insights into the complexity of LDA and the difficulties of making an LDA approach work automatically when applied to source code. As such, it is recommended only to those researching automatic topic extraction from source code.

Online Computing Reviews Service
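One of the manual steps the review mentions, building stop-word lists to filter out programming words, could look roughly like the sketch below. It turns a source snippet into a bag of words by splitting identifiers and dropping programming vocabulary. The stop-word set, the splitting rules, the minimum word length, and the function names (`split_identifier`, `source_to_tokens`) are hypothetical; the review does not list the choices the authors actually made.

```python
import re

# Hypothetical stop-word list of programming vocabulary; a real list would be
# refined iteratively by a person who knows the language and the code base.
PROGRAMMING_STOP_WORDS = {
    "get", "set", "int", "string", "void", "return", "public", "private",
    "class", "new", "else", "for", "while", "null", "this", "static",
    "final", "import",
}


def split_identifier(identifier):
    """Split a camelCase or snake_case identifier into lowercase words."""
    words = []
    for part in re.split(r"[^A-Za-z]+", identifier):
        # An acronym run (e.g. "HTTP") or a capitalised/lowercase word.
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", part))
    return [w.lower() for w in words]


def source_to_tokens(source, stop_words=PROGRAMMING_STOP_WORDS, min_length=3):
    """Extract candidate domain words from a piece of source text."""
    tokens = []
    for identifier in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source):
        for word in split_identifier(identifier):
            if len(word) >= min_length and word not in stop_words:
                tokens.append(word)
    return tokens


snippet = "public int getAccountBalance(Customer c) { return c.balance; }"
print(source_to_tokens(snippet))
```

Token lists produced this way, one per file or per function, would be the documents handed to a topic model. Which words belong in the stop-word list is exactly the kind of judgment the review says still requires a human.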


Published In

ISEC '08: Proceedings of the 1st India software engineering conference

February 2008

164 pages

ISBN: 9781595939173

DOI: 10.1145/1342211

General Chair: Gautam Shroff (TCS, India)

Program Chairs: Pankaj Jalote (Indian Institute of Technology Delhi, India), Sriram Rajamani (Microsoft Research, India)

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

Research-article

Conference

ISEC08: India Software Engineering Conference

February 19 - 22, 2008

Hyderabad, India

Acceptance Rates

Overall Acceptance Rate 76 of 315 submissions, 24%

Article Metrics

Total Citations: 101

Total Downloads: 1,265

Downloads (last 12 months): 32

Downloads (last 6 weeks): 1

Reflects downloads up to 09 Jan 2025

Cited By

Van Schothorst C, Schuurmans R, Jansen S (2024). Software Ecosystem Orchestration with Topic Modeling. Proceedings of the 7th ACM/IEEE International Workshop on Software-intensive Business, pages 72-78. Online publication date: 16-Apr-2024.
https://dl.acm.org/doi/10.1145/3643690.3648245

Lin Y, Zhou Y (2023). Identification of Hydrogen-Energy-Related Emerging Technologies Based on Text Mining. Sustainability, 16:1 (147). Online publication date: 22-Dec-2023.
https://doi.org/10.3390/su16010147

Yun H, Kwon H (2023). Neighborhood Identity Formation and the Changes in an Urban Regeneration Neighborhood in Gwangju, Korea. Sustainability, 15:15 (11792). Online publication date: 31-Jul-2023.
https://doi.org/10.3390/su151511792

Krasniqi R, Do H (2023). Towards semantically enhanced detection of emerging quality-related concerns in source code. Software Quality Journal, 31:3 (865-915). Online publication date: 17-Feb-2023.
https://doi.org/10.1007/s11219-023-09614-8

Guo J, Liu J, Liu X, Wan Y, Zhao Y, Li L, Liu K, Klein J, Bissyandé T (2023). PyScribe – Learning to describe python code. Software: Practice and Experience, 54:3 (501-527). Online publication date: 9-Dec-2023.
https://doi.org/10.1002/spe.3291

Wen R, Lamothe M, McIntosh S, Harman M, Miller H (2022). How does code reviewing feedback evolve? Proceedings of the 44th International Conference on Software Engineering: Software Engineering in Practice, pages 151-160. Online publication date: 21-May-2022.
https://dl.acm.org/doi/10.1145/3510457.3513039

Wen R, Lamothe M, McIntosh S (2022). How Does Code Reviewing Feedback Evolve?: A Longitudinal Study at Dell EMC. 2022 IEEE/ACM 44th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), pages 151-160. Online publication date: May-2022.
https://doi.org/10.1109/ICSE-SEIP55303.2022.9793920

Lal S, Sardana N, Sureka A (2021). Logging Analysis and Prediction in Open Source Java Project. Research Anthology on Usage and Development of Open Source Software, pages 733-761. Online publication date: 2021.
https://doi.org/10.4018/978-1-7998-9158-1.ch038

Gao X, Jiang X, Wu Q, Wang X, Lyu L, Lyu C (2021). A Multi-Module Based Method for Generating Natural Language Descriptions of Code Fragments. IEEE Access, 9 (21579-21592). Online publication date: 2021.
https://doi.org/10.1109/ACCESS.2021.3055955

Pérez F, Lapeña R, Marcén A, Cetina C (2021). Topic modeling for feature location in software models. Information and Software Technology, 140:C. Online publication date: 1-Dec-2021.
https://dl.acm.org/doi/10.1016/j.infsof.2021.106676
