DOI: 10.1145/1342211.1342234
research-article

Mining business topics in source code using latent Dirichlet allocation

Published: 19 February 2008

Abstract

One of the difficulties in maintaining a large software system is the absence of documented business domain topics and of a correlation between these domain topics and the source code. Without such a correlation, people with no prior knowledge of the application find it hard to comprehend the functionality of the system. Latent Dirichlet Allocation (LDA), a statistical model, has emerged as a popular technique for discovering topics in large corpora of text documents, but its applicability to extracting business domain topics from source code has not been explored so far. This paper investigates LDA in the context of comprehending large software systems and proposes a human-assisted approach based on LDA for extracting domain topics from source code. The method has been applied to a number of open source and proprietary systems. Preliminary results indicate that LDA is able to identify some of the domain topics and is a satisfactory starting point for further manual refinement of topics.
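To make the idea of discovering topics with LDA concrete, the following is a minimal sketch of LDA fitted by collapsed Gibbs sampling on a toy corpus. It is not the authors' implementation: the sampler choice, the hyperparameters `alpha` and `beta`, the function name `lda_gibbs`, and the toy "source file" token lists are all assumptions made for illustration.

```python
import random

def lda_gibbs(docs, num_topics, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Fit LDA with collapsed Gibbs sampling (illustrative sketch only).

    docs is a list of token lists. Returns (phi, theta): phi[k] maps each
    word to its probability under topic k, and theta[d][k] is the weight
    of topic k in document d.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)

    # Count tables updated during sampling.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    phi = [
        {w: (topic_word[k][w] + beta) / (topic_total[k] + vocab_size * beta)
         for w in vocab}
        for k in range(num_topics)
    ]
    theta = [
        [(doc_topic[d][k] + alpha) / (len(doc) + num_topics * alpha)
         for k in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    return phi, theta

# Hypothetical token lists, one per source file (invented for this sketch).
docs = [
    ["account", "balance", "deposit", "account", "withdraw"],
    ["customer", "address", "customer", "name", "phone"],
    ["account", "withdraw", "balance", "interest"],
    ["customer", "name", "address", "email"],
]
phi, theta = lda_gibbs(docs, num_topics=2)
for k, dist in enumerate(phi):
    top_words = sorted(dist, key=dist.get, reverse=True)[:3]
    print(f"topic {k}: {top_words}")
```

The highest-probability words of each topic are what a human would then inspect and label as a candidate domain topic. Which words end up in which topic depends on the seed and the hyperparameters, so the printed lists should be read as a starting point rather than a fixed result.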

Reviews

Andrew Brooks

Techniques for automatic topic extraction from text are well established, but can they be applied to source code to identify undocumented domain topics to ease the job of software maintenance? Seeking an answer to this question, the authors applied the latent Dirichlet allocation (LDA) statistical model to the source code of several systems. Their approach is, as yet, not fully automatic. They describe how human intervention is required to iteratively adjust the values of several input parameters, to establish stop-word lists to filter out programming words, and to properly label the topics output from LDA. Only a few example results are provided showing how LDA can identify topics. Attempts to automatically derive the optimal number of topics to be extracted were not without problems: a maximum likelihood method is reported as suggesting six times as many topics as human experts felt were reasonable. This exploratory paper does not present detailed results from systematic experimentation. The reader, for example, is not informed of the fraction of identified topics that could be considered useful in software maintenance. The reader is informed of the results of a very limited sensitivity analysis that considered just two values for the number of topics to be extracted (see Table 5). What this exploratory paper does provide, however, are insights into the complexity of LDA and the difficulties of making an LDA approach work automatically when applied to source code. As such, it is recommended only to those researching automatic topic extraction from source code.

Online Computing Reviews Service
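The stop-word filtering mentioned in the review can be sketched as follows. This is an illustration only, not the paper's procedure: the stop-word list, the identifier-splitting rules, and the names `PROGRAMMING_STOP_WORDS`, `split_identifier`, and `extract_terms` are all hypothetical.

```python
import re

# Hypothetical stop-word list of programming terms. In practice such a list
# would be built by hand for each language and project.
PROGRAMMING_STOP_WORDS = {
    "get", "set", "int", "string", "void", "public", "private",
    "return", "class", "new", "null", "static", "final", "this",
}

def split_identifier(identifier):
    """Split a snake_case or camelCase identifier into lowercase words."""
    words = []
    # Break on anything that is not a letter (underscores, digits, ...).
    for part in re.split(r"[^A-Za-z]+", identifier):
        # Keep acronyms such as HTTP together, then split on case changes.
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", part))
    return [w.lower() for w in words]

def extract_terms(source):
    """Return the words of a source snippet that are not programming stop words."""
    terms = []
    for identifier in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source):
        for word in split_identifier(identifier):
            if len(word) > 1 and word not in PROGRAMMING_STOP_WORDS:
                terms.append(word)
    return terms

# Hypothetical snippet, invented for this sketch.
source = (
    "public int getAccountBalance(Customer customer) "
    "{ return customer.account_balance; }"
)
print(extract_terms(source))
# ['account', 'balance', 'customer', 'customer', 'customer', 'account', 'balance']
```

The remaining words are the kind of input a topic model would receive. Deciding which words belong on the stop-word list is one of the manual steps the review describes.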

Published In

ISEC '08: Proceedings of the 1st India software engineering conference
February 2008
164 pages
ISBN:9781595939173
DOI:10.1145/1342211
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. LDA
  2. maintenance
  3. program comprehension

Qualifiers

  • Research-article

Conference

ISEC08: India Software Engineering Conference
February 19-22, 2008
Hyderabad, India

Acceptance Rates

Overall Acceptance Rate 76 of 315 submissions, 24%

Article Metrics

  • Downloads (last 12 months): 32
  • Downloads (last 6 weeks): 1

Reflects downloads up to 09 Jan 2025

Cited By

  • (2024) Software Ecosystem Orchestration with Topic Modeling. Proceedings of the 7th ACM/IEEE International Workshop on Software-intensive Business, pages 72-78. DOI: 10.1145/3643690.3648245. Online publication date: 16-Apr-2024
  • (2023) Identification of Hydrogen-Energy-Related Emerging Technologies Based on Text Mining. Sustainability, 16:1 (147). DOI: 10.3390/su16010147. Online publication date: 22-Dec-2023
  • (2023) Neighborhood Identity Formation and the Changes in an Urban Regeneration Neighborhood in Gwangju, Korea. Sustainability, 15:15 (11792). DOI: 10.3390/su151511792. Online publication date: 31-Jul-2023
  • (2023) Towards semantically enhanced detection of emerging quality-related concerns in source code. Software Quality Journal, 31:3 (865-915). DOI: 10.1007/s11219-023-09614-8. Online publication date: 17-Feb-2023
  • (2023) PyScribe: Learning to describe python code. Software: Practice and Experience, 54:3 (501-527). DOI: 10.1002/spe.3291. Online publication date: 9-Dec-2023
  • (2022) How does code reviewing feedback evolve? Proceedings of the 44th International Conference on Software Engineering: Software Engineering in Practice, pages 151-160. DOI: 10.1145/3510457.3513039. Online publication date: 21-May-2022
  • (2022) How Does Code Reviewing Feedback Evolve?: A Longitudinal Study at Dell EMC. 2022 IEEE/ACM 44th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), pages 151-160. DOI: 10.1109/ICSE-SEIP55303.2022.9793920. Online publication date: May-2022
  • (2021) Logging Analysis and Prediction in Open Source Java Project. Research Anthology on Usage and Development of Open Source Software, pages 733-761. DOI: 10.4018/978-1-7998-9158-1.ch038. Online publication date: 2021
  • (2021) A Multi-Module Based Method for Generating Natural Language Descriptions of Code Fragments. IEEE Access, 9 (21579-21592). DOI: 10.1109/ACCESS.2021.3055955. Online publication date: 2021
  • (2021) Topic modeling for feature location in software models. Information and Software Technology, 140:C. DOI: 10.1016/j.infsof.2021.106676. Online publication date: 1-Dec-2021