[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1109/ICSE.2019.00086acmconferencesArticle/Chapter ViewAbstractPublication PagesicseConference Proceedingsconference-collections
research-article

A novel neural source code representation based on abstract syntax tree

Published: 25 May 2019 Publication History

Abstract

Exploiting machine learning techniques for analyzing programs has attracted much attention. One key problem is how to represent code fragments well for follow-up analysis. Traditional information retrieval based methods often treat programs as natural language texts, which could miss important semantic information of source code. Recently, state-of-the-art studies demonstrate that abstract syntax tree (AST) based neural models can better represent source code. However, the sizes of ASTs are usually large and the existing models are prone to the long-term dependency problem. In this paper, we propose a novel AST-based Neural Network (ASTNN) for source code representation. Unlike existing models that work on entire ASTs, ASTNN splits each large AST into a sequence of small statement trees, and encodes the statement trees to vectors by capturing the lexical and syntactical knowledge of statements. Based on the sequence of statement vectors, a bidirectional RNN model is used to leverage the naturalness of statements and finally produce the vector representation of a code fragment. We have applied our neural network based source code representation method to two common program comprehension tasks: source code classification and code clone detection. Experimental results on the two tasks indicate that our model is superior to state-of-the-art approaches.

References

[1]
G. Frantzeskou, S. MacDonell, E. Stamatatos, and S. Gritzalis, "Examining the significance of high-level programming features in source code author classification," Journal of Systems and Software, vol. 81, no. 3, pp. 447--460, 2008.
[2]
L. Mou, G. Li, L. Zhang, T. Wang, and Z. Jin, "Convolutional neural networks over tree structures for programming language processing," in AAAI, vol. 2, no. 3, 2016, p. 4.
[3]
T. Kamiya, S. Kusumoto, and K. Inoue, "CCFinder: a multilinguistic token-based code clone detection system for large scale source code," IEEE Transactions on Software Engineering, vol. 28, no. 7, pp. 654--670, 2002.
[4]
H. Sajnani, V. Saini, J. Svajlenko, C. K. Roy, and C. V. Lopes, "SourcererCC: Scaling code clone detection to big-code," in Software Engineering (ICSE), 2016 IEEE/ACM 38th International Conference on. IEEE, 2016, pp. 1157--1168.
[5]
M. White, M. Tufano, C. Vendome, and D. Poshyvanyk, "Deep learning code fragments for code clone detection," in Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering. ACM, 2016, pp. 87--98.
[6]
H.-H. Wei and M. Li, "Supervised deep features for software functional clone detection by exploiting lexical and syntactical information in source code," in Proceedings of the 26th International Joint Conference on Artificial Intelligence. AAAI Press, 2017, pp. 3034--3040.
[7]
M. DAmbros, M. Lanza, and R. Robbes, "Evaluating defect prediction approaches: a benchmark and an extensive comparison," Empirical Software Engineering, vol. 17, no. 4--5, pp. 531--577, 2012.
[8]
C. Tantithamthavorn, S. McIntosh, A. E. Hassan, and K. Matsumoto, "An empirical comparison of model validation techniques for defect prediction models," IEEE Transactions on Software Engineering, vol. 43, no. 1, pp. 1--18, 2017.
[9]
S. Haiduc, J. Aponte, and A. Marcus, "Supporting program comprehension with source code summarization," in Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering-Volume 2. ACM, 2010, pp. 223--226.
[10]
S. Jiang, A. Armaly, and C. McMillan, "Automatically generating commit messages from diffs using neural machine translation," in Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering. IEEE Press, 2017, pp. 135--146.
[11]
J. Zhou, H. Zhang, and D. Lo, "Where should the bugs be fixed? more accurate information retrieval-based bug localization based on bug reports," in 2012 34th International Conference on Software Engineering (ICSE), June 2012, pp. 14--24.
[12]
S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, "Indexing by latent semantic analysis," Journal of the American society for information science, vol. 41, no. 6, p. 391, 1990.
[13]
D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent dirichlet allocation," Journal of machine Learning research, vol. 3, no. Jan, pp. 993--1022, 2003.
[14]
R. Tairas and J. Gray, "An information retrieval process to aid in the analysis of code clones," Empirical Software Engineering, vol. 14, no. 1, pp. 33--56, 2009.
[15]
Y. Liu, D. Poshyvanyk, R. Ferenc, T. Gyimóthy, and N. Chrisochoides, "Modeling class cohesion as mixtures of latent topics," in Software Maintenance, 2009. ICSM 2009. IEEE International Conference on. IEEE, 2009, pp. 233--242.
[16]
A. De Lucia, M. Di Penta, R. Oliveto, A. Panichella, and S. Panichella, "Using IR methods for labeling source code artifacts: Is it worthwhile?" in Program Comprehension (ICPC), 2012 IEEE 20th International Conference on. IEEE, 2012, pp. 193--202.
[17]
A. Panichella, B. Dit, R. Oliveto, M. Di Penta, D. Poshynanyk, and A. De Lucia, "How to effectively use topic models for software engineering tasks? an approach based on genetic algorithms," in Software Engineering (ICSE), 2013 35th International Conference on. IEEE, 2013, pp. 522--531.
[18]
J. F. Pane, B. A. Myers et al., "Studying the language and structure in non-programmers solutions to programming problems," International Journal of Human-Computer Studies, vol. 54, no. 2, pp. 237--264, 2001.
[19]
Y. Bengio, P. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE transactions on neural networks, vol. 5, no. 2, pp. 157--166, 1994.
[20]
S. Hochreiter, "The vanishing gradient problem during learning recurrent neural nets and problem solutions," Int. J. Uncertain. Fuzziness Knowl.-Based Syst., vol. 6, no. 2, pp. 107--116, Apr. 1998. {Online}. Available
[21]
P. Le and W. H. Zuidema, "Quantifying the vanishing gradient and long distance dependency problem in recursive neural networks and recursive LSTMs," in Proceedings of the 1st Workshop on Representation Learning for NLP, Berlin, Germany, 2016, p. 8793.
[22]
S.-Y. Cho, Z. Chi, W.-C. Siu, and A. C. Tsoi, "An improved algorithm for learning long-term dependency problems in adaptive processing of data structures," IEEE Transactions on Neural Networks, vol. 14, no. 4, pp. 781--793, 2003.
[23]
X. Zhu, P. Sobhani, and H. Guo, "Long short-term memory over recursive structures," in Proceedings of the 32Nd International Conference on International Conference on Machine Learning - Volume 37, ser. ICML'15. JMLR.org, 2015, pp. 1604--1612. {Online}. Available: http://dl.acm.org/citation.cfm?id=3045118.3045289
[24]
M. Ou, P. Cui, J. Pei, Z. Zhang, and W. Zhu, "Asymmetric transitivity preserving graph embedding," in Proceedings of the 22Nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD '16. New York, NY, USA: ACM, 2016, pp. 1105--1114. {Online}. Available
[25]
M. Allamanis, M. Brockschmidt, and M. Khademi, "Learning to represent programs with graphs," in International Conference on Learning Representations, 2018. {Online}. Available: https://openreview.net/forum?id=BJOFETxR-
[26]
G. B. M. D. P. M. W. Michele Tufano, Cody Watson and D. Poshyvanyk, "Deep learning similarities from different representations of source code," in 15th International Conference on Mining Software Repositories, 2018.
[27]
J. Ferrante, K. J. Ottenstein, and J. D. Warren, "The program dependence graph and its use in optimization," ACM Trans. Program. Lang. Syst., vol. 9, no. 3, pp. 319--349, Jul. 1987. {Online}. Available
[28]
E. M. Myers, "A precise inter-procedural data flow algorithm," in Proceedings of the 8th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, ser. POPL '81. New York, NY, USA: ACM, 1981, pp. 219--230. {Online}. Available
[29]
"Llvms analysis and transform passes," https://llvm.org/docs/Passes.html, accessed August 3, 2018.
[30]
J. Gosling, B. Joy, G. L. Steele, G. Bracha, and A. Buckley, The Java Language Specification, Java SE 8 Edition, 1st ed. Addison-Wesley Professional, 2014.
[31]
T. Mikolov, M. Karafiát, L. Burget, J. Černockỳ, and S. Khudanpur, "Recurrent neural network based language model," in Eleventh Annual Conference of the International Speech Communication Association, 2010.
[32]
A. Hindle, E. T. Barr, Z. Su, M. Gabel, and P. Devanbu, "On the naturalness of software," in Software Engineering (ICSE), 2012 34th International Conference on. IEEE, 2012, pp. 837--847.
[33]
B. Ray, V. Hellendoorn, S. Godhane, Z. Tu, A. Bacchelli, and P. Devanbu, "On the "naturalness" of buggy code," in Proceedings of the 38th International Conference on Software Engineering, ser. ICSE '16. New York, NY, USA: ACM, 2016, pp. 428--439. {Online}. Available
[34]
D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," arXiv preprint arXiv:1409.0473, 2014.
[35]
D. Tang, B. Qin, and T. Liu, "Document modeling with gated recurrent neural network for sentiment classification," in Proceedings of the 2015 conference on empirical methods in natural language processing, 2015, pp. 1422--1432.
[36]
I. D. Baxter, A. Yahin, L. Moura, M. Sant' Anna, and L. Bier, "Clone detection using abstract syntax trees," in Software Maintenance, 1998. Proceedings., International Conference on. IEEE, 1998, pp. 368--377.
[37]
S. Paul and A. Prakash, "A framework for source code search using program patterns," IEEE Transactions on Software Engineering, vol. 20, no. 6, pp. 463--475, 1994.
[38]
W. Weimer, T. Nguyen, C. Le Goues, and S. Forrest, "Automatically finding patches using genetic programming," in Proceedings of the 31st International Conference on Software Engineering. IEEE Computer Society, 2009, pp. 364--374.
[39]
J.-R. Falleri, F. Morandat, X. Blanc, M. Martinez, and M. Monperrus, "Fine-grained and accurate source code differencing," in Proceedings of the 29th ACM/IEEE international conference on Automated software engineering. ACM, 2014, pp. 313--324.
[40]
R. Socher, C. C. Lin, C. Manning, and A. Y. Ng, "Parsing natural scenes and natural language with recursive neural networks," in Proceedings of the 28th international conference on machine learning (ICML-11), 2011, pp. 129--136.
[41]
K. S. Tai, R. Socher, and C. D. Manning, "Improved semantic representations from tree-structured long short-term memory networks," in Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), vol. 1, 2015, pp. 1556--1566.
[42]
T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Advances in neural information processing systems, 2013, pp. 3111--3119.
[43]
X. Gu, H. Zhang, D. Zhang, and S. Kim, "Deep API learning," in Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering. ACM, 2016, pp. 631--642.
[44]
S. Kawaguchi, P. K. Garg, M. Matsushita, and K. Inoue, "Mudablue: An automatic categorization system for open source repositories," Journal of Systems and Software, vol. 79, no. 7, pp. 939--953, 2006.
[45]
M. Linares-Vásquez, C. McMillan, D. Poshyvanyk, and M. Grechanik, "On using machine learning to automatically classify software applications into domain categories," Empirical Software Engineering, vol. 19, no. 3, pp. 582--618, 2014.
[46]
D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[47]
J. Svajlenko, J. F. Islam, I. Keivanloo, C. K. Roy, and M. M. Mia, "Towards a big data curated benchmark of inter-project code clones," in Software Maintenance and Evolution (ICSME), 2014 IEEE International Conference on. IEEE, 2014, pp. 476--480.
[48]
A. Charpentier, J.-R. Falleri, D. Lo, and L. Réveillère, "An empirical assessment of bellon's clone benchmark," in Proceedings of the 19th International Conference on Evaluation and Assessment in Software Engineering. ACM, 2015, p. 20.
[49]
P. Accioly, P. Borba, and G. Cavalcanti, "Understanding semi-structured merge conflict characteristics in open-source java projects," Empirical Software Engineering, pp. 1--35, 2017.
[50]
Y. Kim, "Convolutional neural networks for sentence classification," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1746--1751.
[51]
W. Zaremba and I. Sutskever, "Learning to execute," arXiv preprint arXiv:1410.4615, 2014.
[52]
X. Huo and M. Li, "Enhancing the unified features to locate buggy files by exploiting the sequential nature of source code," in Proceedings of the 26th International Joint Conference on Artificial Intelligence, ser. IJCAI'17. AAAI Press, 2017, pp. 1909--1915. {Online}. Available: http://dl.acm.org/citation.cfm?id=3172077.3172153
[53]
Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel, "Gated graph sequence neural networks," arXiv preprint arXiv:1511.05493, 2015.
[54]
C. K. Roy and J. R. Cordy, "A survey on software clone detection research," Queens School of Computing TR, vol. 541, no. 115, pp. 64--68, 2007.
[55]
L. Jiang, G. Misherghi, Z. Su, and S. Glondu, "DECKARD: Scalable and accurate tree-based detection of code clones," in Proceedings of the 29th International Conference on Software Engineering, ser. ICSE '07. Washington, DC, USA: IEEE Computer Society, 2007, pp. 96--105. {Online}. Available
[56]
J. I. Maletic and A. Marcus, "Supporting program comprehension using semantic and structural information," in Proceedings of the 23rd International Conference on Software Engineering. IEEE Computer Society, 2001, pp. 103--112.
[57]
M. Allamanis, E. T. Barr, P. Devanbu, and C. Sutton, "A survey of machine learning for big code and naturalness," arXiv preprint arXiv:1709.06182, 2017.
[58]
V. Raychev, M. Vechev, and E. Yahav, "Code completion with statistical language models," in Acm Sigplan Notices, vol. 49, no. 6. ACM, 2014, pp. 419--428.
[59]
M. Allamanis, E. T. Barr, C. Bird, and C. Sutton, "Suggesting accurate method and class names," in Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering. ACM, 2015, pp. 38--49.
[60]
S. Wang, T. Liu, and L. Tan, "Automatically learning semantic features for defect prediction," in Proceedings of the 38th International Conference on Software Engineering. ACM, 2016, pp. 297--308.
[61]
M. Pradel and K. Sen, "DeepBugs: A learning approach to name-based bug detection," Proc. ACM Program. Lang., vol. 2, no. OOPSLA, pp. 147:1--147:25, Oct. 2018. {Online}. Available
[62]
G. Zhao and J. Huang, "DeepSim: Deep learning code functional similarity," in Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ser. ESEC/FSE 2018. New York, NY, USA: ACM, 2018, pp. 141--151. {Online}. Available
[63]
A. N. Lam, A. T. Nguyen, H. A. Nguyen, and T. N. Nguyen, "Combining deep learning with information retrieval to localize buggy files for bug reports (n)," in Automated Software Engineering (ASE), 2015 30th IEEE/ACM International Conference on. IEEE, 2015, pp. 476--481.
[64]
B. Xu, D. Ye, Z. Xing, X. Xia, G. Chen, and S. Li, "Predicting semantically linkable knowledge in developer online forums via convolutional neural network," in Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering. ACM, 2016, pp. 51--62.
[65]
J. Guo, J. Cheng, and J. Cleland-Huang, "Semantically enhanced software traceability using deep learning techniques," in Proceedings of the 39th International Conference on Software Engineering. IEEE Press, 2017, pp. 3--14.
[66]
X. Gu, H. Zhang, and S. Kim, "Deep code search," in Proceedings of the 2018 40th International Conference on Software Engineering (ICSE 2018). ACM, 2018.

Cited By

View all

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
ICSE '19: Proceedings of the 41st International Conference on Software Engineering
May 2019
1318 pages

Sponsors

Publisher

IEEE Press

Publication History

Published: 25 May 2019

Check for updates

Badges

Author Tags

  1. abstract syntax tree
  2. code classification
  3. code clone detection
  4. neural network
  5. source code representation

Qualifiers

  • Research-article

Conference

ICSE '19
Sponsor:

Acceptance Rates

Overall Acceptance Rate 276 of 1,856 submissions, 15%

Upcoming Conference

ICSE 2025

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)39
  • Downloads (Last 6 weeks)6
Reflects downloads up to 01 Mar 2025

Other Metrics

Citations

Cited By

View all
  • (2025)Enhancing structural knowledge in code smell identificationExpert Systems with Applications: An International Journal10.1016/j.eswa.2024.125725263:COnline publication date: 5-Mar-2025
  • (2025)Survey of source code vulnerability analysis based on deep learningComputers and Security10.1016/j.cose.2024.104098148:COnline publication date: 1-Jan-2025
  • (2024)FIREProceedings of the 33rd USENIX Conference on Security Symposium10.5555/3698900.3699005(1867-1884)Online publication date: 14-Aug-2024
  • (2024)Large language models for code analysisProceedings of the 33rd USENIX Conference on Security Symposium10.5555/3698900.3698947(829-846)Online publication date: 14-Aug-2024
  • (2024)Automating TODO-missed Methods Detection and PatchingACM Transactions on Software Engineering and Methodology10.1145/370079334:1(1-28)Online publication date: 19-Dec-2024
  • (2024)A Longitudinal Analysis Of Replicas in the Wild Wild AndroidProceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering10.1145/3691620.3695546(1821-1833)Online publication date: 27-Oct-2024
  • (2024)Mutual Learning-Based Framework for Enhancing Robustness of Code Models via Adversarial TrainingProceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering10.1145/3691620.3695519(1484-1496)Online publication date: 27-Oct-2024
  • (2024)AACEGEN: Attention Guided Adversarial Code Example Generation for Deep Code ModelsProceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering10.1145/3691620.3695500(1245-1257)Online publication date: 27-Oct-2024
  • (2024)Trident: Detecting SQL Injection Attacks via Abstract Syntax Tree-based Neural NetworkProceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering10.1145/3691620.3695289(2225-2229)Online publication date: 27-Oct-2024
  • (2024)Detect Hidden Dependency to Untangle CommitsProceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering10.1145/3691620.3694996(179-190)Online publication date: 27-Oct-2024
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media