In this section, we examine approaches that share the same goals, techniques, or application domain as fold2vec.
Code summarization. The non-autoregressive family of models includes code2vec [7] and Paths+CRFs [6], which we have already discussed extensively. Autoregressive models instead produce a sequence of predictions, each conditioned on the previous ones. Although these approaches are promising, they are better suited to tasks where the classification can be split into multiple steps. code2seq [5] and ConvAttention [3] fall into this category. The former is, once again, based on the leaf-to-leaf path representation, whereas the latter is based on source code tokenization. Both use NNs to achieve their purpose: attention-based and convolution-based, respectively. Although we focus on deep learning methods, there are also techniques based on more traditional approaches. For example, Hellendoorn and Devanbu [30] use an n-gram-based language model. Raychev et al. [64] use support vector machines constrained to predict only a few method names. Wang et al. [74] train an autoregressive model with reinforcement learning on two different datasets for Python and Java code.
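To make the distinction concrete, a non-autoregressive summarizer predicts the whole label in a single step, whereas an autoregressive one factorizes the prediction over sub-tokens; a minimal sketch of the two formulations (the notation is ours and is not taken from the cited works) is
\[
\hat{y} = \arg\max_{y} P(y \mid c)
\qquad \text{versus} \qquad
P(y_1, \ldots, y_T \mid c) = \prod_{t=1}^{T} P(y_t \mid y_1, \ldots, y_{t-1}, c),
\]
where \(c\) denotes the encoded code snippet and \(y_1, \ldots, y_T\) are, for example, the sub-tokens of a method name.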
Intent identification. Since intents and concerns can be overlapping concepts, in this section we present some work on concern identification. Kästner et al. [42] proposed a semi-automatic tool for extracting features from code. It relies on a configuration provided by a domain expert to correctly recognize the features. In the same research area, Valente et al. [71] proposed an approach to semi-automatically annotate optional features. UML model variants have been used to automatically identify model-based product lines in the work of Martinez et al. [50]. Qu and Liu [62], Breu and Krinke [13], and Tonella and Ceccato [70] proposed approaches based on program tracing to automatically mine concerns. Moldovan and Şerban [53] proposed a clustering-based approach. All of these are semi-automatic or manually driven approaches to feature identification, whereas fold2vec is automatic; apart from this aspect, however, they share the same goal.
Machine learning on code. Allamanis and Sutton [4] presented an early analysis of source code based on n-gram language models. Their results were used to build a Java source code corpus, which in turn is the basis of the java-large dataset we use. From the same research group, graph NNs were applied to code generation and representation learning [2, 14]. The task of code generation is tackled in the work of Hu et al. [33] using transformer-generated representations of AST paths. For the same task, a massive transformer language model developed in the work of Chen et al. [17] achieved outstanding results. CodeBERT uses a large transformer language model for the task of masked language modeling [25] (sketched at the end of this section). Pre-trained models of the size of CodeBERT usually have high training costs; however, they can be generalized to several tasks through fine-tuning at a lower cost (both in terms of data and hardware). These large models are resource heavy and are usually accessible through online APIs. The main benefit of this approach is that it typically leads to better results than smaller, task-specific models. Again for representation learning, Wang et al. [75, 76] used graph NNs and program traces, whereas Ben-Nun et al. [11] used recurrent architectures trained on features extracted from the LLVM representation of source code. Raychev et al. [65] exploited CRFs on a dependency network built from JavaScript code to automatically predict program properties such as type annotations and variable identifiers. Their approach led to JSNice, which predicts correct name identifiers in \(63\%\) of the cases and correct type annotations in \(81\%\) of the cases. Hu et al. [34], Chen and Zhou [18], and Movshovitz-Attias and Cohen [54] developed autoregressive models to automatically generate comments from source code. Iyer et al. [37] used attention networks to generate natural language descriptions from code, and Haiduc et al. [28] shared the same goal. Jiang et al. [40] used NNs to automatically generate commit messages from diffs. Piech et al. [61] translated programs into real-valued embeddings. Chen et al. [19] used tree encoders and decoders to translate programs from one language to another. Oda et al. [57] translated formal code into pseudo code. Another interesting research field applies machine learning techniques to spot bugs, as in the work of Dam et al. [22] and Shi et al. [68]. Related to bug detection, much work focuses on program repair. For example, Jiang et al. [39] used a transformer-based architecture (GPT) to pre-train a large language model for the task, and a recurrent neural architecture was used for the same task by Wang et al. [73]. All of these approaches exploit code representation and machine learning, and represent potential application domains for fold2vec. A different type of code representation, based on code updates, was developed by Hoang et al. [32]. Last, Kang et al. [41], Keim et al. [43], and Rabin et al. [63] assessed the generalizability of several deep learning baselines such as code2vec, code2seq, and CodeBERT.
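As a point of reference for the pre-training objective mentioned above, the masked language modeling loss behind CodeBERT-style models can be sketched as follows (this is the standard formulation, not a detail specific to the cited work): given a token sequence \(x\) and a set \(M\) of masked positions, the model minimizes
\[
\mathcal{L}_{\mathrm{MLM}} = - \sum_{i \in M} \log P\bigl(x_i \mid x_{\setminus M}\bigr),
\]
that is, it learns to recover each masked token from the surrounding unmasked context; fine-tuning then adapts the pre-trained encoder to a downstream task with comparatively little labeled data.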