
Key-based data augmentation with curriculum learning for few-shot code search

Published: 20 November 2024

Abstract

Given a natural language query, code search aims to find matching code snippets in a codebase. Recent approaches are designed mainly for mainstream programming languages, which have large amounts of training data. However, code search is also needed for domain-specific programming languages, for which far less training data is available, and labeling a large training set for each domain-specific language is a heavy burden. To this end, we propose DAFCS, a data augmentation framework with curriculum learning for few-shot code search. Specifically, we first collect unlabeled code in the same programming language as the original code, which provides additional semantic signals. Second, we employ an occlusion-based method to identify key statements in code fragments. Third, we design a set of new key-based augmentation operations for the original code. Finally, we use curriculum learning to schedule the augmented samples so that a well-performing model can be trained. We conduct retrieval experiments on a public dataset and find that DAFCS surpasses state-of-the-art methods by 5.42% and 5.05% on the Solidity and SQL domain-specific languages, respectively. Our study shows that DAFCS, with its data augmentation and curriculum learning strategies, achieves promising performance on few-shot code search tasks.
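The occlusion-based idea can be sketched as follows: remove one statement at a time and measure how much the snippet's embedding changes; statements whose removal shifts the embedding most are treated as key. This is a minimal illustration only — `toy_embed` is a stand-in bag-of-characters encoder, not the paper's actual model, and the scoring function is an assumption.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def occlusion_scores(statements, embed):
    """Score each statement by how much occluding (removing) it changes
    the whole-snippet embedding: a larger similarity drop = more 'key'."""
    full = embed("\n".join(statements))
    scores = []
    for i in range(len(statements)):
        occluded = statements[:i] + statements[i + 1:]
        sim = cosine(full, embed("\n".join(occluded)))
        scores.append(1.0 - sim)  # importance = drop in similarity
    return scores

def toy_embed(code):
    # Toy stand-in encoder: bag-of-characters count vector.
    vec = [0.0] * 128
    for ch in code:
        vec[ord(ch) % 128] += 1.0
    return vec

snippet = ["def add(a, b):", "    # helper", "    return a + b"]
scores = occlusion_scores(snippet, toy_embed)
print(scores)
```

With a real code encoder in place of `toy_embed`, the highest-scoring statements would be the candidates that key-based augmentation operations should preserve.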
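The curriculum learning step can likewise be sketched as an easy-to-hard schedule over augmented samples. The difficulty measure below (how heavily a sample was perturbed) is a hypothetical stand-in; the abstract does not specify the paper's actual scheduling criterion.

```python
def curriculum_batches(samples, difficulty, batch_size):
    """Yield batches from easiest to hardest, so the model trains on
    lightly augmented samples before heavily augmented ones."""
    ordered = sorted(samples, key=difficulty)
    for i in range(0, len(ordered), batch_size):
        yield ordered[i:i + batch_size]

# Toy usage: each sample is tagged with how many statements were perturbed.
samples = [("s1", 3), ("s2", 0), ("s3", 1), ("s4", 2)]
batches = list(curriculum_batches(samples,
                                  difficulty=lambda s: s[1],
                                  batch_size=2))
print(batches)  # [[('s2', 0), ('s3', 1)], [('s4', 2), ('s1', 3)]]
```

Any monotone difficulty signal (augmentation strength, model loss, sample length) can be plugged in as `difficulty` without changing the scheduler.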



Published In

Neural Computing and Applications  Volume 37, Issue 3
Jan 2025
556 pages

Publisher

Springer-Verlag

Berlin, Heidelberg

Publication History

Accepted: 07 October 2024
Received: 26 September 2023

Author Tags

  1. Code search
  2. Data augmentation
  3. Curriculum learning
  4. Few-shot learning

Qualifiers

  • Research-article
