
Key-based data augmentation with curriculum learning for few-shot code search

Published: 20 November 2024

Abstract

Given a natural language query, code search aims to find matching code snippets in a codebase. Recent approaches are designed mainly for mainstream programming languages, which have large amounts of training data. However, code search is also needed for domain-specific programming languages, for which far less training data is available, and labeling a large training set for each domain-specific language is a heavy burden. To this end, we propose DAFCS, a data augmentation framework with curriculum learning for few-shot code search. Specifically, we first collect unlabeled code in the same programming language as the original code, which provides additional semantic signals. Second, we employ an occlusion-based method to identify key statements in code fragments. Third, we design a set of new key-based augmentation operations for the original code. Finally, we use curriculum learning to schedule the augmented samples so that a well-performing model can be trained. We conduct retrieval experiments on a public dataset and find that DAFCS surpasses state-of-the-art methods by 5.42% and 5.05% on the Solidity and SQL domain-specific languages, respectively. Our study shows that DAFCS, with its data augmentation and curriculum learning strategies, achieves promising performance on few-shot code search tasks.
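The occlusion-based idea can be sketched as follows: remove one statement at a time and measure how much the snippet's embedding changes; statements whose removal shifts the embedding most are treated as key. This is a minimal illustration only — `toy_embed` is a stand-in bag-of-characters encoder, not the paper's actual model, and the scoring function is an assumption.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def occlusion_scores(statements, embed):
    """Score each statement by how much occluding (removing) it changes
    the whole-snippet embedding: a larger similarity drop = more 'key'."""
    full = embed("\n".join(statements))
    scores = []
    for i in range(len(statements)):
        occluded = statements[:i] + statements[i + 1:]
        sim = cosine(full, embed("\n".join(occluded)))
        scores.append(1.0 - sim)  # importance = drop in similarity
    return scores

def toy_embed(code):
    # Toy stand-in encoder: bag-of-characters count vector.
    vec = [0.0] * 128
    for ch in code:
        vec[ord(ch) % 128] += 1.0
    return vec

snippet = ["def add(a, b):", "    # helper", "    return a + b"]
scores = occlusion_scores(snippet, toy_embed)
print(scores)
```

With a real code encoder in place of `toy_embed`, the highest-scoring statements would be the candidates that key-based augmentation operations should preserve.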
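The curriculum learning step can likewise be sketched as an easy-to-hard schedule over augmented samples. The difficulty measure below (how heavily a sample was perturbed) is a hypothetical stand-in; the abstract does not specify the paper's actual scheduling criterion.

```python
def curriculum_batches(samples, difficulty, batch_size):
    """Yield batches from easiest to hardest, so the model trains on
    lightly augmented samples before heavily augmented ones."""
    ordered = sorted(samples, key=difficulty)
    for i in range(0, len(ordered), batch_size):
        yield ordered[i:i + batch_size]

# Toy usage: each sample is tagged with how many statements were perturbed.
samples = [("s1", 3), ("s2", 0), ("s3", 1), ("s4", 2)]
batches = list(curriculum_batches(samples,
                                  difficulty=lambda s: s[1],
                                  batch_size=2))
print(batches)  # [[('s2', 0), ('s3', 1)], [('s4', 2), ('s1', 3)]]
```

Any monotone difficulty signal (augmentation strength, model loss, sample length) can be plugged in as `difficulty` without changing the scheduler.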



Published In

Neural Computing and Applications  Volume 37, Issue 3
Jan 2025
556 pages

Publisher

Springer-Verlag

Berlin, Heidelberg

Publication History

Accepted: 07 October 2024
Received: 26 September 2023

Author Tags

  1. Code search
  2. Data augmentation
  3. Curriculum learning
  4. Few-shot learning

Qualifiers

  • Research-article
