[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/3387904.3389269acmconferencesArticle/Chapter ViewAbstractPublication PagesicseConference Proceedingsconference-collections
research-article

Improving Code Search with Co-Attentive Representation Learning

Published: 12 September 2020 Publication History

Abstract

Searching and reusing existing code from a large-scale codebase, e.g, GitHub, can help developers complete a programming task efficiently. Recently, Gu et al. proposed a deep learning-based model (i.e., DeepCS), which significantly outperformed prior models. The DeepCS embedded codebase and natural language queries into vectors by two LSTM (long and short-term memory) models separately, and returned developers the code with higher similarity to a code search query. However, such embedding method learned two isolated representations for code and query but ignored their internal semantic correlations. As a result, the learned isolated representations of code and query may limit the effectiveness of code search.
To address the aforementioned issue, we propose a co-attentive representation learning model, i.e., Co-Attentive Representation Learning Code Search-CNN (CARLCS-CNN). CARLCS-CNN learns interdependent representations for the embedded code and query with a co-attention mechanism. Generally, such mechanism learns a correlation matrix between embedded code and query, and co-attends their semantic relationship via row/column-wise max-pooling. In this way, the semantic correlation between code and query can directly affect their individual representations. We evaluate the effectiveness of CARLCS-CNN on Gu et al.'s dataset with 10k queries. Experimental results show that the proposed CARLCS-CNN model significantly outperforms DeepCS by 26.72% in terms of MRR (mean reciprocal rank). Additionally, CARLCS-CNN is five times faster than DeepCS in model training and four times in testing.

References

[1]
2019. GitHub. Retrieved November 5, 2019 from https://www.github.com
[2]
Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2017. Learning bilingual word embeddings with (almost) no bilingual data. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 451--462.
[3]
Dzmitry Bahdanau, Kyunghyun Cho, and Y. Bengio. 2014. Neural Machine Translation by Jointly Learning to Align and Translate. computer science 1409 (09 2014).
[4]
Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5 (2017), 135--146.
[5]
Joel Brandt, Mira Dontcheva, Marcos Weskamp, and Scott R Klemmer. 2010. Example-centric programming: integrating web search into the development environment. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 513--522.
[6]
Joel Brandt, Philip J Guo, Joel Lewenstein, Mira Dontcheva, and Scott R Klemmer. 2009. Two studies of opportunistic programming: interleaving web foraging, learning, and writing code. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 1589--1598.
[7]
Jose Cambronero, Hongyu Li, Seohyun Kim, Koushik Sen, and Satish Chandra. 2019. When deep learning met code search. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 964--974.
[8]
Claudio Carpineto and Giovanni Romano. 2012. A Survey of Automatic Query Expansion in Information Retrieval. ACM Comput. Surv. 44, 1, Article 1 (Jan. 2012), 50 pages. https://doi.org/10.1145/2071389.2071390
[9]
Wing-Kwan Chan, Hong Cheng, and David Lo. 2012. Searching Connected API Subgraph via Text Phrases. In Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering (FSE '12). ACM, New York, NY, USA, Article 10, 11 pages. https://doi.org/10.1145/2393596.2393606
[10]
Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2017. Word translation without parallel data. (2017).
[11]
G. W. Furnas, T. K. Landauer, L. M. Gomez, and S. T. Dumais. 1987. The Vocabulary Problem in Human-system Communication. Commun. ACM 30, 11 (Nov. 1987), 964--971. https://doi.org/10.1145/32206.32212
[12]
Edouard Grave, Armand Joulin, and Quentin Berthet. 2018. Unsupervised alignment of embeddings with Wasserstein Procrustes. (2018).
[13]
Karol Gregor, Ivo Danihelka, Alex Graves, and Daan Wierstra. 2015. DRAW: A Recurrent Neural Network For Image Generation. CoRR abs/1502.04623 (2015). arXiv:1502.04623
[14]
Xiaodong Gu, Hongyu Zhang, Dongmei Zhang, and Sunghun Kim. 2016. Deep API learning. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering. 631--642.
[15]
Sonia Haiduc, Gabriele Bavota, Andrian Marcus, Rocco Oliveto, Andrea De Lucia, and Tim Menzies. 2013. Automatic query reformulations for text retrieval in software engineering. In Proceedings of the 2013 International Conference on Software Engineering. IEEE Press, 842--851.
[16]
Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in neural information processing systems. 1693--1701.
[17]
Emily Hill, Manuel R.Vega, Jerry A.Fails, and Greg Mallet. 2014. NL-based query refinement and contextualized code search results: A user study. In 2014 Software Evolution Week - IEEE Conference on Software Maintenance, Reengineering, and Reverse Engineering (CSMR-WCRE). 34--43. https://doi.org/10.1109/CSMR-WCRE.2014.6747190
[18]
Raphael Hoffmann, James Fogarty, and Daniel S Weld. 2007. Assieme: finding and leveraging implicit references in a web search interface for programmers. In Proceedings of the 20th annual ACM symposium on User interface software and technology. ACM, 13--22.
[19]
Reid Holmes, Rylan Cottrell, Robert J Walker, and Jorg Denzinger. 2009. The end-to-end use of source code examples: An exploratory study. In 2009 IEEE International Conference on Software Maintenance. IEEE, 555--558.
[20]
Xuan Huo, Ming Li, and Zhi-Hua Zhou. 2016. Learning Unified Features from Natural and Programming Languages for Locating Buggy Source Code. In IJCAI. 1606--1612.
[21]
Hamel Husain and Ho-Hsiang Wu. 2018. How to create natural language semantic search for arbitrary objects with deep learning. Retrieved November 5, 2019 from https://towardsdatascience.com/semantic-code-search-3cd6d244a39c
[22]
Hamel Husain and Ho-Hsiang Wu. 2018. Towards natural language semantic code search. Retrieved November 5, 2019 from https://githubengineering.com/towards-natural-languagesemantic-code-search/
[23]
Iman Keivanloo, Juergen Rilling, and Ying Zou. 2014. Spotting working code examples. In Proceedings of the 36th International Conference on Software Engineering. ACM, 664--675.
[24]
Katja Kevic and Thomas Fritz. 2014. Automatic search term identification for change tasks. In Companion Proceedings of the 36th International Conference on Software Engineering. ACM, 468--471.
[25]
Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[26]
Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86, 11 (1998), 2278--2324.
[27]
Lei Li, Ruihai Dong, and Li Chen. 2019. Context-Aware Co-attention Neural Network for Service Recommendations. In 2019 IEEE 35th International Conference on Data Engineering Workshops (ICDEW). IEEE, 201--208.
[28]
Erik Linstead, Sushil Bajracharya, Trung Ngo, Paul Rigor, Cristina Lopes, and Pierre Baldi. 2009. Sourcerer: mining and searching internet-scale software repositories. Data Mining and Knowledge Discovery 18, 2 (2009), 300--336.
[29]
Hong Liu, Rongrong Ji, Yongjian Wu, and Wei Liu. 2016. Towards optimal binary code learning via ordinal embedding. In Thirtieth AAAI Conference on Artificial Intelligence.
[30]
Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. 2016. Hierarchical Question-Image Co-Attention for Visual Question Answering. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Eds.). Curran Associates, Inc., 289--297. http://papers.nips.cc/paper/6202-hierarchical-question-image-co-attention-for-visual-question-answering.pdf
[31]
Meili Lu, Xiaobing Sun, Shaowei Wang, David Lo, and Yucong Duan. 2015. Query expansion via WordNet for effective code search. 2015 IEEE 22nd International Conference on Software Analysis, Evolution, and Reengineering, SANER 2015 - Proceedings (04 2015), 545--549. https://doi.org/10.1109/SANER.2015.7081874
[32]
Fei Lv, Hongyu Zhang, Jian-guang Lou, Shaowei Wang, Dongmei Zhang, and Jianjun Zhao. 2015. Codehow: Effective code search based on api understanding and extended boolean model (e). In 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 260--270.
[33]
Collin McMillan, Mark Grechanik, Denys Poshyvanyk, Qing Xie, and Chen Fu. 2011. Portfolio: Finding relevant functions and their usage. Proceedings - International Conference on Software Engineering, 111--120. https://doi.org/10.1145/1985793.1985809
[34]
Volodymyr Mnih, Nicolas Heess, Alex Graves, and koray kavukcuoglu. 2014. Recurrent Models of Visual Attention. In Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.). Curran Associates, Inc., 2204--2212. http://papers.nips.cc/paper/5542-recurrent-models-of-visual-attention.pdf
[35]
Vijayaraghavan Murali, Swarat Chaudhuri, and Chris Jermaine. 2017. Bayesian sketch learning for program synthesis. CoRR, abs/1703.05698 (2017).
[36]
Hyeonseob Nam, Jung-Woo Ha, and Jeonghee Kim. 2017. Dual Attention Networks for Multimodal Reasoning and Matching. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[37]
Andy Nguyen, Christopher Piech, Jonathan Huang, and Leonidas Guibas. 2014. Codewebs: scalable homework search for massive open online programming courses. In Proceedings of the 23rd international conference on World wide web. ACM, 491--502.
[38]
Haoran Niu, Iman Keivanloo, and Ying Zou. 2017. Learning to rank code examples for code search engines. Empirical Software Engineering 22, 1 (2017), 259--291.
[39]
Mukund Raghothaman, Yi Wei, and Youssef Hamadi. 2016. Swim: Synthesizing what i mean-code search and idiomatic snippet synthesis. In 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE). IEEE, 357--367.
[40]
Md Masudur Rahman, Jed Barson, Sydney Paul, Joshua Kayani, Federico Andrés Lois, Sebastián Fernandez Quezada, Christopher Parnin, Kathryn T Stolee, and Baishakhi Ray. 2018. Evaluating how developers use general-purpose web-search for code retrieval. In Proceedings of the 15th International Conference on Mining Software Repositories. ACM, 465--475.
[41]
Mohammad M. Rahman, Chanchal K. Roy, and David Lo. 2019. Automatic query reformulation for code search using crowdsourced knowledge. Empirical Software Engineering 24, 4 (01 Aug 2019), 1869--1924. https://doi.org/10.1007/s10664-018-9671-0
[42]
Steven P. Reiss. 2009. Semantics-based Code Search. In Proceedings of the 31st International Conference on Software Engineering (ICSE '09). IEEE Computer Society, Washington, DC, USA, 243--253. https://doi.org/10.1109/ICSE.2009.5070525
[43]
Martin P Robillard. 2009. What makes APIs hard to learn? Answers from developers. IEEE software 26, 6 (2009), 27--34.
[44]
Tim Rocktäschel, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočisky, and Phil Blunsom. 2015. Reasoning about entailment with neural attention. arXiv preprint arXiv:1509.06664 (2015).
[45]
Saksham Sachdev, Hongyu Li, Sifei Luan, Seohyun Kim, Koushik Sen, and Satish Chandra. 2018. Retrieval on source code: a neural code search. In Proceedings of the 2nd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages. 31--41.
[46]
Caitlin Sadowski, Kathryn T. Stolee, and Sebastian Elbaum. 2015. How Developers Search for Code: A Case Study. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering (ESEC/FSE 2015). ACM, New York, NY, USA, 191--201. https://doi.org/10.1145/2786805.2786855
[47]
Kevin J. Shih, Saurabh Singh, and Derek Hoiem. 2016. Where to Look: Focus Regions for Visual Question Answering. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[48]
Marijn F Stollenga, Jonathan Masci, Faustino Gomez, and Jürgen Schmidhuber. 2014. Deep Networks with Internal Selective Attention through Feedback Connections. In Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.). Curran Associates, Inc., 3545--3553. http://papers.nips.cc/paper/5276-deep-networks-with-internal-selective-attention-through-feedback-connections.pdf
[49]
Jeffrey Stylos and Brad A. Myers. 2006. Mica: A Web-Search Tool for Finding API Components and Examples. In 2006 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC 2006), 4-8 September 2006, Brighton, UK.
[50]
Yao Wan, Jingdong Shu, Yulei Sui, Guandong Xu, Zhou Zhao, Jian Wu, and Philip Yu. 2019. Multi-modal attention network learning for semantic source code retrieval. In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 13--25.
[51]
Yao Wan, Zhou Zhao, Min Yang, Guandong Xu, Haochao Ying, Jian Wu, and Philip S Yu. 2018. Improving automatic source code summarization via deep reinforcement learning. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering (ASE). 397--407.
[52]
Yining Wang, Liwei Wang, Yuanzhi Li, Di He, and Tie-Yan Liu. 2013. A theoretical analysis of NDCG type ranking measures. In Conference on Learning Theory. 25--54.
[53]
Bolin Wei, Ge Li, Xin Xia, Zhiyi Fu, and Zhi Jin. 2019. Code Generation as a Dual Task of Code Summarization. In Advances in Neural Information Processing Systems. 6559--6569.
[54]
Martin White, Michele Tufano, Christopher Vendome, and Denys Poshyvanyk. 2016. Deep learning code fragments for code clone detection. In the 31st IEEE/ACM International Conference.
[55]
Frank Wilcoxon. 1992. Individual comparisons by ranking methods. In Breakthroughs in statistics. Springer, 196--202.
[56]
Gu Xiaodong, Zhang Hongyu, and Kim Sunghun. 2018. Deep Code Search. In 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE). 933--944. https://doi.org/10.1145/3180155.3180167
[57]
Caiming Xiong, Stephen Merity, and Richard Socher. 2016. Dynamic memory networks for visual and textual question answering. (2016), 2397--2406.
[58]
Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola. 2016. Stacked Attention Networks for Image Question Answering. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[59]
Xuchen Yao, Benjamin Van Durme, and Peter Clark. 2013. Automatic Coupling of Answer Extraction and Information Retrieval. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics, Sofia, Bulgaria, 159--165. https://www.aclweb.org/anthology/P13-2029
[60]
Ziyu Yao, Jayavardhan Reddy Peddamail, and Huan Sun. 2019. CoaCor: Code Annotation for Code Retrieval with Reinforcement Learning. In The World Wide Web Conference. ACM, 2203--2214.
[61]
Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. 2017. Comparative study of CNN and RNN for natural language processing. arXiv preprint arXiv:1702.01923 (2017).
[62]
Wenpeng Yin, Hinrich Schütze, Bing Xiang, and Bowen Zhou. 2016. Abcnn: Attention-based convolutional neural network for modeling sentence pairs. Transactions of the Association for Computational Linguistics 4 (2016), 259--272.
[63]
Zhou Yu, Jun Yu, Chenchao Xiang, Jianping Fan, and Dacheng Tao. 2018. Beyond Bilinear: Generalized Multimodal Factorized High-Order Pooling for Visual Question Answering. IEEE Transactions on Neural Networks and Learning Systems 29 (04 2018), 5947--5959. https://doi.org/10.1109/TNNLS.2018.2817340
[64]
Feng Zhang, Haoran Niu, Iman Keivanloo, and Ying Zou. 2017. Expanding queries for code search using semantically related api class-names. IEEE Transactions on Software Engineering 44, 11 (2017), 1070--1082.

Cited By

View all
  • (2025)An intent-enhanced feedback extension model for code searchInformation and Software Technology10.1016/j.infsof.2024.107589177(107589)Online publication date: Jan-2025
  • (2024)I2R: Intra and inter-modal representation learning for code searchIntelligent Data Analysis10.3233/IDA-23008228:3(807-823)Online publication date: 28-May-2024
  • (2024)A Combined Alignment Model for Code SearchIEICE Transactions on Information and Systems10.1587/transinf.2023MPP0002E107.D:3(257-267)Online publication date: 1-Mar-2024
  • Show More Cited By

Index Terms

  1. Improving Code Search with Co-Attentive Representation Learning

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    ICPC '20: Proceedings of the 28th International Conference on Program Comprehension
    July 2020
    481 pages
    ISBN:9781450379588
    DOI:10.1145/3387904
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Sponsors

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 12 September 2020

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. co-attention mechanism
    2. code search
    3. representation learning

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    ICPC '20
    Sponsor:

    Upcoming Conference

    ICSE 2025

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)64
    • Downloads (Last 6 weeks)5
    Reflects downloads up to 12 Jan 2025

    Other Metrics

    Citations

    Cited By

    View all
    • (2025)An intent-enhanced feedback extension model for code searchInformation and Software Technology10.1016/j.infsof.2024.107589177(107589)Online publication date: Jan-2025
    • (2024)I2R: Intra and inter-modal representation learning for code searchIntelligent Data Analysis10.3233/IDA-23008228:3(807-823)Online publication date: 28-May-2024
    • (2024)A Combined Alignment Model for Code SearchIEICE Transactions on Information and Systems10.1587/transinf.2023MPP0002E107.D:3(257-267)Online publication date: 1-Mar-2024
    • (2024)Intelligent code search aids edge software developmentJournal of Cloud Computing10.1186/s13677-024-00629-513:1Online publication date: 1-Apr-2024
    • (2024)Unsupervised and Supervised Co-learning for Comment-based Codebase Refining and its Application in Code SearchProceedings of the 18th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement10.1145/3674805.3686664(1-12)Online publication date: 24-Oct-2024
    • (2024)An Empirical Study on Code Search Pre-trained Models: Academic Progresses vs. Industry RequirementsProceedings of the 15th Asia-Pacific Symposium on Internetware10.1145/3671016.3672580(41-50)Online publication date: 24-Jul-2024
    • (2024)An Empirical Study of Code Search in Intelligent Coding Assistant: Perceptions, Expectations, and DirectionsCompanion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering10.1145/3663529.3663848(283-293)Online publication date: 10-Jul-2024
    • (2024)A Survey of Source Code Search: A 3-Dimensional PerspectiveACM Transactions on Software Engineering and Methodology10.1145/365634133:6(1-51)Online publication date: 28-Jun-2024
    • (2024)RAPID: Zero-Shot Domain Adaptation for Code Search with Pre-Trained ModelsACM Transactions on Software Engineering and Methodology10.1145/364154233:5(1-35)Online publication date: 18-Jan-2024
    • (2024)Fusing Code SearchersIEEE Transactions on Software Engineering10.1109/TSE.2024.340304250:7(1852-1866)Online publication date: 1-Jul-2024
    • Show More Cited By

    View Options

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media