CoCoSoDa: Effective Contrastive Learning for Code Search

Cited by: 16
Authors:
Shi, Ensheng [1 ]
Wang, Yanlin [2 ]
Gu, Wenchao [3 ]
Du, Lun [4 ]
Zhang, Hongyu [5 ]
Han, Shi [4 ]
Zhang, Dongmei [4 ]
Sun, Hongbin [1 ]
Affiliations:
[1] Xi An Jiao Tong Univ, Xian, Peoples R China
[2] Sun Yat Sen Univ, Sch Software Engn, Guangzhou, Peoples R China
[3] Chinese Univ Hong Kong, Hong Kong, Peoples R China
[4] Microsoft Res, Beijing, Peoples R China
[5] Chongqing Univ, Chongqing, Peoples R China
Funding:
National Key R&D Program of China;
Keywords:
code search; contrastive learning; soft data augmentation; momentum mechanism; COMPLETION;
DOI
10.1109/ICSE48619.2023.00185
Chinese Library Classification (CLC):
TP31 [Computer Software];
Discipline codes:
081202 ; 0835 ;
Abstract
Code search aims to retrieve semantically relevant code snippets for a given natural language query. Recently, many approaches employing contrastive learning have shown promising results on code representation learning and have greatly improved the performance of code search. However, there is still much room for improvement in applying contrastive learning to code search. In this paper, we propose CoCoSoDa to effectively utilize contrastive learning for code search via two key factors of contrastive learning: data augmentation and negative samples. Specifically, soft data augmentation dynamically masks some tokens in an input sequence or replaces them with their types to generate positive samples. A momentum mechanism generates large and consistent sets of negative-sample representations in a mini-batch by maintaining a queue and a momentum encoder. In addition, multimodal contrastive learning pulls together the representations of paired code snippets and queries and pushes apart those of unpaired ones. We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset covering six programming languages. Experimental results show that: (1) CoCoSoDa outperforms 18 baselines, in particular exceeding CodeBERT, GraphCodeBERT, and UniXcoder by 13.3%, 10.5%, and 5.9% in average MRR, respectively. (2) Ablation studies show the effectiveness of each component of our approach. (3) Adapting our techniques to several pre-trained models such as RoBERTa, CodeBERT, and GraphCodeBERT yields a significant boost in their code search performance. (4) Our model performs robustly under different hyper-parameters. Furthermore, we perform qualitative and quantitative analyses to explore the reasons behind the good performance of our model.
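The two key ingredients described in the abstract can be sketched in a few lines of illustrative Python. The function names, the 15% augmentation rate, the 50/50 mask-vs-replace split, and the toy parameter lists below are assumptions for exposition, not the paper's actual configuration:

```python
import random

MASK = "<mask>"

def soft_augment(tokens, token_types, p=0.15, rng=None):
    """Create a positive sample via soft data augmentation: each token
    is, with probability p, either masked or replaced by its type tag
    (e.g. the identifier "add" becomes "<identifier>")."""
    rng = rng or random.Random()
    out = []
    for tok, typ in zip(tokens, token_types):
        if rng.random() < p:
            # 50/50 choice between masking and type replacement (assumed split)
            out.append(MASK if rng.random() < 0.5 else f"<{typ}>")
        else:
            out.append(tok)
    return out

def momentum_update(key_params, query_params, m=0.999):
    """MoCo-style momentum update of the key (negative-sample) encoder:
    theta_k <- m * theta_k + (1 - m) * theta_q, shown here on plain
    floats rather than real model weights."""
    return [m * k + (1 - m) * q for k, q in zip(key_params, query_params)]
```

Because the key encoder changes slowly (m close to 1), representations stored in the negative-sample queue remain consistent with each other across mini-batches, which is the property the momentum mechanism is meant to provide.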
Pages: 2198-2210
Page count: 13