CoCoSoDa: Effective Contrastive Learning for Code Search

Times Cited: 16
Authors
Shi, Ensheng [1 ]
Wang, Yanlin [2 ]
Gu, Wenchao [3 ]
Du, Lun [4 ]
Zhang, Hongyu [5 ]
Han, Shi [4 ]
Zhang, Dongmei [4 ]
Sun, Hongbin [1 ]
Affiliations
[1] Xi An Jiao Tong Univ, Xian, Peoples R China
[2] Sun Yat Sen Univ, Sch Software Engn, Guangzhou, Peoples R China
[3] Chinese Univ Hong Kong, Hong Kong, Peoples R China
[4] Microsoft Res, Beijing, Peoples R China
[5] Chongqing Univ, Chongqing, Peoples R China
Funding
National Key Research and Development Program of China;
Keywords
code search; contrastive learning; soft data augmentation; momentum mechanism; COMPLETION;
DOI
10.1109/ICSE48619.2023.00185
Chinese Library Classification (CLC) Number
TP31 [Computer Software];
Discipline Classification Code
081202 ; 0835 ;
Abstract
Code search aims to retrieve semantically relevant code snippets for a given natural language query. Recently, many approaches employing contrastive learning have shown promising results on code representation learning and have greatly improved code search performance. However, there is still much room for improvement in using contrastive learning for code search. In this paper, we propose CoCoSoDa to effectively utilize contrastive learning for code search via two key factors in contrastive learning: data augmentation and negative samples. Specifically, soft data augmentation dynamically masks some tokens or replaces them with their type labels in the input sequences to generate positive samples. A momentum mechanism is used to provide a large and consistent set of negative-sample representations beyond a single mini-batch by maintaining a queue and a momentum encoder. In addition, multimodal contrastive learning is used to pull together the representations of paired code snippets and queries and push apart those of unpaired ones. We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset covering six programming languages. Experimental results show that: (1) CoCoSoDa outperforms 18 baselines, exceeding CodeBERT, GraphCodeBERT, and UniXcoder by 13.3%, 10.5%, and 5.9% in average MRR, respectively. (2) Ablation studies demonstrate the effectiveness of each component of our approach. (3) Adapting our techniques to several pre-trained models such as RoBERTa, CodeBERT, and GraphCodeBERT yields a significant boost in their code search performance. (4) Our model performs robustly under different hyper-parameter settings. Furthermore, we perform qualitative and quantitative analyses to explore the reasons behind its good performance.
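The abstract outlines three mechanisms: soft data augmentation, a momentum encoder with a negative-sample queue, and multimodal (query-code) contrastive learning. Below is a minimal PyTorch-style sketch of these ideas, not the authors' released implementation; the encoder interface (one pooled vector per input), the token/type-id handling, and all names and hyper-parameter values (soft_data_augmentation, encoder_q, p, queue_size, m, tau) are illustrative assumptions.

```python
# Hedged sketch of the abstract's three ideas; all identifiers are hypothetical.
import copy
import random
import torch
import torch.nn.functional as F

def soft_data_augmentation(token_ids, type_ids, mask_id, p=0.15):
    """Dynamically mask some tokens or replace them with their type id
    to create a positive view of the same code/query sequence."""
    augmented = token_ids.clone()
    for i in range(augmented.size(0)):
        if random.random() < p:
            # Half of the selected positions are masked, the rest are
            # replaced by the token's (assumed) type id.
            augmented[i] = mask_id if random.random() < 0.5 else type_ids[i]
    return augmented

class MomentumContrast(torch.nn.Module):
    """Online encoder is trained by gradient; the key (momentum) encoder is an
    exponential moving average, and a FIFO queue stores past key representations
    so every step sees many consistent negatives."""
    def __init__(self, encoder, dim=768, queue_size=4096, m=0.999, tau=0.05):
        super().__init__()
        self.encoder_q = encoder
        self.encoder_k = copy.deepcopy(encoder)   # momentum encoder
        for param in self.encoder_k.parameters():
            param.requires_grad = False
        self.m, self.tau = m, tau
        self.register_buffer("queue", F.normalize(torch.randn(queue_size, dim), dim=1))

    @torch.no_grad()
    def _momentum_update(self):
        for pq, pk in zip(self.encoder_q.parameters(), self.encoder_k.parameters()):
            pk.data.mul_(self.m).add_(pq.data, alpha=1.0 - self.m)

    def forward(self, query_inputs, code_inputs):
        # Encode queries with the online encoder and code with the momentum
        # encoder (one direction shown for brevity); both return [B, dim].
        q = F.normalize(self.encoder_q(query_inputs), dim=1)
        with torch.no_grad():
            self._momentum_update()
            k = F.normalize(self.encoder_k(code_inputs), dim=1)
        pos = torch.einsum("bd,bd->b", q, k).unsqueeze(1)                 # paired code
        neg = torch.einsum("bd,nd->bn", q, self.queue.clone().detach())  # queued negatives
        logits = torch.cat([pos, neg], dim=1) / self.tau
        labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
        loss = F.cross_entropy(logits, labels)  # InfoNCE: pull pairs together, push others apart
        # Enqueue the newest keys and drop the oldest ones (FIFO).
        self.queue = torch.cat([k.detach(), self.queue], dim=0)[: self.queue.size(0)]
        return loss
```

The key design point illustrated here is that the queue decouples the number of negative samples from the mini-batch size, while the slowly updated momentum encoder keeps those queued representations consistent with the current online encoder.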
Pages: 2198-2210
Number of Pages: 13