CoCoSoDa: Effective Contrastive Learning for Code Search

Times Cited: 16
Authors
Shi, Ensheng [1 ]
Wang, Yanlin [2 ]
Gu, Wenchao [3 ]
Du, Lun [4 ]
Zhang, Hongyu [5 ]
Han, Shi [4 ]
Zhang, Dongmei [4 ]
Sun, Hongbin [1 ]
Affiliations
[1] Xi An Jiao Tong Univ, Xian, Peoples R China
[2] Sun Yat Sen Univ, Sch Software Engn, Guangzhou, Peoples R China
[3] Chinese Univ Hong Kong, Hong Kong, Peoples R China
[4] Microsoft Res, Beijing, Peoples R China
[5] Chongqing Univ, Chongqing, Peoples R China
Funding
National Key Research and Development Program of China;
Keywords
code search; contrastive learning; soft data augmentation; momentum mechanism; COMPLETION;
DOI
10.1109/ICSE48619.2023.00185
Chinese Library Classification (CLC) Number
TP31 [Computer Software];
Discipline Classification Code
081202 ; 0835 ;
Abstract
Code search aims to retrieve semantically relevant code snippets for a given natural language query. Recently, many approaches employing contrastive learning have shown promising results on code representation learning and have greatly improved code search performance. However, there is still much room for improvement in using contrastive learning for code search. In this paper, we propose CoCoSoDa to effectively utilize contrastive learning for code search via two key factors in contrastive learning: data augmentation and negative samples. Specifically, soft data augmentation dynamically masks some tokens or replaces them with their type labels in the input sequences to generate positive samples. A momentum mechanism is used to provide a large and consistent set of negative-sample representations beyond a single mini-batch by maintaining a queue and a momentum encoder. In addition, multimodal contrastive learning is used to pull together the representations of paired code snippets and queries and push apart those of unpaired ones. We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset covering six programming languages. Experimental results show that: (1) CoCoSoDa outperforms 18 baselines, exceeding CodeBERT, GraphCodeBERT, and UniXcoder by 13.3%, 10.5%, and 5.9% in average MRR, respectively. (2) Ablation studies demonstrate the effectiveness of each component of our approach. (3) Adapting our techniques to several pre-trained models such as RoBERTa, CodeBERT, and GraphCodeBERT yields a significant boost in their code search performance. (4) Our model performs robustly under different hyper-parameter settings. Furthermore, we perform qualitative and quantitative analyses to explore the reasons behind its good performance.
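The abstract outlines three mechanisms: soft data augmentation, a momentum encoder with a negative-sample queue, and multimodal (query-code) contrastive learning. Below is a minimal PyTorch-style sketch of these ideas, not the authors' released implementation; the encoder interface (one pooled vector per input), the token/type-id handling, and all names and hyper-parameter values (soft_data_augmentation, encoder_q, p, queue_size, m, tau) are illustrative assumptions.

```python
# Hedged sketch of the abstract's three ideas; all identifiers are hypothetical.
import copy
import random
import torch
import torch.nn.functional as F

def soft_data_augmentation(token_ids, type_ids, mask_id, p=0.15):
    """Dynamically mask some tokens or replace them with their type id
    to create a positive view of the same code/query sequence."""
    augmented = token_ids.clone()
    for i in range(augmented.size(0)):
        if random.random() < p:
            # Half of the selected positions are masked, the rest are
            # replaced by the token's (assumed) type id.
            augmented[i] = mask_id if random.random() < 0.5 else type_ids[i]
    return augmented

class MomentumContrast(torch.nn.Module):
    """Online encoder is trained by gradient; the key (momentum) encoder is an
    exponential moving average, and a FIFO queue stores past key representations
    so every step sees many consistent negatives."""
    def __init__(self, encoder, dim=768, queue_size=4096, m=0.999, tau=0.05):
        super().__init__()
        self.encoder_q = encoder
        self.encoder_k = copy.deepcopy(encoder)   # momentum encoder
        for param in self.encoder_k.parameters():
            param.requires_grad = False
        self.m, self.tau = m, tau
        self.register_buffer("queue", F.normalize(torch.randn(queue_size, dim), dim=1))

    @torch.no_grad()
    def _momentum_update(self):
        for pq, pk in zip(self.encoder_q.parameters(), self.encoder_k.parameters()):
            pk.data.mul_(self.m).add_(pq.data, alpha=1.0 - self.m)

    def forward(self, query_inputs, code_inputs):
        # Encode queries with the online encoder and code with the momentum
        # encoder (one direction shown for brevity); both return [B, dim].
        q = F.normalize(self.encoder_q(query_inputs), dim=1)
        with torch.no_grad():
            self._momentum_update()
            k = F.normalize(self.encoder_k(code_inputs), dim=1)
        pos = torch.einsum("bd,bd->b", q, k).unsqueeze(1)                 # paired code
        neg = torch.einsum("bd,nd->bn", q, self.queue.clone().detach())  # queued negatives
        logits = torch.cat([pos, neg], dim=1) / self.tau
        labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
        loss = F.cross_entropy(logits, labels)  # InfoNCE: pull pairs together, push others apart
        # Enqueue the newest keys and drop the oldest ones (FIFO).
        self.queue = torch.cat([k.detach(), self.queue], dim=0)[: self.queue.size(0)]
        return loss
```

The key design point illustrated here is that the queue decouples the number of negative samples from the mini-batch size, while the slowly updated momentum encoder keeps those queued representations consistent with the current online encoder.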
Pages: 2198-2210
Number of Pages: 13