Pre-training Method for Enhanced Code Representation Based on Multimodal Contrastive Learning

Cited by: 0
Authors
Yang H.-Y. [1 ,2 ,4 ]
Ma J.-H. [1 ,2 ,4 ]
Hou M. [1 ,3 ,4 ]
Shen S.-H. [1 ,3 ,4 ]
Chen E.-H. [1 ,3 ,4 ]
Affiliations
[1] Anhui Province Key Laboratory of Big Data Analysis and Application, University of Science and Technology of China, Hefei
[2] School of Computer Science and Technology, University of Science and Technology of China, Hefei
[3] School of Data Science, University of Science and Technology of China, Hefei
[4] State Key Laboratory of Cognitive Intelligence, Hefei
Source
Ruan Jian Xue Bao/Journal of Software | 2024 / Vol. 35 / No. 4
Keywords
code representation; contrastive learning; multimodal; pre-trained model;
DOI
10.13328/j.cnki.jos.007016
Abstract
Code representation aims to extract the characteristics of source code and obtain its semantic embedding, and it plays a crucial role in deep-learning-based code intelligence. Traditional handcrafted code representation methods rely heavily on annotation by domain experts, which is time-consuming and labor-intensive; moreover, the resulting representations are task-specific and cannot easily be reused for other downstream tasks, which runs counter to green and sustainable development. To this end, many large-scale pre-trained models for source code representation have achieved remarkable success in recent years. These methods perform self-supervised learning on massive source code to obtain universal code representations, which can then be easily fine-tuned for various downstream tasks. Based on the abstraction levels of programming languages, code representations comprise features at four levels: text, semantic, functional, and structural. Nevertheless, current code representation models treat programming languages merely as ordinary text sequences resembling natural language and overlook the functional-level and structural-level features, which results in inferior performance. To overcome this drawback, this study proposes REcomp, a representation-enhanced contrastive multimodal pre-training framework for code representation. REcomp introduces a novel algorithm that fuses semantic-level and structural-level features by serializing abstract syntax trees; through multimodal contrastive learning, this composite feature is integrated with the textual and functional features of programming languages, enabling more precise semantic modeling. Extensive experiments on three real-world public datasets clearly validate the superiority of REcomp. © 2024 Chinese Academy of Sciences. All rights reserved.
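The abstract describes two general techniques: serializing abstract syntax trees into a structure-aware sequence, and aligning multiple code modalities via contrastive learning. The sketch below is only an illustrative analogue of those generic techniques, not the authors' REcomp implementation; the names `serialize_ast` and `info_nce`, and the toy unit-vector embeddings, are hypothetical and introduced here for demonstration.

```python
import ast
import math

def serialize_ast(source: str) -> list[str]:
    """Serialize a Python AST into a node-type token sequence by
    pre-order traversal (a simplified structure-level feature)."""
    tokens: list[str] = []

    def visit(node: ast.AST) -> None:
        tokens.append(type(node).__name__)
        for child in ast.iter_child_nodes(node):
            visit(child)

    visit(ast.parse(source))
    return tokens

def info_nce(anchors: list[list[float]],
             positives: list[list[float]],
             tau: float = 0.07) -> float:
    """InfoNCE contrastive loss over paired embeddings: each anchor's
    positive is the same-index entry of `positives`; all other entries
    serve as in-batch negatives. Embeddings are assumed L2-normalized."""
    def dot(u: list[float], v: list[float]) -> float:
        return sum(a * b for a, b in zip(u, v))

    n = len(anchors)
    loss = 0.0
    for i in range(n):
        logits = [dot(anchors[i], positives[j]) / tau for j in range(n)]
        m = max(logits)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += -(logits[i] - log_denom)
    return loss / n
```

In a multimodal setting, one encoder would embed the serialized AST and another the raw token sequence; `info_nce` then pulls matched (AST, text) pairs together and pushes mismatched pairs apart, so the loss is near zero when paired embeddings coincide and grows when they are misaligned.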
Pages: 1601-1617 (16 pages)