XCODE: Towards Cross-Language Code Representation with Large-Scale Pre-Training

Cited by: 5
Authors
Lin, Zehao [1 ]
Li, Guodun [1 ]
Zhang, Jingfeng [1 ]
Deng, Yue [1 ]
Zeng, Xiangji [1 ]
Zhang, Yin [1 ]
Wan, Yao [2 ]
Affiliations
[1] Zhejiang Univ, Coll Comp Sci & Technol, 38 Zheda Rd, Hangzhou 310027, Zhejiang, Peoples R China
[2] Huazhong Univ Sci & Technol, Sch Comp Sci & Tech, Wuhan 430027, Hubei, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Deep learning; neural networks; code representation; cross-language; pre-training; LEARN;
DOI
10.1145/3506696
CLC Number
TP31 [Computer Software];
Discipline Code
081202; 0835;
Abstract
Source code representation learning is the basis of applying artificial intelligence to many software engineering tasks such as code clone detection, algorithm classification, and code summarization. Recently, many works have tried to improve source code representation from various perspectives, e.g., by introducing the structural information of programs into the latent representation. However, when dealing with rapidly expanding, unlabeled, cross-language source code datasets from the Internet, two issues remain. First, deep learning models for many code-specific tasks still suffer from the lack of high-quality labels. Second, the structural differences among programming languages make it more difficult to process multiple languages in a single neural architecture. To address these issues, in this article we propose XCode, a novel method for Cross-language Code representation with large-scale pre-training. Concretely, we use abstract syntax trees and ELMo-enhanced variational autoencoders to obtain multiple pre-trained source code language models trained on about 1.5 million code snippets. To fully utilize the knowledge across programming languages, we further propose a Shared Encoder-Decoder (SED) architecture that uses a multi-teacher single-student method to transfer knowledge from the aforementioned pre-trained models to the distilled SED. The pre-trained models and the SED cooperate to better represent the source code. For evaluation, we examine our approach on three typical downstream cross-language tasks, i.e., source code translation, code clone detection, and code-to-code search, on a real-world dataset composed of programming exercises with multiple solutions. Experimental results demonstrate the effectiveness of our approach on cross-language code representation; it also performs significantly better than several code representation baselines on different downstream tasks in terms of multiple automatic evaluation metrics.
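To make the multi-teacher single-student distillation idea described in the abstract more concrete, the following is a minimal PyTorch sketch: one frozen teacher encoder per programming language transfers its code representations into a single Shared Encoder-Decoder (SED) student through a mean-squared-error loss. The toy GRU encoders, the hidden sizes, the language list, and the MSE objective are all illustrative assumptions for this sketch; the paper's actual teachers are ELMo-enhanced variational autoencoders trained on abstract syntax trees, and its distillation objective may differ.

# A minimal, self-contained sketch of multi-teacher single-student distillation,
# under assumed module sizes and a toy MSE objective (not the paper's exact setup).
import torch
import torch.nn as nn

VOCAB_SIZE, EMBED_DIM, HIDDEN_DIM = 1000, 64, 128
LANGUAGES = ["java", "python", "cpp"]  # assumed set of source languages


class TeacherEncoder(nn.Module):
    """Stand-in for one per-language pre-trained code encoder (kept frozen)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
        self.rnn = nn.GRU(EMBED_DIM, HIDDEN_DIM, batch_first=True)

    def forward(self, tokens):
        _, h = self.rnn(self.embed(tokens))
        return h.squeeze(0)  # (batch, HIDDEN_DIM) code representation


class SharedEncoderDecoder(nn.Module):
    """Single student shared across languages; only its encoder is distilled here."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
        self.encoder = nn.GRU(EMBED_DIM, HIDDEN_DIM, batch_first=True)
        self.decoder = nn.GRU(EMBED_DIM, HIDDEN_DIM, batch_first=True)
        self.out = nn.Linear(HIDDEN_DIM, VOCAB_SIZE)

    def encode(self, tokens):
        _, h = self.encoder(self.embed(tokens))
        return h.squeeze(0)


teachers = {lang: TeacherEncoder().eval() for lang in LANGUAGES}
for teacher in teachers.values():
    for p in teacher.parameters():
        p.requires_grad_(False)  # teachers stay fixed during distillation

student = SharedEncoderDecoder()
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
mse = nn.MSELoss()

# One toy distillation step over a mixed-language mini-batch of fake token ids.
for lang in LANGUAGES:
    tokens = torch.randint(0, VOCAB_SIZE, (8, 20))
    with torch.no_grad():
        target = teachers[lang](tokens)              # teacher representation
    loss = mse(student.encode(tokens), target)       # pull student toward teacher
    loss.backward()                                  # accumulate gradients per language
optimizer.step()
optimizer.zero_grad()
print("distillation step done")

In this sketch the gradients from all three language teachers are accumulated before a single optimizer step, so the shared student is nudged toward every teacher at once; the real training pipeline may batch and weight the languages differently.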
Pages: 44