XCODE: Towards Cross-Language Code Representation with Large-Scale Pre-Training

Cited by: 5
Authors
Lin, Zehao [1 ]
Li, Guodun [1 ]
Zhang, Jingfeng [1 ]
Deng, Yue [1 ]
Zeng, Xiangji [1 ]
Zhang, Yin [1 ]
Wan, Yao [2 ]
Affiliations
[1] Zhejiang Univ, Coll Comp Sci & Technol, 38 Zheda Rd, Hangzhou 310027, Zhejiang, Peoples R China
[2] Huazhong Univ Sci & Technol, Sch Comp Sci & Tech, Wuhan 430027, Hubei, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Deep learning; neural networks; code representation; cross-language; pre-training; LEARN;
DOI
10.1145/3506696
CLC Number
TP31 [Computer Software];
Discipline Code
081202; 0835;
Abstract
Source code representation learning is the basis of applying artificial intelligence to many software engineering tasks such as code clone detection, algorithm classification, and code summarization. Recently, many works have tried to improve source code representation from various perspectives, e.g., by introducing the structural information of programs into the latent representation. However, when dealing with rapidly expanding, unlabeled, cross-language source code datasets from the Internet, two issues remain. First, deep learning models for many code-specific tasks still suffer from the lack of high-quality labels. Second, the structural differences among programming languages make it more difficult to process multiple languages in a single neural architecture. To address these issues, in this article we propose XCode, a novel method for Cross-language Code representation with large-scale pre-training. Concretely, we use abstract syntax trees and ELMo-enhanced variational autoencoders to obtain multiple pre-trained source code language models trained on about 1.5 million code snippets. To fully utilize the knowledge across programming languages, we further propose a Shared Encoder-Decoder (SED) architecture that uses a multi-teacher single-student method to transfer knowledge from the aforementioned pre-trained models to the distilled SED. The pre-trained models and the SED cooperate to better represent the source code. For evaluation, we examine our approach on three typical downstream cross-language tasks, i.e., source code translation, code clone detection, and code-to-code search, on a real-world dataset composed of programming exercises with multiple solutions. Experimental results demonstrate the effectiveness of our approach on cross-language code representation; it also performs significantly better than several code representation baselines on different downstream tasks in terms of multiple automatic evaluation metrics.
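To make the multi-teacher single-student distillation idea described in the abstract more concrete, the following is a minimal PyTorch sketch: one frozen teacher encoder per programming language transfers its code representations into a single Shared Encoder-Decoder (SED) student through a mean-squared-error loss. The toy GRU encoders, the hidden sizes, the language list, and the MSE objective are all illustrative assumptions for this sketch; the paper's actual teachers are ELMo-enhanced variational autoencoders trained on abstract syntax trees, and its distillation objective may differ.

# A minimal, self-contained sketch of multi-teacher single-student distillation,
# under assumed module sizes and a toy MSE objective (not the paper's exact setup).
import torch
import torch.nn as nn

VOCAB_SIZE, EMBED_DIM, HIDDEN_DIM = 1000, 64, 128
LANGUAGES = ["java", "python", "cpp"]  # assumed set of source languages


class TeacherEncoder(nn.Module):
    """Stand-in for one per-language pre-trained code encoder (kept frozen)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
        self.rnn = nn.GRU(EMBED_DIM, HIDDEN_DIM, batch_first=True)

    def forward(self, tokens):
        _, h = self.rnn(self.embed(tokens))
        return h.squeeze(0)  # (batch, HIDDEN_DIM) code representation


class SharedEncoderDecoder(nn.Module):
    """Single student shared across languages; only its encoder is distilled here."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
        self.encoder = nn.GRU(EMBED_DIM, HIDDEN_DIM, batch_first=True)
        self.decoder = nn.GRU(EMBED_DIM, HIDDEN_DIM, batch_first=True)
        self.out = nn.Linear(HIDDEN_DIM, VOCAB_SIZE)

    def encode(self, tokens):
        _, h = self.encoder(self.embed(tokens))
        return h.squeeze(0)


teachers = {lang: TeacherEncoder().eval() for lang in LANGUAGES}
for teacher in teachers.values():
    for p in teacher.parameters():
        p.requires_grad_(False)  # teachers stay fixed during distillation

student = SharedEncoderDecoder()
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
mse = nn.MSELoss()

# One toy distillation step over a mixed-language mini-batch of fake token ids.
for lang in LANGUAGES:
    tokens = torch.randint(0, VOCAB_SIZE, (8, 20))
    with torch.no_grad():
        target = teachers[lang](tokens)              # teacher representation
    loss = mse(student.encode(tokens), target)       # pull student toward teacher
    loss.backward()                                  # accumulate gradients per language
optimizer.step()
optimizer.zero_grad()
print("distillation step done")

In this sketch the gradients from all three language teachers are accumulated before a single optimizer step, so the shared student is nudged toward every teacher at once; the real training pipeline may batch and weight the languages differently.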
Pages: 44