Learning and Evaluating Contextual Embedding of Source Code

Cited by: 0
Authors
Kanade, Aditya [1,2]
Maniatis, Petros [2 ]
Balakrishnan, Gogul [2 ]
Shi, Kensen [2 ]
Affiliations
[1] Indian Inst Sci, Bangalore, Karnataka, India
[2] Google Brain, Mountain View, CA 94043 USA
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Recent research has achieved impressive results on understanding and improving source code by building on machine-learning techniques developed for natural languages. A significant advancement in natural-language understanding has come with the development of pre-trained contextual embeddings, such as BERT, which can be fine-tuned for downstream tasks with less labeled data and a smaller training budget, while achieving better accuracies. However, there has been no attempt yet to obtain a high-quality contextual embedding of source code and to evaluate it on multiple program-understanding tasks simultaneously; that is the gap this paper aims to fill. Specifically, first, we curate a massive, deduplicated corpus of 7.4M Python files from GitHub, which we use to pre-train CuBERT, an open-sourced code-understanding BERT model; and, second, we create an open-sourced benchmark that comprises five classification tasks and one program-repair task, akin to code-understanding tasks previously proposed in the literature. We fine-tune CuBERT on our benchmark tasks and compare the resulting models to different variants of Word2Vec token embeddings, BiLSTM and Transformer models, as well as published state-of-the-art models, showing that CuBERT outperforms them all, even with shorter training and fewer labeled examples. Future work on source-code embedding can benefit from reusing our benchmark, and from comparing against CuBERT models as a strong baseline.
Pages: 12
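The abstract describes the standard pre-train-then-fine-tune recipe: a BERT-style encoder is first pre-trained on a large corpus of unlabeled Python files, then fine-tuned per task on a small labeled set. A minimal sketch of the fine-tuning step follows, in Python with PyTorch and HuggingFace transformers; the generic bert-base-uncased checkpoint, the binary buggy-vs-correct task, and the toy snippets are illustrative stand-ins, not the authors' released CuBERT checkpoints or their Python-specific tokenizer.

import torch
from torch.optim import AdamW
from transformers import BertForSequenceClassification, BertTokenizerFast

# Stand-in encoder and tokenizer; a faithful reproduction would load the
# authors' released CuBERT checkpoint and subword vocabulary instead.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # e.g. buggy vs. correct snippet

# Toy labeled examples, in the spirit of the benchmark's classification tasks.
snippets = ["if x == None:\n    pass", "if x is None:\n    pass"]
labels = torch.tensor([1, 0])

batch = tokenizer(snippets, padding=True, truncation=True,
                  max_length=128, return_tensors="pt")

optimizer = AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):  # a few epochs; fine-tuning needs comparatively little data
    optimizer.zero_grad()
    out = model(**batch, labels=labels)
    out.loss.backward()
    optimizer.step()

The same loop generalizes to each of the benchmark's five classification tasks; only num_labels and the labeled dataset change.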
Related Papers (50 in total; items [31]-[40] shown below)
  • [31] Liu, Hong; Ji, Rongrong; Wu, Yongjian; Liu, Wei. Towards Optimal Binary Code Learning via Ordinal Embedding. Thirtieth AAAI Conference on Artificial Intelligence, 2016: 1258-1265.
  • [32] Hussain, Yasir; Huang, Zhiqiu; Zhou, Yu. Improving source code suggestion with code embedding and enhanced convolutional long short-term memory. IET Software, 2021, 15(3): 199-213.
  • [33] Dehter, Mario. Learning to Launch Enterprises with Open Source Code. Ingenieria Solidaria, 2015, 11(18): 9-21.
  • [34] Hellendoorn, Vincent J.; Sawant, Anand Ashok. The Growing Cost of Deep Learning for Source Code. Communications of the ACM, 2022, 65(1): 31-33.
  • [35] Tiwang, Raymond; Oladunni, Timothy; Xu, Weifeng. A Deep Learning Model for Source Code Generation. 2019 IEEE SoutheastCon, 2019.
  • [36] Hagglund, Marcus; Pena, Francisco J.; Pashami, Sepideh; Al-Shishtawy, Ahmad; Payberah, Amir H. COCLUBERT: Clustering Machine Learning Source Code. 20th IEEE International Conference on Machine Learning and Applications (ICMLA 2021), 2021: 151-158.
  • [37] Chandra, Timotius Nugroho; Liem, Inggriani. Source Code Editing Evaluator for Learning Programming. 4th International Conference on Electrical Engineering and Informatics (ICEEI 2013), 2013, 11: 169-175.
  • [38] Hussain, Yasir; Huang, Zhiqiu; Zhou, Yu; Wang, Senzhang. Deep Transfer Learning for Source Code Modeling. International Journal of Software Engineering and Knowledge Engineering, 2020, 30(5): 649-668.
  • [39] Abbas, Asim; Lee, Mark; Shanavas, Niloofer; Kovatchev, Venelin. Clinical concept annotation with contextual word embedding in active transfer learning environment. Digital Health, 2024, 10.
  • [40] Cosma, Georgina; Joy, Mike. Evaluating the Performance of LSA for Source-code Plagiarism Detection. Informatica - Journal of Computing and Informatics, 2012, 36(4): 409-424.