Learning and Evaluating Contextual Embedding of Source Code

Cited: 0
Authors
Kanade, Aditya [1 ,2 ]
Maniatis, Petros [2 ]
Balakrishnan, Gogul [2 ]
Shi, Kensen [2 ]
Affiliations
[1] Indian Inst Sci, Bangalore, Karnataka, India
[2] Google Brain, Mountain View, CA 94043 USA
Keywords
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence];
Subject Classification
081104; 0812; 0835; 1405;
Abstract
Recent research has achieved impressive results on understanding and improving source code by building up on machine-learning techniques developed for natural languages. A significant advancement in natural-language understanding has come with the development of pre-trained contextual embeddings, such as BERT, which can be fine-tuned for downstream tasks with less labeled data and training budget, while achieving better accuracies. However, there is no attempt yet to obtain a high-quality contextual embedding of source code, and to evaluate it on multiple program-understanding tasks simultaneously; that is the gap that this paper aims to mitigate. Specifically, first, we curate a massive, deduplicated corpus of 7.4M Python files from GitHub, which we use to pre-train CuBERT, an open-sourced code-understanding BERT model; and, second, we create an open-sourced benchmark that comprises five classification tasks and one program-repair task, akin to code-understanding tasks proposed in the literature before. We fine-tune CuBERT on our benchmark tasks, and compare the resulting models to different variants of Word2Vec token embeddings, BiLSTM and Transformer models, as well as published state-of-the-art models, showing that CuBERT outperforms them all, even with shorter training, and with fewer labeled examples. Future work on source-code embedding can benefit from reusing our benchmark, and from comparing against CuBERT models as a strong baseline.
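To make the fine-tuning setup described in the abstract concrete, below is a minimal sketch (not the authors' released code) of fine-tuning a BERT-style encoder on a binary code-classification task, in the spirit of the benchmark's classification tasks. It assumes the HuggingFace transformers and PyTorch libraries; the "bert-base-uncased" checkpoint and the two toy snippets are illustrative placeholders, since the actual CuBERT checkpoints, vocabulary, and benchmark data are distributed separately by the authors.

# Hedged sketch: fine-tune a BERT-style encoder for a binary code-classification
# task (e.g., "does this function contain a bug?"). Checkpoint name and toy
# dataset are placeholders, not the released CuBERT artifacts.
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import BertTokenizerFast, BertForSequenceClassification

MODEL_NAME = "bert-base-uncased"  # placeholder; CuBERT ships its own vocabulary/checkpoints

class CodeClassificationDataset(Dataset):
    """Pairs of (source snippet, 0/1 label), tokenized like natural-language text."""
    def __init__(self, snippets, labels, tokenizer, max_length=128):
        self.enc = tokenizer(snippets, truncation=True, padding="max_length",
                             max_length=max_length, return_tensors="pt")
        self.labels = torch.tensor(labels)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        return {"input_ids": self.enc["input_ids"][i],
                "attention_mask": self.enc["attention_mask"][i],
                "labels": self.labels[i]}

# Toy examples standing in for a real benchmark task.
snippets = ["def add(a, b):\n    return a + b",
            "def add(a, b):\n    return a + a"]  # hypothetical variable misuse
labels = [0, 1]

tokenizer = BertTokenizerFast.from_pretrained(MODEL_NAME)
model = BertForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

loader = DataLoader(CodeClassificationDataset(snippets, labels, tokenizer),
                    batch_size=2, shuffle=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for epoch in range(3):  # a short fine-tuning budget, as the abstract emphasizes
    for batch in loader:
        optimizer.zero_grad()
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()

The key point this mirrors is that a pre-trained contextual encoder only needs a small task-specific head and a short fine-tuning run, rather than task-specific training from scratch on large labeled sets.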
Pages: 12
Related Papers
50 items in total
  • [21] Source Code Analysis in Programming Education: Evaluating Learning Content with Self-Organizing Maps
    Jevtic, Marko
    Mladenovic, Sasa
    Granic, Andrina
    APPLIED SCIENCES-BASEL, 2023, 13 (09):
  • [22] Semantic Similarity Metrics for Evaluating Source Code Summarization
    Haque, Sakib
    Eberhart, Zachary
    Bansal, Aakash
    McMillan, Collin
    arXiv, 2022,
  • [23] Semantic Similarity Metrics for Evaluating Source Code Summarization
    Haque, Sakib
    Eberhart, Zachary
    Bansal, Aakash
    McMillan, Collin
    30TH IEEE/ACM INTERNATIONAL CONFERENCE ON PROGRAM COMPREHENSION (ICPC 2022), 2022, : 36 - 47
  • [25] Evaluating Source Code Summarization Techniques: Replication and Expansion
    Eddy, Brian P.
    Robinson, Jeffrey A.
    Kraft, Nicholas A.
    Carver, Jeffrey C.
    2013 IEEE 21ST INTERNATIONAL CONFERENCE ON PROGRAM COMPREHENSION (ICPC), 2013, : 13 - 22
  • [26] Learning distributed word representation with multi-contextual mixed embedding
    Li, Jianqiang
    Li, Jing
    Fu, Xianghua
    Masud, M. A.
    Huang, Joshua Zhexue
    KNOWLEDGE-BASED SYSTEMS, 2016, 106 : 220 - 230
  • [27] A framework for measuring and evaluating program source code quality
    Washizaki, Hironori
    Namiki, Rieko
    Fukuoka, Tomoyuki
    Harada, Yoko
    Watanabe, Hiroyuki
    PRODUCT-FOCUSED SOFTWARE PROCESS IMPROVEMENT, PROCEEDINGS, 2007, 4589 : 284+
  • [28] Graph Neural Network contextual embedding for Deep Learning on tabular data
    Villaizan-Vallelado, Mario
    Salvatori, Matteo
    Carro, Belen
    Sanchez-Esguevillas, Antonio Javier
    NEURAL NETWORKS, 2024, 173
  • [29] Cracking the Code: Understanding Source Code Plagiarism to Enable Learning
    Smit, Imelda
    du Plessis, Linda
    SOFTWARE ENGINEERING METHODS DESIGN AND APPLICATION, VOL 1, CSOC 2024, 2024, 1118 : 312 - 326
  • [30] Evaluating Network Embedding Models for Machine Learning Tasks
    Oluigbo, Ikenna
    Haddad, Mohammed
    Seba, Hamida
    COMPLEX NETWORKS AND THEIR APPLICATIONS VIII, VOL 1, 2020, 881 : 915 - 927