Learning and Evaluating Contextual Embedding of Source Code

Authors
Kanade, Aditya [1, 2]
Maniatis, Petros [2]
Balakrishnan, Gogul [2]
Shi, Kensen [2]
Affiliations
[1] Indian Inst Sci, Bangalore, Karnataka, India
[2] Google Brain, Mountain View, CA 94043 USA
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Recent research has achieved impressive results on understanding and improving source code by building on machine-learning techniques developed for natural languages. A significant advancement in natural-language understanding has come with the development of pre-trained contextual embeddings, such as BERT, which can be fine-tuned for downstream tasks with less labeled data and a smaller training budget, while achieving higher accuracy. However, there has been no attempt yet to obtain a high-quality contextual embedding of source code and to evaluate it on multiple program-understanding tasks simultaneously; that is the gap that this paper aims to close. Specifically, first, we curate a massive, deduplicated corpus of 7.4M Python files from GitHub, which we use to pre-train CuBERT, an open-sourced code-understanding BERT model; and, second, we create an open-sourced benchmark that comprises five classification tasks and one program-repair task, akin to code-understanding tasks previously proposed in the literature. We fine-tune CuBERT on our benchmark tasks and compare the resulting models to different variants of Word2Vec token embeddings, BiLSTM and Transformer models, as well as published state-of-the-art models, showing that CuBERT outperforms them all, even with shorter training and with fewer labeled examples. Future work on source-code embedding can benefit from reusing our benchmark and from comparing against CuBERT models as a strong baseline.
Pages: 12