Learning and Evaluating Contextual Embedding of Source Code

Cited: 0
Authors
Kanade, Aditya [1 ,2 ]
Maniatis, Petros [2 ]
Balakrishnan, Gogul [2 ]
Shi, Kensen [2 ]
Affiliations
[1] Indian Inst Sci, Bangalore, Karnataka, India
[2] Google Brain, Mountain View, CA 94043 USA
Keywords
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence];
Subject Classification
081104; 0812; 0835; 1405;
Abstract
Recent research has achieved impressive results on understanding and improving source code by building up on machine-learning techniques developed for natural languages. A significant advancement in natural-language understanding has come with the development of pre-trained contextual embeddings, such as BERT, which can be fine-tuned for downstream tasks with less labeled data and training budget, while achieving better accuracies. However, there is no attempt yet to obtain a high-quality contextual embedding of source code, and to evaluate it on multiple program-understanding tasks simultaneously; that is the gap that this paper aims to mitigate. Specifically, first, we curate a massive, deduplicated corpus of 7.4M Python files from GitHub, which we use to pre-train CuBERT, an open-sourced code-understanding BERT model; and, second, we create an open-sourced benchmark that comprises five classification tasks and one program-repair task, akin to code-understanding tasks proposed in the literature before. We fine-tune CuBERT on our benchmark tasks, and compare the resulting models to different variants of Word2Vec token embeddings, BiLSTM and Transformer models, as well as published state-of-the-art models, showing that CuBERT outperforms them all, even with shorter training, and with fewer labeled examples. Future work on source-code embedding can benefit from reusing our benchmark, and from comparing against CuBERT models as a strong baseline.
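To make the fine-tuning setup described in the abstract concrete, below is a minimal sketch (not the authors' released code) of fine-tuning a BERT-style encoder on a binary code-classification task, in the spirit of the benchmark's classification tasks. It assumes the HuggingFace transformers and PyTorch libraries; the "bert-base-uncased" checkpoint and the two toy snippets are illustrative placeholders, since the actual CuBERT checkpoints, vocabulary, and benchmark data are distributed separately by the authors.

# Hedged sketch: fine-tune a BERT-style encoder for a binary code-classification
# task (e.g., "does this function contain a bug?"). Checkpoint name and toy
# dataset are placeholders, not the released CuBERT artifacts.
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import BertTokenizerFast, BertForSequenceClassification

MODEL_NAME = "bert-base-uncased"  # placeholder; CuBERT ships its own vocabulary/checkpoints

class CodeClassificationDataset(Dataset):
    """Pairs of (source snippet, 0/1 label), tokenized like natural-language text."""
    def __init__(self, snippets, labels, tokenizer, max_length=128):
        self.enc = tokenizer(snippets, truncation=True, padding="max_length",
                             max_length=max_length, return_tensors="pt")
        self.labels = torch.tensor(labels)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        return {"input_ids": self.enc["input_ids"][i],
                "attention_mask": self.enc["attention_mask"][i],
                "labels": self.labels[i]}

# Toy examples standing in for a real benchmark task.
snippets = ["def add(a, b):\n    return a + b",
            "def add(a, b):\n    return a + a"]  # hypothetical variable misuse
labels = [0, 1]

tokenizer = BertTokenizerFast.from_pretrained(MODEL_NAME)
model = BertForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

loader = DataLoader(CodeClassificationDataset(snippets, labels, tokenizer),
                    batch_size=2, shuffle=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for epoch in range(3):  # a short fine-tuning budget, as the abstract emphasizes
    for batch in loader:
        optimizer.zero_grad()
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()

The key point this mirrors is that a pre-trained contextual encoder only needs a small task-specific head and a short fine-tuning run, rather than task-specific training from scratch on large labeled sets.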
Pages: 12
Related Papers
50 items in total
  • [21] Source Code Analysis in Programming Education: Evaluating Learning Content with Self-Organizing Maps
    Jevtic, Marko
    Mladenovic, Sasa
    Granic, Andrina
    APPLIED SCIENCES-BASEL, 2023, 13 (09):
  • [22] Semantic Similarity Metrics for Evaluating Source Code Summarization
    Haque, Sakib
    Eberhart, Zachary
    Bansal, Aakash
    McMillan, Collin
    arXiv, 2022,
  • [23] Semantic Similarity Metrics for Evaluating Source Code Summarization
    Haque, Sakib
    Eberhart, Zachary
    Bansal, Aakash
    McMillan, Collin
    30TH IEEE/ACM INTERNATIONAL CONFERENCE ON PROGRAM COMPREHENSION (ICPC 2022), 2022, : 36 - 47
  • [25] Evaluating Source Code Summarization Techniques: Replication and Expansion
    Eddy, Brian P.
    Robinson, Jeffrey A.
    Kraft, Nicholas A.
    Carver, Jeffrey C.
    2013 IEEE 21ST INTERNATIONAL CONFERENCE ON PROGRAM COMPREHENSION (ICPC), 2013, : 13 - 22
  • [26] Learning distributed word representation with multi-contextual mixed embedding
    Li, Jianqiang
    Li, Jing
    Fu, Xianghua
    Masud, M. A.
    Huang, Joshua Zhexue
    KNOWLEDGE-BASED SYSTEMS, 2016, 106 : 220 - 230
  • [27] A framework for measuring and evaluating program source code quality
    Washizaki, Hironori
    Namiki, Rieko
    Fukuoka, Tomoyuki
    Harada, Yoko
    Watanabe, Hiroyuki
    PRODUCT-FOCUSED SOFTWARE PROCESS IMPROVEMENT, PROCEEDINGS, 2007, 4589 : 284+
  • [28] Graph Neural Network contextual embedding for Deep Learning on tabular data
    Villaizan-Vallelado, Mario
    Salvatori, Matteo
    Carro, Belen
    Sanchez-Esguevillas, Antonio Javier
    NEURAL NETWORKS, 2024, 173
  • [29] Cracking the Code: Understanding Source Code Plagiarism to Enable Learning
    Smit, Imelda
    du Plessis, Linda
    SOFTWARE ENGINEERING METHODS DESIGN AND APPLICATION, VOL 1, CSOC 2024, 2024, 1118 : 312 - 326
  • [30] Evaluating Network Embedding Models for Machine Learning Tasks
    Oluigbo, Ikenna
    Haddad, Mohammed
    Seba, Hamida
    COMPLEX NETWORKS AND THEIR APPLICATIONS VIII, VOL 1, 2020, 881 : 915 - 927