Learning and Evaluating Contextual Embedding of Source Code

Cited by: 0
Authors
Kanade, Aditya [1,2]
Maniatis, Petros [2 ]
Balakrishnan, Gogul [2 ]
Shi, Kensen [2 ]
Affiliations
[1] Indian Inst Sci, Bangalore, Karnataka, India
[2] Google Brain, Mountain View, CA 94043 USA
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Recent research has achieved impressive results on understanding and improving source code by building on machine-learning techniques developed for natural languages. A significant advancement in natural-language understanding has come with the development of pre-trained contextual embeddings, such as BERT, which can be fine-tuned for downstream tasks with less labeled data and a smaller training budget, while achieving better accuracies. However, there has been no attempt yet to obtain a high-quality contextual embedding of source code and to evaluate it on multiple program-understanding tasks simultaneously; that is the gap this paper aims to fill. Specifically, first, we curate a massive, deduplicated corpus of 7.4M Python files from GitHub, which we use to pre-train CuBERT, an open-sourced code-understanding BERT model; and, second, we create an open-sourced benchmark that comprises five classification tasks and one program-repair task, akin to code-understanding tasks previously proposed in the literature. We fine-tune CuBERT on our benchmark tasks and compare the resulting models to different variants of Word2Vec token embeddings, BiLSTM and Transformer models, as well as published state-of-the-art models, showing that CuBERT outperforms them all, even with shorter training and fewer labeled examples. Future work on source-code embedding can benefit from reusing our benchmark, and from comparing against CuBERT models as a strong baseline.
Pages: 12
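The abstract describes the standard pre-train-then-fine-tune recipe: a BERT-style encoder is first pre-trained on a large corpus of unlabeled Python files, then fine-tuned per task on a small labeled set. A minimal sketch of the fine-tuning step follows, in Python with PyTorch and HuggingFace transformers; the generic bert-base-uncased checkpoint, the binary buggy-vs-correct task, and the toy snippets are illustrative stand-ins, not the authors' released CuBERT checkpoints or their Python-specific tokenizer.

import torch
from torch.optim import AdamW
from transformers import BertForSequenceClassification, BertTokenizerFast

# Stand-in encoder and tokenizer; a faithful reproduction would load the
# authors' released CuBERT checkpoint and subword vocabulary instead.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # e.g. buggy vs. correct snippet

# Toy labeled examples, in the spirit of the benchmark's classification tasks.
snippets = ["if x == None:\n    pass", "if x is None:\n    pass"]
labels = torch.tensor([1, 0])

batch = tokenizer(snippets, padding=True, truncation=True,
                  max_length=128, return_tensors="pt")

optimizer = AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):  # a few epochs; fine-tuning needs comparatively little data
    optimizer.zero_grad()
    out = model(**batch, labels=labels)
    out.loss.backward()
    optimizer.step()

The same loop generalizes to each of the benchmark's five classification tasks; only num_labels and the labeled dataset change.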
Related Papers (50 in total; items [31]-[40] shown below)
  • [31] Liu, Hong; Ji, Rongrong; Wu, Yongjian; Liu, Wei. Towards Optimal Binary Code Learning via Ordinal Embedding. Thirtieth AAAI Conference on Artificial Intelligence, 2016: 1258-1265.
  • [32] Hussain, Yasir; Huang, Zhiqiu; Zhou, Yu. Improving source code suggestion with code embedding and enhanced convolutional long short-term memory. IET Software, 2021, 15(3): 199-213.
  • [33] Dehter, Mario. Learning to Launch Enterprises with Open Source Code. Ingenieria Solidaria, 2015, 11(18): 9-21.
  • [34] Hellendoorn, Vincent J.; Sawant, Anand Ashok. The Growing Cost of Deep Learning for Source Code. Communications of the ACM, 2022, 65(1): 31-33.
  • [35] Tiwang, Raymond; Oladunni, Timothy; Xu, Weifeng. A Deep Learning Model for Source Code Generation. 2019 IEEE SoutheastCon, 2019.
  • [36] Hagglund, Marcus; Pena, Francisco J.; Pashami, Sepideh; Al-Shishtawy, Ahmad; Payberah, Amir H. COCLUBERT: Clustering Machine Learning Source Code. 20th IEEE International Conference on Machine Learning and Applications (ICMLA 2021), 2021: 151-158.
  • [37] Chandra, Timotius Nugroho; Liem, Inggriani. Source Code Editing Evaluator for Learning Programming. 4th International Conference on Electrical Engineering and Informatics (ICEEI 2013), 2013, 11: 169-175.
  • [38] Hussain, Yasir; Huang, Zhiqiu; Zhou, Yu; Wang, Senzhang. Deep Transfer Learning for Source Code Modeling. International Journal of Software Engineering and Knowledge Engineering, 2020, 30(5): 649-668.
  • [39] Abbas, Asim; Lee, Mark; Shanavas, Niloofer; Kovatchev, Venelin. Clinical concept annotation with contextual word embedding in active transfer learning environment. Digital Health, 2024, 10.
  • [40] Cosma, Georgina; Joy, Mike. Evaluating the Performance of LSA for Source-code Plagiarism Detection. Informatica - Journal of Computing and Informatics, 2012, 36(4): 409-424.