The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation

Authors
Dung Nguyen Manh [1]
Nam Le Hai [1, 3]
Dau, Anh T. V. [1, 3]
Anh Minh Nguyen [1]
Khanh Nghiem [1]
Guo, Jin [4, 5]
Bui, Nghi D. Q. [2]
Affiliations
[1] IFPT, Software Ctr, Nanjing, Peoples R China
[2] Fulbright Univ, Ho Chi Minh City, Vietnam
[3] Hanoi Univ Sci & Technol, Hanoi, Vietnam
[4] McGill Univ, Sch Comp Sci, Montreal, PQ, Canada
[5] Mila Quebec AI Inst, Montreal, PQ, Canada
Source
FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS - EMNLP 2023 | 2023
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We present The Vault, a dataset of high-quality code-text pairs in multiple programming languages for training large language models to understand and generate code. We present methods for thoroughly extracting samples that use both rule-based and deep learning-based methods to ensure that they contain high-quality pairs of code and text, resulting in a dataset of 43 million high-quality code-text pairs. Our extensive evaluations on common coding tasks including code generation, code search and code summarization show that when fine-tuning Code Large Language Models on The Vault, such models outperform the same models trained on other datasets such as CodeSearchNet. We also provide detailed analyses of our datasets to assess the effects of various programming languages and docstrings on the performance of such models.
Pages: 4763-4788
Page count: 26