The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation

Authors
Dung Nguyen Manh [1]
Nam Le Hai [1, 3]
Dau, Anh T. V. [1, 3]
Anh Minh Nguyen [1]
Khanh Nghiem [1]
Guo, Jin [4, 5]
Bui, Nghi D. Q. [2]
Affiliations
[1] IFPT, Software Ctr, Nanjing, Peoples R China
[2] Fulbright Univ, Ho Chi Minh City, Vietnam
[3] Hanoi Univ Sci & Technol, Hanoi, Vietnam
[4] McGill Univ, Sch Comp Sci, Montreal, PQ, Canada
[5] Mila Quebec AI Inst, Montreal, PQ, Canada
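Source
FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS - EMNLP 2023 | 2023
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405
Abstract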
We present The Vault, a dataset of high-quality code-text pairs in multiple programming languages for training large language models to understand and generate code. We present methods for thoroughly extracting samples that use both rule-based and deep learning-based methods to ensure that they contain high-quality pairs of code and text, resulting in a dataset of 43 million high-quality code-text pairs. Our extensive evaluations on common coding tasks including code generation, code search and code summarization show that when fine-tuning Code Large Language Models on The Vault, such models outperform the same models trained on other datasets such as CodeSearchNet. We also provide detailed analyses of our datasets to assess the effects of various programming languages and docstrings on the performance of such models.
Pages: 4763 - 4788
Number of pages: 26