The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation

Authors
Dung Nguyen Manh [1]
Nam Le Hai [1,3]
Anh T. V. Dau [1,3]
Anh Minh Nguyen [1]
Khanh Nghiem [1]
Jin Guo [4,5]
Nghi D. Q. Bui [2]
Affiliations
[1] IFPT, Software Ctr, Nanjing, Peoples R China
[2] Fulbright Univ, Ho Chi Minh City, Vietnam
[3] Hanoi Univ Sci & Technol, Hanoi, Vietnam
[4] McGill Univ, Sch Comp Sci, Montreal, PQ, Canada
[5] Mila Quebec AI Inst, Montreal, PQ, Canada
Source
FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS - EMNLP 2023 | 2023
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We present The Vault, a dataset of high-quality code-text pairs in multiple programming languages for training large language models to understand and generate code. We describe methods for thoroughly extracting samples, combining rule-based and deep learning-based techniques to ensure that the samples contain high-quality pairs of code and text; the result is a dataset of 43 million high-quality code-text pairs. Our extensive evaluations on common coding tasks, including code generation, code search, and code summarization, show that Code Large Language Models fine-tuned on The Vault outperform the same models trained on other datasets such as CodeSearchNet. We also provide detailed analyses of our datasets to assess the effects of various programming languages and docstrings on the performance of such models.
Pages: 4763-4788
Page count: 26