The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation

Authors
Dung Nguyen Manh [1]
Nam Le Hai [1, 3]
Dau, Anh T. V. [1, 3]
Anh Minh Nguyen [1]
Khanh Nghiem [1]
Guo, Jin [4, 5]
Bui, Nghi D. Q. [2]
Affiliations
[1] IFPT, Software Ctr, Nanjing, Peoples R China
[2] Fulbright Univ, Ho Chi Minh City, Vietnam
[3] Hanoi Univ Sci & Technol, Hanoi, Vietnam
[4] McGill Univ, Sch Comp Sci, Montreal, PQ, Canada
[5] Mila Quebec AI Inst, Montreal, PQ, Canada
Source
Findings of the Association for Computational Linguistics: EMNLP 2023 | 2023
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
We present The Vault, a dataset of high-quality code-text pairs in multiple programming languages for training large language models to understand and generate code. We describe extraction methods that combine rule-based and deep learning-based filtering to ensure that the samples contain high-quality pairs of code and text, resulting in a dataset of 43 million code-text pairs. Our extensive evaluations on common coding tasks, including code generation, code search, and code summarization, show that Code Large Language Models fine-tuned on The Vault outperform the same models trained on other datasets such as CodeSearchNet. We also provide detailed analyses of our dataset to assess the effects of various programming languages and docstrings on the performance of such models.
Pages: 4763-4788
Page count: 26