DiverseVul: A New Vulnerable Source Code Dataset for Deep Learning Based Vulnerability Detection

Cited by: 24
Authors
Chen, Yizheng [1]
Ding, Zhoujie [2]
Alowain, Lamya [3]
Chen, Xinyun [4]
Wagner, David [2]
Affiliations
[1] Univ Maryland, Baltimore, MD 21201 USA
[2] Univ Calif Berkeley, Berkeley, CA USA
[3] King Abdulaziz City Sci & Technol, Riyadh, Saudi Arabia
[4] Google Deepmind, London, England
Keywords
datasets; vulnerability detection; deep learning; large language models
DOI
10.1145/3607199.3607242
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
We propose and release a new vulnerable source code dataset. We curate the dataset by crawling security issue websites and extracting vulnerability-fixing commits and source code from the corresponding projects. Our new dataset contains 18,945 vulnerable functions spanning 150 CWEs and 330,492 non-vulnerable functions extracted from 7,514 commits. Our dataset covers 295 more projects than all previous datasets combined. Combining our new dataset with previous datasets, we present an analysis of the challenges and promising research directions of using deep learning for detecting software vulnerabilities. We study 11 model architectures belonging to 4 families. Our results show that deep learning is still not ready for vulnerability detection, due to high false positive rates, low F1 scores, and difficulty detecting hard CWEs. In particular, we demonstrate an important generalization challenge for the deployment of deep learning-based models. We show that increasing the volume of training data may not further improve the performance of deep learning models for vulnerability detection, but might help improve generalization to unseen projects. We also identify hopeful future research directions. We demonstrate that large language models (LLMs) are a promising research direction for ML-based vulnerability detection, outperforming Graph Neural Networks (GNNs) with code-structure features in our experiments. Moreover, developing source-code-specific pre-training objectives is a promising research direction for improving vulnerability detection performance.
Pages: 654-668
Page count: 15