DiverseVul: A New Vulnerable Source Code Dataset for Deep Learning Based Vulnerability Detection

Cited by: 24

Authors:

Chen, Yizheng [1]

Ding, Zhoujie [2]

Alowain, Lamya [3]

Chen, Xinyun [4]

Wagner, David [2]

Affiliations:

[1] Univ Maryland, Baltimore, MD 21201 USA

[2] Univ Calif Berkeley, Berkeley, CA USA

[3] King Abdulaziz City Sci & Technol, Riyadh, Saudi Arabia

[4] Google Deepmind, London, England

Source:

PROCEEDINGS OF THE 26TH INTERNATIONAL SYMPOSIUM ON RESEARCH IN ATTACKS, INTRUSIONS AND DEFENSES, RAID 2023 | 2023

Keywords:

datasets; vulnerability detection; deep learning; large language models;

DOI:

10.1145/3607199.3607242

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812 ;

Abstract:

We propose and release a new vulnerable source code dataset. We curate the dataset by crawling security issue websites, extracting vulnerability-fixing commits and source codes from the corresponding projects. Our new dataset contains 18,945 vulnerable functions spanning 150 CWEs and 330,492 non-vulnerable functions extracted from 7,514 commits. Our dataset covers 295 more projects than all previous datasets combined. Combining our new dataset with previous datasets, we present an analysis of the challenges and promising research directions of using deep learning for detecting software vulnerabilities. We study 11 model architectures belonging to 4 families. Our results show that deep learning is still not ready for vulnerability detection, due to high false positive rate, low F1 score, and difficulty of detecting hard CWEs. In particular, we demonstrate an important generalization challenge for the deployment of deep learning-based models. We show that increasing the volume of training data may not further improve the performance of deep learning models for vulnerability detection, but might be useful to improve the generalization ability to unseen projects. We also identify hopeful future research directions. We demonstrate that large language models (LLMs) are a promising research direction for ML-based vulnerability detection, outperforming Graph Neural Networks (GNNs) with code-structure features in our experiments. Moreover, developing source code specific pre-training objectives is a promising research direction to improve the vulnerability detection performance.

Pages: 654-668

Page count: 15

50 items in total

[1] An Empirical Study on Vulnerability Detection for Source Code Software based on Deep Learning
Lin, Wei
Cai, Saihua
2021 21ST INTERNATIONAL CONFERENCE ON SOFTWARE QUALITY, RELIABILITY AND SECURITY COMPANION (QRS-C 2021), 2021, : 1159 - 1160
[2] Survey of source code vulnerability analysis based on deep learning
Liang, Chen
Wei, Qiang
Du, Jiang
Wang, Yisen
Jiang, Zirui
COMPUTERS & SECURITY, 2025, 148
[3] Automated Vulnerability Detection in Source Code Using Deep Representation Learning
Russell, Rebecca L.
Kim, Louis
Hamilton, Lei H.
Lazovich, Tomo
Harer, Jacob A.
Ozdemir, Onur
Ellingwood, Paul M.
McConley, Marc W.
2018 17TH IEEE INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS (ICMLA), 2018, : 757 - 762
[4] Optimising source code vulnerability detection using deep learning and deep graph network
Xuan, Cho Do
Luong, Tran Thi
Thanh, Ma Cong
CONNECTION SCIENCE, 2025, 37 (01)
[5] Labelled Vulnerability Dataset on Android Source Code (LVDAndro) to Develop AI-Based Code Vulnerability Detection Models
Senanayake, Janaka
Kalutarage, Harsha
Al-Kadri, Mhd Omar
Piras, Luca
Petrovski, Andrei
PROCEEDINGS OF THE 20TH INTERNATIONAL CONFERENCE ON SECURITY AND CRYPTOGRAPHY, SECRYPT 2023, 2023, : 659 - 666
[6] On the Code Vulnerability Detection Based on Deep Learning: A Comparative Study
Li, Guiping
Yang, Yege
IEEE ACCESS, 2024, 12 : 152377 - 152391
[7] Source Code Defect Detection Based on Deep Learning
Wang X.-M.
Zhang T.
Xin W.
Hou C.-Y.
Beijing Ligong Daxue Xuebao/Transaction of Beijing Institute of Technology, 2019, 39 (11) : 1155 - 1159
[8] An empirical evaluation of deep learning-based source code vulnerability detection: Representation versus models
Semasaba, Abubakar Omari Abdallah
Zheng, Wei
Wu, Xiaoxue
Agyemang, Samuel Akwasi
Liu, Tao
Ge, Yuan
JOURNAL OF SOFTWARE-EVOLUTION AND PROCESS, 2023, 35 (11)
[9] Research and Progress on Learning-Based Source Code Vulnerability Detection
Su X.-H.
Zheng W.-N.
Jiang Y.
Wei H.-W.
Wan J.-Y.
Wei Z.-Y.
Jisuanji Xuebao/Chinese Journal of Computers, 2024, 47 (02) : 337 - 374
[10] Literature survey of deep learning-based vulnerability analysis on source code
Semasaba, Abubakar Omari Abdallah
Zheng, Wei
Wu, Xiaoxue
Agyemang, Samuel Akwasi
IET SOFTWARE, 2020, 14 (06) : 654 - 664