Benchmarking the benchmark - Comparing synthetic and real-world Network IDS datasets

Cited by: 2
Authors
Layeghy, Siamak [1]
Gallagher, Marcus [1]
Portmann, Marius [1]
Affiliations
[1] Univ Queensland, Sch ITEE, Brisbane, Qld 4072, Australia
Keywords
Network traffic characteristics; Feature distribution; Network Intrusion Detection System (NIDS) dataset; Real-world NIDS dataset; Synthetic NIDS dataset; Machine learning benchmark dataset; Anomaly detection
DOI
10.1016/j.jisa.2023.103689
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Network Intrusion Detection Systems (NIDSs) are an increasingly important tool for the prevention and mitigation of cyber attacks. Over the past years, considerable research effort has aimed at leveraging increasingly powerful Machine Learning (ML) models for this purpose. A number of labelled synthetic datasets have been generated and made publicly available by researchers, and they have become the benchmarks against which new ML-based NIDS classifiers are evaluated. Recently published results show excellent classification performance on these datasets, increasingly approaching 100 percent across key evaluation metrics such as accuracy, F1 score, and AUC. Unfortunately, we have not yet seen these excellent academic research results translated into practical NIDSs with such near-perfect performance. This motivated the research presented in this paper, where we analyse the statistical properties of the benign traffic in three of the more recent and relevant NIDS datasets (CIC_IDS, UNSW_NB15, TON_IOT) by converting them into a common flow format. As a comparison, we consider two datasets obtained from real-world production networks, one from a university network and one from a medium-size Internet Service Provider (ISP). Our results show that the two real-world datasets are quite similar to each other with regard to most of the considered statistical features. Equally, the three synthetic datasets are relatively similar within their group. However, and most importantly, our results show a distinct difference in most of the considered statistical features between the three synthetic datasets and the two real-world datasets. Since ML relies on the basic assumption that training and test datasets are sampled from the same distribution, this raises the question of how well the performance results of ML classifiers trained on the considered synthetic datasets can translate and generalise to real-world networks. We believe this is an interesting and relevant question which provides motivation for further research in this space.
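The abstract does not say which statistical measure the authors use to compare feature distributions. As a rough illustration of the idea only, the sketch below computes a two-sample Kolmogorov-Smirnov statistic (the largest gap between two empirical CDFs) for a single flow feature. The feature name, the sample sizes, and the log-normal parameters are all hypothetical stand-ins generated at random; none of the values come from the paper or its datasets.

```python
import random
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the largest absolute gap between the empirical CDFs of the
    two samples: 0.0 means identical empirical distributions, 1.0 means
    the samples do not overlap at all.
    """
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    max_gap = 0.0
    # The empirical CDFs only change at observed values, so it is enough
    # to evaluate the gap at every observed point.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap


# Hypothetical "flow duration" samples (assumption: log-normal shapes chosen
# purely for illustration, not taken from any of the datasets in the paper).
rng = random.Random(0)
real_net_a = [rng.lognormvariate(0.0, 1.0) for _ in range(2000)]
real_net_b = [rng.lognormvariate(0.0, 1.0) for _ in range(2000)]
synthetic = [rng.lognormvariate(2.0, 0.5) for _ in range(2000)]

gap_real_vs_real = ks_statistic(real_net_a, real_net_b)
gap_real_vs_synth = ks_statistic(real_net_a, synthetic)

print(f"real vs real:      {gap_real_vs_real:.3f}")
print(f"real vs synthetic: {gap_real_vs_synth:.3f}")
```

In this toy setup, the two samples drawn from the same distribution give a small statistic, while the sample drawn from a shifted distribution gives a much larger one. That is the kind of within-group similarity and between-group difference the abstract describes, although the paper's own analysis may rely on different measures.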
Pages: 18
Related papers
50 in total
  • [21] On the use of real-world datasets for reaction yield prediction
    Saebi, Mandana
    Nan, Bozhao
    Herr, John E. E.
    Wahlers, Jessica
    Guo, Zhichun
    Zuranski, Andrzej M. M.
    Kogej, Thierry
    Norrby, Per-Ola
    Doyle, Abigail G. G.
    Chawla, Nitesh V. V.
    Wiest, Olaf
    CHEMICAL SCIENCE, 2023, 14 (19) : 4997 - 5005
  • [22] A focus on the use of real-world datasets for yield prediction
    Bustillo, Latimah
    Rodrigues, Tiago
    CHEMICAL SCIENCE, 2023, 14 (19) : 4958 - 4960
  • [23] A Many-Objective Optimization Approach to Generate Synthetic Datasets based on Real-World Classification Problems
    Pereira, Steffano
    Miranda, Pericles
    Franca, Thiago
    Bastos-Filho, Carmelo J. A.
    Si, Tapas
    2022 IEEE LATIN AMERICAN CONFERENCE ON COMPUTATIONAL INTELLIGENCE (LA-CCI), 2022, : 125 - 130
  • [24] Modeling Real-World Load Patterns for Benchmarking in Clouds and Clusters
    Qazi, Kashifuddin
    INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2020, 11 (06) : 1 - 11
  • [25] Challenges in benchmarking stream learning algorithms with real-world data
    Souza, Vinicius M. A.
    dos Reis, Denis M.
    Maletzke, Andre G.
    Batista, Gustavo E. A. P. A.
    DATA MINING AND KNOWLEDGE DISCOVERY, 2020, 34 (06) : 1805 - 1858
  • [27] Benchmarking Object Detection Robustness against Real-World Corruptions
    Liu, Jiawei
    Wang, Zhijie
    Ma, Lei
    Fang, Chunrong
    Bai, Tongtong
    Zhang, Xufan
    Liu, Jia
    Chen, Zhenyu
    INTERNATIONAL JOURNAL OF COMPUTER VISION, 2024, 132 (10) : 4398 - 4416
  • [28] Benchmarking of a Vibration Energy Harvester with Real-world Acceleration Measurements
    Roos, J.
    Blad, T. W. A.
    Spronck, J. W.
    PROCEEDINGS OF 2021 IEEE 30TH INTERNATIONAL SYMPOSIUM ON INDUSTRIAL ELECTRONICS (ISIE), 2021
  • [29] GoBench: A Benchmark Suite of Real-World Go Concurrency Bugs
    Yuan, Ting
    Li, Guangwei
    Lu, Jie
    Liu, Chen
    Li, Lian
    Xue, Jingling
    CGO '21: PROCEEDINGS OF THE 2021 IEEE/ACM INTERNATIONAL SYMPOSIUM ON CODE GENERATION AND OPTIMIZATION (CGO), 2021, : 187 - 199
  • [30] A real-world approach to benchmarking DSP real-time operating systems
    Keate, L
    WESCON/97 - CONFERENCE PROCEEDINGS, 1997, : 418 - 424