Benchmarking the benchmark - Comparing synthetic and real-world Network IDS datasets

Cited by: 2
Authors
Layeghy, Siamak [1]
Gallagher, Marcus [1]
Portmann, Marius [1]
Affiliations
[1] Univ Queensland, Sch ITEE, Brisbane, Qld 4072, Australia
Keywords
Network traffic characteristics; Feature distribution; Network Intrusion Detection System (NIDS) dataset; Real-world NIDS dataset; Synthetic NIDS dataset; Machine learning benchmark dataset; ANOMALY DETECTION
DOI
10.1016/j.jisa.2023.103689
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Network Intrusion Detection Systems (NIDSs) are an increasingly important tool for the prevention and mitigation of cyber attacks. In recent years, considerable research effort has aimed at leveraging increasingly powerful Machine Learning (ML) models for this purpose. A number of labelled synthetic datasets have been generated and made publicly available by researchers, and they have become the benchmarks against which new ML-based NIDS classifiers are evaluated. Recently published results show excellent classification performance on these datasets, increasingly approaching 100 percent across key evaluation metrics such as Accuracy, F1 score, and AUC. Unfortunately, we have not yet seen these excellent academic research results translated into practical NIDS systems with such near-perfect performance. This motivated the research presented in this paper, where we analyse the statistical properties of the benign traffic in three of the more recent and relevant NIDS datasets (CIC_IDS, UNSW_NB15, TON_IOT) by converting them into a common flow format. As a comparison, we consider two datasets obtained from real-world production networks, one from a university network and one from a medium-size Internet Service Provider (ISP). Our results show that the two real-world datasets are quite similar to each other with regard to most of the considered statistical features. Equally, the three synthetic datasets are relatively similar within their group. However, and most importantly, our results show a distinct difference in most of the considered statistical features between the three synthetic datasets and the two real-world datasets. Since ML relies on the basic assumption that training and test datasets are sampled from the same distribution, this raises the question of how well the performance results of ML classifiers trained on the considered synthetic datasets can translate and generalise to real-world networks. We believe this is an interesting and relevant question that provides motivation for further research in this space.
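The kind of comparison the abstract describes, checking whether one flow feature is distributed similarly in two datasets, can be illustrated with a small sketch. The abstract does not state which statistical measure the authors use, so the two-sample Kolmogorov-Smirnov statistic below is only a stand-in. The function name `ks_statistic`, the feature (flow duration), and all sample values are hypothetical toy data generated for illustration; they are not taken from the paper or from any of the named datasets.

```python
import random
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest absolute gap
    between the empirical CDFs of the two samples (0 = identical, 1 = disjoint)."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    largest_gap = 0.0
    # The empirical CDFs only change at observed values, so checking those is enough.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


# Made-up benign flow durations in seconds (assumption: toy log-normal data,
# NOT measurements from any real or synthetic NIDS dataset).
rng = random.Random(0)
real_a = [rng.lognormvariate(0.0, 1.0) for _ in range(2000)]
real_b = [rng.lognormvariate(0.1, 1.0) for _ in range(2000)]
synthetic = [rng.lognormvariate(2.0, 0.5) for _ in range(2000)]

within_group = ks_statistic(real_a, real_b)
across_groups = ks_statistic(real_a, synthetic)
print(f"within group:  {within_group:.3f}")
print(f"across groups: {across_groups:.3f}")
```

A small within-group value next to a large across-group value is the pattern the abstract reports for most features. In practice, a library routine such as `scipy.stats.ks_2samp` would likely be preferable to this hand-rolled version, and the measure would be computed per feature after converting all datasets to the same flow format.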
Pages: 18