Benchmarking the benchmark - Comparing synthetic and real-world Network IDS datasets

Cited by: 2
Authors
Layeghy, Siamak [1]
Gallagher, Marcus [1]
Portmann, Marius [1]
Affiliations
[1] Univ Queensland, Sch ITEE, Brisbane, Qld 4072, Australia
Keywords
Network traffic characteristics; Feature distribution; Network Intrusion Detection System (NIDS) dataset; Real-world NIDS dataset; Synthetic NIDS dataset; Machine learning benchmark dataset; Anomaly detection
DOI
10.1016/j.jisa.2023.103689
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Network Intrusion Detection Systems (NIDSs) are an increasingly important tool for the prevention and mitigation of cyber attacks. Over the past years, much research effort has aimed at leveraging increasingly powerful Machine Learning (ML) models for this purpose. A number of labelled synthetic datasets have been generated and made publicly available by researchers, and they have become the benchmarks against which new ML-based NIDS classifiers are evaluated. Recently published results show excellent classification performance on these datasets, increasingly approaching 100 percent across key evaluation metrics such as Accuracy, F1 score and AUC. Unfortunately, we have not yet seen these excellent academic research results translated into practical NIDS systems with such near-perfect performance. This motivated the research presented in this paper, where we analyse the statistical properties of the benign traffic in three of the more recent and relevant NIDS datasets (CIC_IDS, UNSW_NB15, TON_IOT) by converting them into a common flow format. As a comparison, we consider two datasets obtained from real-world production networks, one from a university network and one from a medium-size Internet Service Provider (ISP). Our results show that the two real-world datasets are quite similar to each other with regard to most of the considered statistical features. Equally, the three synthetic datasets are also relatively similar within their group. However, and most importantly, our results show a distinct difference in most of the considered statistical features between the three synthetic datasets and the two real-world datasets. Since ML relies on the basic assumption that training and test datasets are sampled from the same distribution, this raises the question of how well the performance results of ML classifiers trained on the considered synthetic datasets can translate and generalise to real-world networks.
We believe this is an interesting and relevant question which provides motivation for further research in this space.
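The comparison described in the abstract, measuring how far apart the distribution of a flow feature is between two datasets, can be illustrated with a small sketch. The abstract does not say which statistical measure the authors use, so the two-sample Kolmogorov-Smirnov statistic below is an assumption chosen only for illustration. The sample values are randomly generated stand-ins for a hypothetical "flow duration" feature and are not data from the paper.

```python
import random
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between the
    empirical CDFs of the two samples (0 = identical, 1 = fully separated)."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    largest_gap = 0.0
    # The empirical CDFs only change at observed values, so checking those suffices.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


# Hypothetical, randomly generated feature samples (assumption: NOT data from the paper).
rng = random.Random(0)
real_net_1 = [rng.expovariate(1.0) for _ in range(2000)]
real_net_2 = [rng.expovariate(1.0) for _ in range(2000)]
synthetic = [rng.expovariate(5.0) for _ in range(2000)]

within_group = ks_statistic(real_net_1, real_net_2)
across_groups = ks_statistic(real_net_1, synthetic)
print(f"within group:  {within_group:.3f}")
print(f"across groups: {across_groups:.3f}")
```

In this toy setup the two samples drawn from the same distribution give a small statistic, while the sample drawn from a different distribution gives a much larger one. That is the kind of within-group versus across-group contrast the abstract reports, here reproduced only on made-up data.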
Pages: 18