Evaluation of Synthetic Data Generation Techniques in the Domain of Anonymous Traffic Classification

Cited by: 4
Authors
Cullen, Drake [1]
Halladay, James [1]
Briner, Nathan [1]
Basnet, Ram [1]
Bergen, Jeremy [1]
Doleck, Tenzin [2]
Affiliations
[1] Colorado Mesa Univ CMU, Dept Comp Sci & Engn, Grand Junction, CO 81501 USA
[2] Simon Fraser Univ, Fac Educ, Burnaby, BC V5A 1S6, Canada
Source
IEEE ACCESS | 2022 / Vol. 10
Keywords
Anonymous traffic; synthetic data; CopulaGAN; CTGAN; SMOTE; VAE; TabNet; deep learning; machine learning; unbalanced data;
DOI
10.1109/ACCESS.2022.3228507
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
Anonymous network traffic is more pervasive than ever due to the accessibility of services such as virtual private networks (VPN) and The Onion Router (Tor). To address the need to identify and classify this traffic, machine and deep learning solutions have become the standard. However, high-performing classifiers often scale poorly when applied to real-world traffic classification due to the heavily skewed nature of network traffic data. Prior research has found synthetic data generation to be effective at alleviating concerns surrounding class imbalance, though a limited number of these techniques have been applied to the domain of anonymous network traffic detection. This work compares the ability of a Conditional Tabular Generative Adversarial Network (CTGAN), Copula Generative Adversarial Network (CopulaGAN), Variational Autoencoder (VAE), and Synthetic Minority Over-sampling Technique (SMOTE) to create viable synthetic anonymous network traffic samples. Moreover, we evaluate the performance of several shallow boosting and bagging classifiers as well as deep learning models on the synthetic data. Ultimately, we amalgamate the data generated by the GANs, VAE, and SMOTE into a comprehensive dataset dubbed CMU-SynTraffic-2022 for future research on this topic. Our findings show that SMOTE consistently outperformed the other upsampling techniques, improving classifiers' F1-scores over the control by approximately 7.5% for application type characterization. Among the tested classifiers, Light Gradient Boosting Machine achieved the highest F1-score of 90.3% on eight application types.
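Illustrative note: the abstract names SMOTE as the best-performing upsampling technique. The sketch below shows the core idea of SMOTE, which is to create a synthetic minority sample by interpolating between an existing minority sample and one of its nearest minority-class neighbours. It is a minimal plain-Python illustration and not the authors' implementation; the function name `smote_sketch`, its parameters, and the toy data are assumptions made for this example. Practical work would more likely rely on an established library such as imbalanced-learn.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points interpolated between minority samples.

    Hypothetical helper for illustration only: `minority` is a list of
    equal-length numeric feature vectors from the under-represented class.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other samples
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest minority neighbours of `base`, excluding itself.
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbours)]
        gap = rng.random()  # position along the segment base -> other, in [0, 1)
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Toy minority class: the four corners of the unit square (assumed data).
toy_minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_synthetic = smote_sketch(toy_minority, n_new=10, k=2)
```

Because every synthetic point lies on a line segment between two real minority samples, the new samples stay inside the region already occupied by the minority class rather than being drawn from a learned generative model, as is the case for CTGAN, CopulaGAN, or a VAE.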
Pages: 129612-129625
Page count: 14