Evaluation of Synthetic Data Generation Techniques in the Domain of Anonymous Traffic Classification

Cited by: 4
Authors
Cullen, Drake [1]
Halladay, James [1]
Briner, Nathan [1]
Basnet, Ram [1]
Bergen, Jeremy [1]
Doleck, Tenzin [2]
Affiliations
[1] Colorado Mesa Univ CMU, Dept Comp Sci & Engn, Grand Junction, CO 81501 USA
[2] Simon Fraser Univ, Fac Educ, Burnaby, BC V5A 1S6, Canada
Source
IEEE ACCESS | 2022 / Vol. 10
Keywords
Anonymous traffic; synthetic data; CopulaGAN; CTGAN; SMOTE; VAE; TabNet; deep learning; machine learning; unbalanced data;
DOI
10.1109/ACCESS.2022.3228507
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Anonymous network traffic is more pervasive than ever due to the accessibility of services such as virtual private networks (VPN) and The Onion Router (Tor). To address the need to identify and classify this traffic, machine and deep learning solutions have become the standard. However, high-performing classifiers often scale poorly when applied to real-world traffic classification due to the heavily skewed nature of network traffic data. Prior research has found synthetic data generation to be effective at alleviating concerns surrounding class imbalance, though a limited number of these techniques have been applied to the domain of anonymous network traffic detection. This work compares the ability of a Conditional Tabular Generative Adversarial Network (CTGAN), Copula Generative Adversarial Network (CopulaGAN), Variational Autoencoder (VAE), and Synthetic Minority Over-sampling Technique (SMOTE) to create viable synthetic anonymous network traffic samples. Moreover, we evaluate the performance of several shallow boosting and bagging classifiers as well as deep learning models on the synthetic data. Ultimately, we amalgamate the data generated by the GANs, VAE, and SMOTE into a comprehensive dataset dubbed CMU-SynTraffic-2022 for future research on this topic. Our findings show that SMOTE consistently outperformed the other upsampling techniques, improving classifiers' F1-scores over the control by approximately 7.5% for application type characterization. Among the tested classifiers, Light Gradient Boosting Machine achieved the highest F1-score of 90.3% on eight application types.
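For readers unfamiliar with SMOTE, the following is a minimal sketch of its core interpolation idea, not the authors' implementation. The function name `smote`, its parameters, and the toy two-feature flow records are all assumptions made for illustration; none of them come from the paper or from CMU-SynTraffic-2022.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority samples."""
    rng = random.Random(seed)
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base point.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        # Pick one of the k nearest neighbours at random.
        neighbour = minority[rng.choice(others[:k])]
        # Place the new point at a random position on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical toy flow features (illustrative values only, not from the paper).
minority_flows = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_flows = smote(minority_flows, n_new=6, k=2)
```

Because each synthetic point lies on a line segment between two real minority samples, the generated data stays inside the region already occupied by that class. This may partly explain why such a simple method can compete with generative models on tabular data, though the abstract itself does not state a reason.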
Pages: 129612-129625
Number of pages: 14