Evaluation of Synthetic Data Generation Techniques in the Domain of Anonymous Traffic Classification

Cited by: 4
Authors
Cullen, Drake [1]
Halladay, James [1]
Briner, Nathan [1]
Basnet, Ram [1]
Bergen, Jeremy [1]
Doleck, Tenzin [2]
Affiliations
[1] Colorado Mesa Univ CMU, Dept Comp Sci & Engn, Grand Junction, CO 81501 USA
[2] Simon Fraser Univ, Fac Educ, Burnaby, BC V5A 1S6, Canada
Source
IEEE ACCESS | 2022 / Vol. 10
Keywords
Anonymous traffic; synthetic data; CopulaGAN; CTGAN; SMOTE; VAE; TabNet; deep learning; machine learning; unbalanced data;
DOI
10.1109/ACCESS.2022.3228507
Chinese Library Classification
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Anonymous network traffic is more pervasive than ever due to the accessibility of services such as virtual private networks (VPN) and The Onion Router (Tor). To address the need to identify and classify this traffic, machine and deep learning solutions have become the standard. However, high-performing classifiers often scale poorly when applied to real-world traffic classification due to the heavily skewed nature of network traffic data. Prior research has found synthetic data generation to be effective at alleviating concerns surrounding class imbalance, though only a limited number of these techniques have been applied to the domain of anonymous network traffic detection. This work compares the ability of a Conditional Tabular Generative Adversarial Network (CTGAN), Copula Generative Adversarial Network (CopulaGAN), Variational Autoencoder (VAE), and Synthetic Minority Over-sampling Technique (SMOTE) to create viable synthetic anonymous network traffic samples. Moreover, we evaluate the performance of several shallow boosting and bagging classifiers as well as deep learning models on the synthetic data. Ultimately, we amalgamate the data generated by the GANs, VAE, and SMOTE into a comprehensive dataset dubbed CMU-SynTraffic-2022 for future research on this topic. Our findings show that SMOTE consistently outperformed the other upsampling techniques, improving classifiers' F1-scores over the control by approximately 7.5% for application type characterization. Among the tested classifiers, Light Gradient Boosting Machine achieved the highest F1-score of 90.3% on eight application types.
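To make the upsampling idea concrete, the following is a minimal sketch of SMOTE-style interpolation, assuming plain numeric feature vectors. It is not the authors' implementation: the function name `smote_like_oversample`, the parameters `n_new`, `k`, and `seed`, and the two-feature flow records are all hypothetical and chosen only for illustration. Each synthetic point is placed at a random position on the line segment between a minority-class sample and one of its nearest minority-class neighbours.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points interpolated between minority samples
    and their k nearest minority-class neighbours (the core idea of SMOTE).

    Hypothetical helper for illustration only."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base point.
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbours)]
        gap = rng.random()  # position along the segment from base to other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Made-up two-feature records standing in for a rare traffic class.
minority_flows = [[0.10, 200.0], [0.12, 220.0], [0.09, 210.0], [0.11, 190.0]]
new_flows = smote_like_oversample(minority_flows, n_new=6, k=2)
```

Because every synthetic point lies between two real minority samples, the generated values stay inside the range already observed for that class. A production pipeline would more likely rely on an established library implementation than on a hand-rolled helper like this one.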
Pages: 129612-129625
Page count: 14
Related papers
  • [11] Generation and evaluation of medical synthetic data
    Goncalves, Andre R.
    Ray, Priyadip
    Soper, Braden
    Myneni, Madhumita
    Stevens, Jennifer L.
    Coyle, Linda M.
    Sales, Ana Paula
    CANCER RESEARCH, 2019, 79 (13)
  • [12] Machine learning for anonymous traffic detection and classification
    Akshobhya, K. M.
    2021 11TH INTERNATIONAL CONFERENCE ON CLOUD COMPUTING, DATA SCIENCE & ENGINEERING (CONFLUENCE 2021), 2021, : 942 - 947
  • [13] A Hierarchical Classification Approach for Tor Anonymous Traffic
    Jia Lingyu
    Liu Yang
    Wang Bailing
    Liu Hongri
    Xin Guodong
    2017 IEEE 9TH INTERNATIONAL CONFERENCE ON COMMUNICATION SOFTWARE AND NETWORKS (ICCSN), 2017, : 239 - 243
  • [14] Synthetic Network Traffic Data Generation and Classification of Advanced Persistent Threat Samples: A Case Study with GANs and XGBoost
    Anande, T. J.
    Leeson, M. S.
    DEEP LEARNING THEORY AND APPLICATIONS, DELTA 2023, 2023, 1875 : 1 - 18
  • [15] IPTV over EPON: Synthetic traffic generation and performance evaluation
    Bhaumik, Partha
    Reaz, Abu Sayeem
    Murayama, Daisuke
    Suzuki, Ken-Ichi
    Yoshimoto, Naoto
    Kramer, Glen
    Mukherjee, Biswanath
    OPTICAL SWITCHING AND NETWORKING, 2015, 18 : 180 - 190
  • [16] An Evaluation Framework for Synthetic Data Generation Models
    Livieris, I. E.
    Alimpertis, N.
    Domalis, G.
    Tsakalidis, D.
    ARTIFICIAL INTELLIGENCE APPLICATIONS AND INNOVATIONS, PT III, AIAI 2024, 2024, 713 : 320 - 335
  • [17] Synthetic data in medicine: generation, evaluation and limits
    Benani, Alaedine
    Vibert, Julien
    Demuth, Stanislas
    M S-MEDECINE SCIENCES, 2024, 40 (8-9) : 661 - 664
  • [18] Evaluation of Domain Specific Data Augmentation Techniques for the Classification of Celiac Disease using Endoscopic Imagery
    Wimmer, Georg
    Uhl, Andreas
    Vecsei, Andreas
    2017 IEEE 19TH INTERNATIONAL WORKSHOP ON MULTIMEDIA SIGNAL PROCESSING (MMSP), 2017,
  • [19] Generation of Boat Traffic Data: Techniques for Temporal and Spatial Extrapolation
    Ozeren, Yavuz
    Simon, Andrew
    Altinakar, Mustafa
    WORLD ENVIRONMENTAL AND WATER RESOURCES CONGRESS 2016: HYDRAULICS AND WATERWAYS AND HYDRO-CLIMATE/CLIMATE CHANGE, 2016, : 245 - 254
  • [20] Domain Knowledge-Driven Generation of Synthetic Healthcare Data
    Hashemi, Atiye Sadat
    Soliman, Amira
    Lundstrom, Jens
    Etminani, Kobra
    CARING IS SHARING-EXPLOITING THE VALUE IN DATA FOR HEALTH AND INNOVATION-PROCEEDINGS OF MIE 2023, 2023, 302 : 352 - 353