Evaluation of Synthetic Data Generation Techniques in the Domain of Anonymous Traffic Classification

Cited by: 4
Authors
Cullen, Drake [1]
Halladay, James [1]
Briner, Nathan [1]
Basnet, Ram [1]
Bergen, Jeremy [1]
Doleck, Tenzin [2]
Affiliations
[1] Colorado Mesa Univ CMU, Dept Comp Sci & Engn, Grand Junction, CO 81501 USA
[2] Simon Fraser Univ, Fac Educ, Burnaby, BC V5A 1S6, Canada
Source
IEEE ACCESS | 2022 / Vol. 10
Keywords
Anonymous traffic; synthetic data; CopulaGAN; CTGAN; SMOTE; VAE; TabNet; deep learning; machine learning; unbalanced data
DOI
10.1109/ACCESS.2022.3228507
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Anonymous network traffic is more pervasive than ever due to the accessibility of services such as virtual private networks (VPN) and The Onion Router (Tor). To address the need to identify and classify this traffic, machine and deep learning solutions have become the standard. However, high-performing classifiers often scale poorly when applied to real-world traffic classification due to the heavily skewed nature of network traffic data. Prior research has found synthetic data generation to be effective at alleviating concerns surrounding class imbalance, though a limited number of these techniques have been applied to the domain of anonymous network traffic detection. This work compares the ability of a Conditional Tabular Generative Adversarial Network (CTGAN), Copula Generative Adversarial Network (CopulaGAN), Variational Autoencoder (VAE), and Synthetic Minority Over-sampling Technique (SMOTE) to create viable synthetic anonymous network traffic samples. Moreover, we evaluate the performance of several shallow boosting and bagging classifiers as well as deep learning models on the synthetic data. Ultimately, we amalgamate the data generated by the GANs, VAE, and SMOTE into a comprehensive dataset dubbed CMU-SynTraffic-2022 for future research on this topic. Our findings show that SMOTE consistently outperformed the other upsampling techniques, improving classifiers' F1-scores over the control by approximately 7.5% for application type characterization. Among the tested classifiers, Light Gradient Boosting Machine achieved the highest F1-score of 90.3% on eight application types.
Pages: 129612-129625
Page count: 14