Evaluation of Synthetic Data Generation Techniques in the Domain of Anonymous Traffic Classification

Cited by: 4
Authors
Cullen, Drake [1]
Halladay, James [1]
Briner, Nathan [1]
Basnet, Ram [1]
Bergen, Jeremy [1]
Doleck, Tenzin [2]
Affiliations
[1] Colorado Mesa Univ CMU, Dept Comp Sci & Engn, Grand Junction, CO 81501 USA
[2] Simon Fraser Univ, Fac Educ, Burnaby, BC V5A 1S6, Canada
Source
IEEE ACCESS | 2022 / Vol. 10
Keywords
Anonymous traffic; synthetic data; CopulaGAN; CTGAN; SMOTE; VAE; TabNet; deep learning; machine learning; unbalanced data;
DOI
10.1109/ACCESS.2022.3228507
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Anonymous network traffic is more pervasive than ever due to the accessibility of services such as virtual private networks (VPN) and The Onion Router (Tor). To address the need to identify and classify this traffic, machine and deep learning solutions have become the standard. However, high-performing classifiers often scale poorly when applied to real-world traffic classification due to the heavily skewed nature of network traffic data. Prior research has found synthetic data generation to be effective at alleviating concerns surrounding class imbalance, though a limited number of these techniques have been applied to the domain of anonymous network traffic detection. This work compares the ability of a Conditional Tabular Generative Adversarial Network (CTGAN), Copula Generative Adversarial Network (CopulaGAN), Variational Autoencoder (VAE), and Synthetic Minority Over-sampling Technique (SMOTE) to create viable synthetic anonymous network traffic samples. Moreover, we evaluate the performance of several shallow boosting and bagging classifiers as well as deep learning models on the synthetic data. Ultimately, we amalgamate the data generated by the GANs, VAE, and SMOTE into a comprehensive dataset dubbed CMU-SynTraffic-2022 for future research on this topic. Our findings show that SMOTE consistently outperformed the other upsampling techniques, improving classifiers' F1-scores over the control by approximately 7.5% for application type characterization. Among the tested classifiers, Light Gradient Boosting Machine achieved the highest F1-score of 90.3% on eight application types.
Pages: 129612 - 129625
Number of pages: 14