Dealing with Class Noise in Large Training Datasets for Malware Detection

Cited by: 4
Authors
Gavrilut, Dragos [1, 2]
Ciortuz, Liviu [1 ]
Affiliations
[1] Alexandru Ioan Cuza Univ, Fac Comp Sci, Iasi, Romania
[2] BitDefender Antivirus Res Lab, Iasi, Romania
Keywords
Malware detection; perceptrons; class noise; classification
DOI
10.1109/SYNASC.2011.39
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
This paper presents the approaches we have explored so far for detecting and dealing with the class noise found in large annotated datasets used to train the classifiers that we previously designed for industrial-scale malware identification. First, we established a number of distance-based filtering rules that allow us to identify different "levels" of potential noise in the training data. Second, we analysed how either removing or "cleaning" the potentially noisy records affects the performance of our simplest classifiers. We show that careful distance-based filtering can lead to noticeably better results in malware detection.
Pages: 401-407
Page count: 7
Related papers
50 in total
  • [1] Detection and Elimination of Class Noise in Large Datasets using Partitioning Filter Technique
    Zerhari, Btissam
    Lahcen, Ayoub Ait
    Mouline, Salma
    2016 4TH IEEE INTERNATIONAL COLLOQUIUM ON INFORMATION SCIENCE AND TECHNOLOGY (CIST), 2016, : 194 - 199
  • [2] One side class SVM training methods for malware detection
    Popoiu, George
    2022 24TH INTERNATIONAL SYMPOSIUM ON SYMBOLIC AND NUMERIC ALGORITHMS FOR SCIENTIFIC COMPUTING, SYNASC, 2022, : 359 - 364
  • [3] Dealing with Contaminated Datasets: an Approach to Classifier Training
    Homenda, Wladyslaw
    Jastrzebska, Agnieszka
    Rybnik, Mariusz
    PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON NUMERICAL ANALYSIS AND APPLIED MATHEMATICS 2015 (ICNAAM-2015), 2016, 1738
  • [4] Class Noise Elimination Approach for Large Datasets Based on a Combination of Classifiers
    Zerhari, Btissam
    2016 2ND INTERNATIONAL CONFERENCE ON CLOUD COMPUTING TECHNOLOGIES AND APPLICATIONS (CLOUDTECH), 2016, : 125 - 130
  • [5] Dealing with Randomness and Concept Drift in Large Datasets
    Mwitondi, Kassim S.
    Said, Raed A.
    DATA, 2021, 6 (07)
  • [6] Performance Comparison of Training Datasets for System Call-Based Malware Detection with Thread Information
    Kajiwara, Yuki
    Zheng, Junjun
    Mouri, Koichi
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2021, E104D (12) : 2173 - 2183
  • [7] Research on the Construction of Malware Variant Datasets and Their Detection Method
    Lu, Faming
    Cai, Zhaoyang
    Lin, Zedong
    Bao, Yunxia
    Tang, Mengfan
    APPLIED SCIENCES-BASEL, 2022, 12 (15)
  • [8] Possibilities and Pitfalls for Dealing with Large Longitudinal Qualitative Datasets
    Devonald, Megan
    Jones, Nicola
    INTERNATIONAL JOURNAL OF QUALITATIVE METHODS, 2023, 22
  • [9] Fast SVM training using edge detection on very large datasets
    Li, Boyang
    Wang, Qiangwei
    Hu, Jinglu
    IEEJ TRANSACTIONS ON ELECTRICAL AND ELECTRONIC ENGINEERING, 2013, 8 (03) : 229 - 237
  • [10] Are Your Training Datasets Yet Relevant? An Investigation into the Importance of Timeline in Machine Learning-Based Malware Detection
    Allix, Kevin
    Bissyande, Tegawende F.
    Klein, Jacques
    Le Traon, Yves
    ENGINEERING SECURE SOFTWARE AND SYSTEMS (ESSOS 2015), 2015, 8978 : 51 - 67