Dealing with Class Noise in Large Training Datasets for Malware Detection

Cited by: 4

Authors:

Gavrilut, Dragos [1, 2]

Ciortuz, Liviu [1]

Affiliations:

[1] Alexandru Ioan Cuza Univ, Fac Comp Sci, Iasi, Romania

[2] BitDefender Antivirus Res Lab, Iasi, Romania

Source:

13TH INTERNATIONAL SYMPOSIUM ON SYMBOLIC AND NUMERIC ALGORITHMS FOR SCIENTIFIC COMPUTING (SYNASC 2011) | 2012

Keywords:

Malware detection; perceptrons; class noise; classification

DOI:

10.1109/SYNASC.2011.39

Chinese Library Classification (CLC):

TP301 [Theory, methods]

Discipline classification code:

081202

Abstract:

This paper presents the approaches we have explored so far for detecting and handling class noise in the large annotated datasets used to train the classifiers we previously designed for industrial-scale malware identification. First, we established a number of distance-based filtering rules that identify different "levels" of potential noise in the training data. Second, we analysed how either removing or "cleaning" the potentially noisy records affects the performance of our simplest classifiers. We show that careful distance-based filtering can lead to noticeably better malware detection results.


Pages: 401-407

Page count: 7
