Difficulty Factors and Preprocessing in Imbalanced Data Sets: An Experimental Study on Artificial Data

Cited by: 17

Authors:

Wojciechowski S. [1]

Wilk S. [1]

Affiliations:

[1] Institute of Computing Science, Poznan University of Technology, Piotrowo 2, Poznan

Source:

Walter de Gruyter GmbH, 2017, 42

Keywords:

difficulty factors; imbalanced data; learning and classification; preprocessing methods

DOI:

10.1515/fcds-2017-0007

Abstract:

In this paper we describe the results of an experimental study in which we examined how various difficulty factors in imbalanced data sets affect the performance of selected classifiers, applied alone or combined with several preprocessing methods. We used artificial data sets in order to systematically vary factors such as dimensionality, class imbalance ratio, and the distribution of specific types of examples (safe, borderline, rare and outliers) in the minority class. The results revealed that the last factor was the most critical one and that it exacerbated the other factors (in particular class imbalance). The best classification performance was demonstrated by non-symbolic classifiers, in particular by k-NN classifiers (with 1 or 3 neighbors: 1NN and 3NN, respectively) and by SVM. Moreover, they benefited from different preprocessing methods: SVM and 1NN worked best with undersampling, while oversampling was more beneficial for 3NN. © by Szymon Wilk 2017.
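The abstract names four types of minority examples (safe, borderline, rare and outliers) but does not say how they are identified. One plausible reading is a neighbourhood rule: look at the k nearest neighbours of each minority example and count how many share its class. The sketch below illustrates that idea only; the function name `label_minority_examples`, the choice of k = 5, the thresholds, and the toy data are all assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import Counter


def label_minority_examples(points, labels, minority, k=5):
    """Tag each minority example by the number of same-class examples
    among its k nearest neighbours (hypothetical thresholds for k=5)."""
    tags = {}
    for i, (p, y) in enumerate(zip(points, labels)):
        if y != minority:
            continue
        # Brute-force Euclidean distances to every other example.
        neighbours = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )[:k]
        same = sum(1 for _, j in neighbours if labels[j] == minority)
        # Assumed cut-offs: 4-5 safe, 2-3 borderline, 1 rare, 0 outlier.
        if same >= 4:
            tags[i] = "safe"
        elif same >= 2:
            tags[i] = "borderline"
        elif same == 1:
            tags[i] = "rare"
        else:
            tags[i] = "outlier"
    return tags


# Toy data: a compact minority cluster, a compact majority cluster,
# and one minority example placed inside the majority cluster.
minority_cluster = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
                    (0.1, 0.1), (0.05, 0.05), (0.2, 0.0)]
majority_cluster = [(10.0, 10.0), (10.1, 10.0), (10.0, 10.1), (10.1, 10.1),
                    (9.9, 10.0), (10.0, 9.9), (9.9, 9.9), (10.2, 10.0)]
stray_minority = [(10.05, 10.05)]

points = minority_cluster + majority_cluster + stray_minority
labels = [1] * len(minority_cluster) + [0] * len(majority_cluster) + [1]

tags = label_minority_examples(points, labels, minority=1)
type_counts = Counter(tags.values())
print(type_counts)
```

On this toy set the six clustered minority points have only minority neighbours, while the stray point has only majority neighbours, so the two groups fall at opposite ends of the scale. Varying the proportions of such types in generated data is one way to probe the kind of difficulty factor the abstract describes.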


Pages: 149-176

Page count: 27
