Difficulty Factors and Preprocessing in Imbalanced Data Sets: An Experimental Study on Artificial Data

Cited by: 17

Authors:

Wojciechowski S. [1]

Wilk S. [1]

Affiliations:

[1] Institute of Computing Science, Poznan University of Technology, Piotrowo 2, Poznan

Source:

Year 1600 / Walter de Gruyter GmbH, vol. / Issue 42

Keywords:

difficulty factors; imbalanced data; learning and classification; preprocessing methods

DOI:

10.1515/fcds-2017-0007


Abstract:

In this paper we describe the results of an experimental study in which we examined how various difficulty factors in imbalanced data sets affect the performance of selected classifiers, applied alone or combined with several preprocessing methods. We used artificial data sets to systematically vary factors such as dimensionality, class imbalance ratio, and the distribution of specific types of examples (safe, borderline, rare, and outliers) in the minority class. The results revealed that the last factor was the most critical and that it exacerbated the other factors (in particular class imbalance). The best classification performance was achieved by non-symbolic classifiers, in particular k-NN classifiers (with 1 or 3 neighbors: 1NN and 3NN, respectively) and SVM. Moreover, they benefited from different preprocessing methods: SVM and 1NN worked best with undersampling, while oversampling was more beneficial for 3NN. © by Szymon Wilk 2017.


Pages: 149-176

Page count: 27

