Difficulty Factors and Preprocessing in Imbalanced Data Sets: An Experimental Study on Artificial Data

Cited by: 17
Authors
Wojciechowski S. [1]
Wilk S. [1]
Affiliations
[1] Institute of Computing Science, Poznan University of Technology, Piotrowo 2, Poznan
Source
Walter de Gruyter GmbH, 2017, Vol. 42
Keywords
difficulty factors; imbalanced data; learning and classification; preprocessing methods
DOI
10.1515/fcds-2017-0007
Abstract
In this paper we describe the results of an experimental study in which we examined how various difficulty factors in imbalanced data sets affect the performance of selected classifiers, applied alone or combined with several preprocessing methods. We used artificial data sets in order to systematically vary factors such as dimensionality, class imbalance ratio, and the distribution of specific types of examples (safe, borderline, rare and outliers) in the minority class. The results revealed that the last of these factors was the most critical one and that it exacerbated the other factors (in particular class imbalance). The best classification performance was demonstrated by non-symbolic classifiers, in particular by k-NN classifiers (with 1 or 3 neighbors, denoted 1NN and 3NN, respectively) and by SVM. Moreover, these classifiers benefited from different preprocessing methods: SVM and 1NN worked best with undersampling, while oversampling was more beneficial for 3NN. © by Szymon Wilk 2017.
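The comparison described in the abstract (k-NN with and without resampling on artificial imbalanced data) can be illustrated with a minimal sketch. It is not the authors' experimental setup. The two-Gaussian data generator, the 9:1 imbalance ratio, random undersampling and random oversampling as stand-ins for the preprocessing methods, and the G-mean measure are all assumptions made for illustration, and every function name is hypothetical.

```python
import math
import random
from collections import Counter


def make_data(n_major, n_minor, rng):
    """Hypothetical artificial data: two overlapping 2-D Gaussian classes."""
    major = [((rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)), 0) for _ in range(n_major)]
    minor = [((rng.gauss(2.0, 1.0), rng.gauss(2.0, 1.0)), 1) for _ in range(n_minor)]
    return major + minor


def random_undersample(data, rng):
    """Drop random majority examples until both classes have equal size."""
    minor = [d for d in data if d[1] == 1]
    major = [d for d in data if d[1] == 0]
    return minor + rng.sample(major, len(minor))


def random_oversample(data, rng):
    """Duplicate random minority examples until both classes have equal size."""
    minor = [d for d in data if d[1] == 1]
    major = [d for d in data if d[1] == 0]
    extra = [rng.choice(minor) for _ in range(len(major) - len(minor))]
    return major + minor + extra


def knn_predict(train, x, k):
    """Majority vote among the k nearest training examples (Euclidean distance)."""
    nearest = sorted(train, key=lambda d: math.dist(d[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def g_mean(train, test, k):
    """Geometric mean of sensitivity (minority recall) and specificity."""
    hits, totals = Counter(), Counter()
    for x, y in test:
        totals[y] += 1
        hits[y] += int(knn_predict(train, x, k) == y)
    return math.sqrt((hits[1] / totals[1]) * (hits[0] / totals[0]))


rng = random.Random(0)
train = make_data(450, 50, rng)  # assumed imbalance ratio of 9:1
test = make_data(450, 50, rng)

methods = [("none", None), ("undersampling", random_undersample),
           ("oversampling", random_oversample)]
results = {
    (name, k): g_mean(prep(train, rng) if prep else train, test, k)
    for name, prep in methods
    for k in (1, 3)
}
for key, value in sorted(results.items()):
    print(key, round(value, 3))
```

The printed values depend on the assumed generator and seed, so they should not be read as reproducing the reported findings. The sketch only shows how a preprocessing step and a classifier can be paired and scored on a held-out artificial sample.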
Pages: 149 - 176
Page count: 27
Related papers
50 items in total
  • [41] Data Preprocessing and Classification for Taproot Site Data Sets of PANAX NOTOGINSENG
    Huang, Dao
    He, Jin
    PROCEEDINGS OF THE 2ND INTERNATIONAL CONFERENCE ON MODELLING, IDENTIFICATION AND CONTROL, 2015, 119 : 131 - 134
  • [42] DIFFICULTY FACTORS IN BINARY DATA
    MCDONALD, RP
    AHLAWAT, KS
    BRITISH JOURNAL OF MATHEMATICAL & STATISTICAL PSYCHOLOGY, 1974, 27 (MAY) : 82 - 99
  • [43] Data Augmentation Meta-Classifier Scheme for imbalanced data sets
    Moreno-Barea, Francisco J.
    Jerez, Jose M.
    Franco, Leonardo
    2022 IEEE SYMPOSIUM SERIES ON COMPUTATIONAL INTELLIGENCE (SSCI), 2022, : 1392 - 1399
  • [44] A Hybrid Model Based on Samples Difficulty for Imbalanced Data Classification
    Shan, Ao
    Chung, Yeh-Ching
    ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING, ICANN 2023, PT I, 2023, 14254 : 26 - 37
  • [45] Study on the Method of Data Preprocessing for QAR Data
    Wang Hong
    Huan Xiuxia
    RECENT ADVANCE IN STATISTICS APPLICATION AND RELATED AREAS, PTS 1 AND 2, 2008, : 241 - 244
  • [46] Data Preprocessing for ANN-based Industrial Time-Series Forecasting with Imbalanced Data
    Pisa, Ivan
    Santin, Ignacio
    Lopez Vicario, Jose
    Morell, Antoni
    Vilanova, Ramon
    2019 27TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO), 2019,
  • [47] Goal-Driven On-Line Imbalanced Streaming Data Preprocessing
    Lu, Ching-Hu
    Yu, Chun-Hsien
    Chen, Chang-Ru
    Huang, Shih-Shinh
    2018 IEEE INTERNATIONAL CONFERENCE ON CONSUMER ELECTRONICS-TAIWAN (ICCE-TW), 2018,
  • [48] A preprocessing method combined with an ensemble framework for the multiclass imbalanced data classification
    Pavan Kumar M.R.
    Jayagopal P.
    International Journal of Computers and Applications, 2022, 44 (12) : 1178 - 1185
  • [49] Preprocessing method based on sample resampling for imbalanced data of electronic circuits
    Li R.
    Xu A.
    Sun W.
    Wu Y.
    Xi Tong Gong Cheng Yu Dian Zi Ji Shu/Systems Engineering and Electronics, 2020, 42 (11) : 2654 - 2660
  • [50] Handling imbalanced data sets with a modification of Decorate algorithm
    Kotsiantis, Sotiris B.
    INTERNATIONAL JOURNAL OF COMPUTER APPLICATIONS IN TECHNOLOGY, 2008, 33 (2-3) : 91 - 98