Difficulty Factors and Preprocessing in Imbalanced Data Sets: An Experimental Study on Artificial Data

Cited by: 17
Authors
Wojciechowski S. [1]
Wilk S. [1]
Affiliations
[1] Institute of Computing Science, Poznan University of Technology, Piotrowo 2, Poznan
Source
Walter de Gruyter GmbH, 2017, vol. 42
Keywords
difficulty factors; imbalanced data; learning and classification; preprocessing methods
DOI
10.1515/fcds-2017-0007
Abstract
This paper describes an experimental study of how various difficulty factors in imbalanced data sets affect the performance of selected classifiers, applied alone or combined with several preprocessing methods. The study used artificial data sets to systematically vary factors such as dimensionality, class imbalance ratio, and the distribution of specific types of examples (safe, borderline, rare, and outliers) in the minority class. The results showed that the last factor was the most critical one and that it exacerbated the other factors (in particular class imbalance). The best classification performance was achieved by non-symbolic classifiers, in particular by k-NN classifiers (with 1 or 3 neighbors: 1NN and 3NN, respectively) and by SVM. Moreover, these classifiers benefited from different preprocessing methods: SVM and 1NN worked best with undersampling, while oversampling was more beneficial for 3NN. © by Szymon Wilk 2017.
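The abstract names four types of minority-class examples but does not say how they are identified. As a rough illustration only, the sketch below uses one neighbourhood-based heuristic known from the imbalanced-learning literature: look at the k = 5 nearest neighbours of each minority example and count how many share its class. The thresholds (4 to 5 safe, 2 to 3 borderline, 1 rare, 0 outlier), the function name `minority_example_types`, and the demo data are all assumptions made for this example, not details taken from the paper.

```python
import math


def minority_example_types(points, labels, minority, k=5):
    """Label each minority example by the class mix of its k nearest neighbours.

    Assumed thresholds for k=5 (not taken from the paper):
    4-5 same-class neighbours -> safe, 2-3 -> borderline, 1 -> rare, 0 -> outlier.
    """
    types = {}
    for i, (p, y) in enumerate(zip(points, labels)):
        if y != minority:
            continue
        # Brute-force distances to every other example; fine for small data.
        others = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        same = sum(1 for _, j in others[:k] if labels[j] == minority)
        if same >= 4:
            types[i] = "safe"
        elif same >= 2:
            types[i] = "borderline"
        elif same == 1:
            types[i] = "rare"
        else:
            types[i] = "outlier"
    return types


# Hypothetical demo data: a tight minority cluster, a majority cluster,
# and one minority example placed inside the majority cluster.
demo_points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.2, 0.0), (0.0, 0.2),
    (10.0, 10.0), (10.1, 10.0), (10.0, 10.1), (10.1, 10.1),
    (10.2, 10.0), (10.0, 10.2), (10.2, 10.2),
    (10.05, 10.05),
]
demo_labels = ["min"] * 6 + ["maj"] * 7 + ["min"]
demo_types = minority_example_types(demo_points, demo_labels, minority="min")
print(demo_types)
```

In this toy data the six clustered minority points come out as safe and the isolated one as an outlier. A labelling like this is one way to estimate the distribution of example types that the abstract identifies as the most critical difficulty factor.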
Pages: 149-176
Number of pages: 27