Difficulty Factors and Preprocessing in Imbalanced Data Sets: An Experimental Study on Artificial Data

Cited by: 17
Authors
Wojciechowski S. [1]
Wilk S. [1]
Affiliations
[1] Institute of Computing Science, Poznan University of Technology, Piotrowo 2, Poznan
Source
Walter de Gruyter GmbH, 2017, No. 42
Keywords
difficulty factors; imbalanced data; learning and classification; preprocessing methods
DOI
10.1515/fcds-2017-0007
Abstract
In this paper we describe the results of an experimental study in which we checked the impact of various difficulty factors in imbalanced data sets on the performance of selected classifiers applied alone or combined with several preprocessing methods. In the study we used artificial data sets in order to systematically check factors such as dimensionality, class imbalance ratio, and the distribution of specific types of examples (safe, borderline, rare and outliers) in the minority class. The results revealed that the last of these factors was the most critical one and that it exacerbated the other factors (in particular class imbalance). The best classification performance was demonstrated by non-symbolic classifiers, in particular by k-NN classifiers (with 1 or 3 neighbors: 1NN and 3NN, respectively) and by SVM. Moreover, they benefited from different preprocessing methods: SVM and 1NN worked best with undersampling, while oversampling was more beneficial for 3NN. © by Szymon Wilk 2017.
Pages: 149-176
Page count: 27
Related papers
50 in total
  • [31] Research on imbalanced data set preprocessing based on deep learning
    Wang Fangyu
    Zhang Jianhui
    Bu Youjun
    Chen Bo
    2021 ASIA-PACIFIC CONFERENCE ON COMMUNICATIONS TECHNOLOGY AND COMPUTER SCIENCE (ACCTCS 2021), 2021: 75-79
  • [32] Balanced Neighborhood Classifiers for Imbalanced Data Sets
    Zhu, Shunzhi
    Ma, Ying
    Pan, Weiwei
    Zhu, Xiatian
    Luo, Guangchun
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2014, E97D (12): 3226-3229
  • [33] Classification with local clustering in imbalanced data sets
    Ji, Hua
    Zhang, Huaxiang
    ADVANCED RESEARCH ON INFORMATION SCIENCE, AUTOMATION AND MATERIAL SYSTEM, PTS 1-6, 2011, 219-220: 151-155
  • [34] An evaluation of progressive sampling for imbalanced data sets
    Ng, Willie
    Dash, Manoranjan
    ICDM 2006: SIXTH IEEE INTERNATIONAL CONFERENCE ON DATA MINING, WORKSHOPS, 2006: 657-+
  • [35] Analysis of Data Preprocessing Increasing the Oversampling Ratio for Extremely Imbalanced Big Data Classification
    del Rio, Sara
    Benitez, Jose M.
    Herrera, Francisco
    2015 IEEE TRUSTCOM/BIGDATASE/ISPA, VOL 2, 2015: 180-185
  • [36] A Supervised Learning Approach for Imbalanced Data Sets
    Nguyen, Giang H.
    Bouzerdoum, Abdesselam
    Phung, Son L.
    19TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION, VOLS 1-6, 2008: 3759-3762
  • [37] Evaluation of the Classifiers in Multiparameter and Imbalanced Data Sets
    Piotrowska, Ewelina
    INFORMATION SYSTEMS ARCHITECTURE AND TECHNOLOGY, ISAT 2019, PT II, 2020, 1051: 263-273
  • [38] On Validation Setup for Multiclass Imbalanced Data Sets
    Silva, Evandro J. R.
    Zanchettin, Cleber
    PROCEEDINGS OF 2016 5TH BRAZILIAN CONFERENCE ON INTELLIGENT SYSTEMS (BRACIS 2016), 2016: 468-473
  • [39] Dynamic Feature Weighting for Imbalanced Data Sets
    Dialameh, Maryam
    Jahromi, Mansoor Zolghadri
    2015 SIGNAL PROCESSING AND INTELLIGENT SYSTEMS CONFERENCE (SPIS), 2015: 31-36
  • [40] An Empirical Study on Preprocessing High-dimensional Class-imbalanced Data for Classification
    Yin, Hua
    Gai, Keke
    2015 IEEE 17TH INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING AND COMMUNICATIONS, 2015 IEEE 7TH INTERNATIONAL SYMPOSIUM ON CYBERSPACE SAFETY AND SECURITY, AND 2015 IEEE 12TH INTERNATIONAL CONFERENCE ON EMBEDDED SOFTWARE AND SYSTEMS (ICESS), 2015: 1314-1319