Difficulty Factors and Preprocessing in Imbalanced Data Sets: An Experimental Study on Artificial Data

Cited by: 17
Authors
Wojciechowski S. [1]
Wilk S. [1]
Affiliations
[1] Institute of Computing Science, Poznan University of Technology, Piotrowo 2, Poznan
Source
Year: 1600 / Volume: Walter de Gruyter GmbH / Issue: 42
Keywords
difficulty factors; imbalanced data; learning and classification; preprocessing methods
DOI
10.1515/fcds-2017-0007
Abstract
In this paper we describe the results of an experimental study in which we examined the impact of various difficulty factors in imbalanced data sets on the performance of selected classifiers, applied alone or combined with several preprocessing methods. In the study we used artificial data sets to systematically check factors such as dimensionality, class imbalance ratio, and the distribution of specific types of examples (safe, borderline, rare and outliers) in the minority class. The results revealed that the last of these factors was the most critical one and that it exacerbated the other factors (in particular class imbalance). The best classification performance was demonstrated by non-symbolic classifiers, in particular by k-NN classifiers (with 1 or 3 neighbors: 1NN and 3NN, respectively) and by SVM. Moreover, they benefited from different preprocessing methods: SVM and 1NN worked best with undersampling, while oversampling was more beneficial for 3NN. © by Szymon Wilk 2017.
Pages: 149-176
Page count: 27
Related papers
50 in total
  • [21] The Text Classification for Imbalanced Data Sets
    Li, Yanling
    Zhu, Yehang
    Yang, Ping
    ISISE 2008: INTERNATIONAL SYMPOSIUM ON INFORMATION SCIENCE AND ENGINEERING, VOL 2, 2008, : 778 - +
  • [22] The study of preprocessing methods' utility in analysis of multidimensional and highly imbalanced medical data
    Werner, Aleksandra
    Bach, Malgorzata
    Pluskiewicz, Wojciech
    PROCEEDINGS OF THE 11TH SCIENTIFIC CONFERENCE INTERNET IN THE INFORMATION SOCIETY 2016, 2016, : 71 - 87
  • [23] An empirical study of the behavior of classifiers on imbalanced and overlapped data sets
    Garcia, Vicente
    Sanchez, Jose
    Mollineda, Ramon
    PROGRESS IN PATTERN RECOGNITION, IMAGE ANALYSIS AND APPLICATIONS, PROCEEDINGS, 2007, 4756 : 397 - +
  • [24] Preprocessing of Experimental Seismic Data
    Allakhverdiyeva, Naila
    2014 IEEE 8th International Conference on Application of Information and Communication Technologies (AICT), 2014, : 207 - 210
  • [25] Evaluating Difficulty of Multi-class Imbalanced Data
    Lango, Mateusz
    Napierala, Krystyna
    Stefanowski, Jerzy
    FOUNDATIONS OF INTELLIGENT SYSTEMS, ISMIS 2017, 2017, 10352 : 312 - 322
  • [26] Creation, population and preprocessing of experimental data sets for evaluation of applications for the semantic web
    Frivolt, Gyoergy
    Suchal, Jan
    Vesely, Richard
    Vojtek, Peter
    Vozar, Oto
    Bielikova, Maria
    SOFSEM 2008: THEORY AND PRACTICE OF COMPUTER SCIENCE, 2008, 4910 : 684 - 695
  • [27] Rough sets method for SVM data preprocessing
    Li, Y
    Cai, YZ
    Li, YG
    Xu, XM
    2004 IEEE CONFERENCE ON CYBERNETICS AND INTELLIGENT SYSTEMS, VOLS 1 AND 2, 2004, : 1039 - 1042
  • [28] NEW HYBRID DATA PREPROCESSING TECHNIQUE FOR HIGHLY IMBALANCED DATASET
    Malik, Esraa Faisal
    Khaw, Khai Wah
    Chew, XinYing
    COMPUTING AND INFORMATICS, 2022, 41 (04) : 981 - 1001
  • [29] An Algorithm for Selective Preprocessing of Multi-class Imbalanced Data
    Wojciechowski, Szymon
    Wilk, Szymon
    Stefanowski, Jerzy
    PROCEEDINGS OF THE 10TH INTERNATIONAL CONFERENCE ON COMPUTER RECOGNITION SYSTEMS CORES 2017, 2018, 578 : 238 - 247
  • [30] Data Preprocessing for DES-KNN and Its Application to Imbalanced Medical Data Classification
    Kinal, Maciej
    Wozniak, Michal
    INTELLIGENT INFORMATION AND DATABASE SYSTEMS (ACIIDS 2020), PT I, 2020, 12033 : 589 - 599