Improving Instance Selection Methods for Big Data Classification

Cited by: 0
Authors
Malhat, Mohamed [1]
El Menshawy, Mohamed [1]
Mousa, Hamdy [1]
El Sisi, Ashraf [1]
Affiliations
[1] Menoufia Univ, Fac Comp & Informat, Comp Sci Dept, Shibin Al Kawm, Egypt
Keywords
Big data; Data Mining; Data Reduction; Instance Selection; REDUCTION;
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The explosion of data in many application domains has given rise to the term big data. As data volume grows rapidly, it exceeds the capacity and processing capabilities of existing data mining algorithms, which become ineffective. Instance selection therefore becomes a mandatory step prior to applying data mining algorithms. Instance selection methods scale a training set down by eliminating redundant, erroneous, and irrelevant instances. Recently, instance selection methods have been adapted to big data sets by splitting the training data into disjoint subsets and applying instance selection to each subset individually. However, these adapted methods vary in reduction rate and classification accuracy. In this work, we propose an operational and unified framework that balances reduction rate against classification accuracy. It starts by splitting a training set into class-balanced subsets, in order to analyze the impact of the splitting process on reduction rate and classification accuracy. It then applies two different instance selection methods to each subset. We implement the framework and test it experimentally using two standard data sets. With the random splitting process as a benchmark, the results indicate that the class-balanced splitting process is preferable with respect to classification accuracy. The results also show that combining two instance selection methods markedly reduces the performance variability.
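The two steps described in the abstract, class-balanced splitting followed by two instance selection methods per subset, could be sketched roughly as below. This is an illustration only, not the authors' implementation: the abstract names neither the two methods nor how they are combined, so the choice of Edited Nearest Neighbour (with k = 1) and Condensed Nearest Neighbour, their sequential chaining, and every function name here are assumptions.

```python
import random
from collections import defaultdict


def class_balanced_split(labels, n_subsets, seed=0):
    """Split instance indices into disjoint subsets with similar class mixes."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    subsets = [[] for _ in range(n_subsets)]
    offset = 0
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Deal each class round-robin so per-class counts differ by at most 1.
        for pos, idx in enumerate(members):
            subsets[(offset + pos) % n_subsets].append(idx)
        offset += len(members)
    return subsets


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def edited_nn(X, y, idxs):
    """Assumed method 1 (ENN, k=1): drop instances whose nearest
    neighbour inside the subset carries a different label."""
    kept = []
    for i in idxs:
        others = [j for j in idxs if j != i]
        if not others:
            kept.append(i)
            continue
        nearest = min(others, key=lambda j: sq_dist(X[i], X[j]))
        if y[nearest] == y[i]:
            kept.append(i)
    return kept


def condensed_nn(X, y, idxs):
    """Assumed method 2 (CNN): keep only instances that the growing
    store misclassifies under the 1-NN rule."""
    if not idxs:
        return []
    store = [idxs[0]]
    changed = True
    while changed:
        changed = False
        for i in idxs:
            if i in store:
                continue
            nearest = min(store, key=lambda j: sq_dist(X[i], X[j]))
            if y[nearest] != y[i]:
                store.append(i)
                changed = True
    return store


def select_instances(X, y, n_subsets, selectors, seed=0):
    """Apply the selectors in sequence to each class-balanced subset
    (sequential chaining is an assumption) and merge the survivors."""
    selected = []
    for subset in class_balanced_split(y, n_subsets, seed):
        for selector in selectors:
            subset = selector(X, y, subset)
        selected.extend(subset)
    return sorted(selected)
```

Calling `select_instances(X, y, n_subsets, [edited_nn, condensed_nn])` returns the indices of the retained training instances. The reduction rate can then be computed as one minus the ratio of retained to original instances.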
Pages: 213 - 218
Page count: 6