Improving Instance Selection Methods for Big Data Classification

Authors
Malhat, Mohamed [1]
El Menshawy, Mohamed [1]
Mousa, Hamdy [1]
El Sisi, Ashraf [1]
Affiliations
[1] Menoufia Univ, Fac Comp & Informat, Comp Sci Dept, Shibin Al Kawm, Egypt
Keywords
Big Data; Data Mining; Data Reduction; Instance Selection; Reduction
DOI
Not available
Chinese Library Classification number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The explosion of data in many application domains has given rise to the term big data. As data volume rapidly grows beyond the capacity and processing capabilities of existing data mining algorithms, those algorithms become ineffective, and instance selection becomes a necessary step before applying them. Instance selection methods scale the training set down by eliminating redundant, erroneous, and irrelevant instances. Recently, instance selection methods have been adapted to big data sets by splitting the training data into disjoint subsets and applying instance selection to each subset individually. However, these adapted methods show variable performance in terms of reduction rate and classification accuracy. In this work, we propose an operational and unified framework that balances reduction rate against classification accuracy. It starts by splitting the training set into class-balanced subsets, which allows us to analyze the impact of the splitting process on reduction rate and classification accuracy. It then applies two different instance selection methods to each subset. We implement the framework and test it experimentally on two standard data sets. With the random splitting process as a benchmark, the results show that the class-balanced splitting process is preferable with respect to classification accuracy. The results also indicate that combining two instance selection methods remarkably reduces the performance variability.
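The pipeline described in the abstract (class-balanced splitting, then two instance selection methods per subset) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not name the two methods, so Wilson's edited nearest neighbour (ENN) and Hart's condensed nearest neighbour (CNN) are used here as stand-ins, and every function name, parameter, and the toy data set are hypothetical.

```python
import random
from collections import Counter, defaultdict


def class_balanced_split(labels, n_subsets, seed=0):
    """Partition instance indices into disjoint subsets with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    subsets = [[] for _ in range(n_subsets)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the instances of each class round-robin across the subsets.
        for pos, idx in enumerate(indices):
            subsets[pos % n_subsets].append(idx)
    return subsets


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def enn_filter(X, y, subset, k=3):
    """Stand-in method 1 (ENN): drop instances that disagree with the
    majority label of their k nearest neighbours inside the subset."""
    kept = []
    for i in subset:
        others = sorted((j for j in subset if j != i),
                        key=lambda j: sq_dist(X[i], X[j]))[:k]
        votes = Counter(y[j] for j in others)
        if votes and votes.most_common(1)[0][0] == y[i]:
            kept.append(i)
    return kept


def cnn_condense(X, y, subset):
    """Stand-in method 2 (CNN): keep only instances that the current
    store would misclassify with a 1-nearest-neighbour rule."""
    if not subset:
        return []
    store = [subset[0]]
    changed = True
    while changed:
        changed = False
        for i in subset:
            if i in store:
                continue
            nearest = min(store, key=lambda j: sq_dist(X[i], X[j]))
            if y[nearest] != y[i]:
                store.append(i)
                changed = True
    return store


def select_instances(X, y, n_subsets=2, k=3, seed=0):
    """Split into class-balanced subsets, then apply both methods to each subset."""
    selected = []
    for subset in class_balanced_split(y, n_subsets, seed):
        cleaned = enn_filter(X, y, subset, k)
        selected.extend(cnn_condense(X, y, cleaned))
    return sorted(selected)


# Toy data (hypothetical): two well-separated classes of eight points each.
base = [(0, 0), (0, 1), (1, 0), (1, 1), (0, 2), (2, 0), (1, 2), (2, 1)]
X = base + [(a + 10, b + 10) for a, b in base]
y = [0] * 8 + [1] * 8
selected = select_instances(X, y, n_subsets=2)
reduction_rate = 1 - len(selected) / len(X)
```

Applying the noise filter before the condensation step is one possible way to combine two methods; the order and the combination rule used in the paper may differ.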
Pages: 213-218
Number of pages: 6