KerMinSVM for imbalanced datasets with a case study on Arabic comics classification

Cited by: 5
Authors
Nayal, Ammar [1]
Jomaa, Hadi [1]
Awad, Marlette [1]
Affiliations
[1] Amer Univ Beirut, Dept Elect & Comp Engn, Beirut, Lebanon
Funding
National Research Foundation, Singapore;
Keywords
Imbalance datasets; Support vector machines; Arabic comics analysis; Natural language processing; Supervised classification;
DOI
10.1016/j.engappai.2017.01.001
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Many studies have classified large text documents using different classifiers, ranging from simple distance-based classifiers such as K-Nearest-Neighbor (KNN) to more advanced classifiers such as Support Vector Machines. Traditional approaches fail on short texts because the limited number of words makes the feature representation sparse. Another common problem in text classification is class imbalance (CI). CI occurs when one class of the data contains most of the samples while the other class contains only a few. Standard classifiers, when applied to imbalanced data, achieve high accuracy on the majority class and low accuracy on the minority one. Motivated by the need to classify the content of Arabic comics, we propose KerMinSVM, a kernel extension of our previously proposed MinSVM, coupled with a new dimensionality reduction scheme based on word root frequency ratios (WRFR). KerMinSVM was tested on multiple imbalanced benchmark datasets, and the results were verified using three measures: accuracy, F-measure, and statistical analysis. WRFR was applied to a manually constructed Arabic comic text dataset to detect strong content in children's comic books. Test results revealed that our proposed framework outperformed most of the methods for imbalanced datasets and short text classification.
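The abstract's point that accuracy alone can hide poor minority-class performance, and that F-measure exposes it, can be illustrated with a small sketch. This is not KerMinSVM or WRFR; the labels, the 95/5 class split, and both sets of predictions below are hypothetical values chosen only for illustration.

```python
# Illustrative sketch: accuracy vs. F-measure on an imbalanced toy problem.
# All data here is hypothetical; label 1 is the minority (positive) class.

def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) with respect to the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

def f_measure(y_true, y_pred, positive=1):
    """F1 score of the positive class: 2*TP / (2*TP + FP + FN)."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# 95 majority samples (label 0) followed by 5 minority samples (label 1).
y_true = [0] * 95 + [1] * 5

# Classifier A always predicts the majority class.
pred_majority = [0] * 100

# Classifier B flags 10 majority samples by mistake but finds 4 of 5 minority ones.
pred_sensitive = [1] * 10 + [0] * 85 + [1] * 4 + [0] * 1

acc_a, f1_a = accuracy(y_true, pred_majority), f_measure(y_true, pred_majority)
acc_b, f1_b = accuracy(y_true, pred_sensitive), f_measure(y_true, pred_sensitive)

print(f"A: accuracy={acc_a:.2f}, F-measure={f1_a:.3f}")
print(f"B: accuracy={acc_b:.2f}, F-measure={f1_b:.3f}")
```

In this toy setup, classifier A scores higher on accuracy yet never detects a minority sample, while classifier B has lower accuracy but a nonzero F-measure on the minority class. This is one reason a study on imbalanced data would report F-measure alongside accuracy.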
Pages: 159-169
Number of pages: 11
Related papers
50 in total
  • [31] An efficient classification approach in imbalanced datasets for intrinsic plagiarism detection
    Polydouri, Andrianna
    Vathi, Eleni
    Siolas, Georgios
    Stafylopatis, Andreas
    EVOLVING SYSTEMS, 2020, 11 (03) : 503 - 515
  • [32] GUM: A Guided Undersampling Method to Preprocess Imbalanced Datasets for Classification
    Sung, Kisuk
    Brown, W. Eric
    Moreno-Centeno, Erick
    Ding, Yu
    2022 IEEE 18TH INTERNATIONAL CONFERENCE ON AUTOMATION SCIENCE AND ENGINEERING (CASE), 2022, : 1086 - 1091
  • [33] Balanced Sampling Meets Imbalanced Datasets in SAR Image Classification
    Jahan, Chowdhury Sadman
    Savakis, Andreas
    GEOSPATIAL INFORMATICS XIII, 2023, 12525
  • [34] An improved Support Vector Machine for the classification of imbalanced biological datasets
    Wang, Haiying
    Zheng, Huiru
    ADVANCED INTELLIGENT COMPUTING THEORIES AND APPLICATIONS, PROCEEDINGS: WITH ASPECTS OF THEORETICAL AND METHODOLOGICAL ISSUES, 2008, 5226 : 63 - +
  • [35] Improving the Performance of Sentiment Classification on Imbalanced Datasets With Transfer Learning
    Xiao, Z.
    Wang, L.
    Du, J. Y.
    IEEE ACCESS, 2019, 7 : 28281 - 28290
  • [36] RSMOTE: improving classification performance over imbalanced medical datasets
    Mehdi Naseriparsa
    Ahmed Al-Shammari
    Ming Sheng
    Yong Zhang
    Rui Zhou
    Health Information Science and Systems, 8
  • [37] Preprocessing compensation techniques for improved classification of imbalanced medical datasets
    Wosiak, Agnieszka
    Karbowiak, Sylwia
    PROCEEDINGS OF THE 2017 FEDERATED CONFERENCE ON COMPUTER SCIENCE AND INFORMATION SYSTEMS (FEDCSIS), 2017, : 203 - 211
  • [38] Effects of the Use of Boosting on Classification Performance of Imbalanced Bioinformatics Datasets
    Khoshgoftaar, Taghi M.
    Fazelpour, Alireza
    Dittman, David J.
    Napolitano, Amri
    2014 IEEE INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOENGINEERING (BIBE), 2014, : 420 - 426
  • [39] Improving SVM Classification on Imbalanced Datasets by Introducing a New Bias
    Nunez, Haydemar
    Gonzalez-Abril, Luis
    Angulo, Cecilio
    JOURNAL OF CLASSIFICATION, 2017, 34 (03) : 427 - 443
  • [40] GradMix for Nuclei Segmentation and Classification in Imbalanced Pathology Image Datasets
    Doan, Tan Nhu Nhat
    Kim, Kyungeun
    Song, Boram
    Kwak, Jin Tae
    MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION, MICCAI 2022, PT II, 2022, 13432 : 171 - 180