KerMinSVM for imbalanced datasets with a case study on Arabic comics classification

Cited by: 5
Authors
Nayal, Ammar [1]
Jomaa, Hadi [1]
Awad, Marlette [1]
Affiliations
[1] Amer Univ Beirut, Dept Elect & Comp Engn, Beirut, Lebanon
Funding
National Research Foundation, Singapore;
Keywords
Imbalance datasets; Support vector machines; Arabic comics analysis; Natural language processing; Supervised classification;
DOI
10.1016/j.engappai.2017.01.001
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
Many studies have classified large text documents using a range of classifiers, from simple distance-based classifiers such as K-Nearest-Neighbor (KNN) to more advanced ones such as Support Vector Machines (SVM). Traditional approaches fail on short texts because the limited number of words makes the feature representation sparse. Another common problem in text classification is class imbalance (CI), which occurs when one class contains most of the samples while the other contains only a few. Standard classifiers applied to imbalanced data yield high accuracy on the majority class and low accuracy on the minority class. Motivated by the need to classify the content of Arabic comics, we propose KerMinSVM, a kernel extension of our previously proposed MinSVM, coupled with a new feature dimensionality reduction scheme based on word root frequency ratios (WRFR). KerMinSVM was tested on multiple imbalanced benchmark datasets, and the results were verified using three measures: accuracy, F-measure, and statistical analysis. WRFR was applied to the manually constructed Arabic comic text dataset to detect strong content in children's comic books. Test results revealed that the proposed framework outperformed most of the compared methods for imbalanced datasets and short text classification.
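The abstract's point about class imbalance, and why F-measure is reported alongside accuracy, can be illustrated with a small sketch. This is not the paper's evaluation code: the 95/5 class split, the always-majority predictor, and the function names `accuracy` and `f_measure` are assumptions made up for illustration.

```python
# Toy illustration (assumed data): a classifier that always predicts the
# majority class looks accurate but never detects the minority class.

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f_measure(y_true, y_pred, positive=1):
    """F1 score (harmonic mean of precision and recall) for `positive`."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical imbalanced labels: 95 majority (0) and 5 minority (1) samples.
y_true = [0] * 95 + [1] * 5
# Hypothetical degenerate classifier: always predicts the majority class.
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)
f1_minority = f_measure(y_true, y_pred, positive=1)
print(f"accuracy={acc:.2f}, minority F-measure={f1_minority:.2f}")
```

On this toy data the overall accuracy is high even though no minority sample is recognized, which is why accuracy alone can be misleading on imbalanced datasets.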
Pages: 159-169
Number of pages: 11