Investigating the Relevance of Arabic Text Classification Datasets Based on Supervised Learning

Cited by: 2
Authors
Ababneh A.H. [1]
Affiliations
[1] Computer Science Department, American University of Madaba, Madaba
Keywords
K-nearest neighbor (KNN); Logistic regression (LR); Naive Bayes (NB); Random forest (RF); Support vector machine (SVM); Text classification (TC)
DOI
10.1016/j.jnlest.2022.100160
Abstract
Training and testing models for text classification depend mainly on pre-classified text document datasets. Seven datasets have recently emerged for Arabic text classification: Single-Label Arabic News Articles Dataset (SANAD), Khaleej, Arabiya, Akhbarona, KALIMAT, Waten2004, and Khaleej2004. This study investigates which of these datasets can provide significant training and fair evaluation for text classification (TC). The investigation uses well-known and accurate learning models: naive Bayes (NB), random forest (RF), K-nearest neighbor (KNN), support vector machine (SVM), and logistic regression (LR). We present relevance and time measures for training the models on these datasets so that Arabic language researchers can select an appropriate dataset on a solid basis of comparison. The performances of the five learning models across the seven datasets are measured and compared with the performances of the same models trained on a well-known English language dataset. The analysis of the relevance and time scores shows that training the SVM model on Khaleej and Arabiya obtained the most significant results in the shortest amount of time, with an accuracy of 82%. © 2022, Journal of Electronic Science and Technology. All Rights Reserved.
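The comparison described in the abstract (train a model on a dataset, then record a relevance score and the training time) can be sketched roughly as follows. This is a minimal illustration only, not the paper's code: it uses a small pure-Python multinomial naive Bayes, one of the five model families named above, on a made-up English toy corpus rather than any of the seven Arabic datasets. All function names and data here are hypothetical, and plain accuracy stands in for the relevance measures.

```python
import math
import time
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Count documents per class and words per class (multinomial naive Bayes)."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = text.split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_nb(model, text):
    """Return the class with the highest log-posterior, using add-one smoothing."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in text.split():
            if token in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def evaluate(train, test):
    """Return (accuracy on test, seconds spent training) for one dataset split."""
    train_docs, train_labels = zip(*train)
    start = time.perf_counter()
    model = train_nb(train_docs, train_labels)
    train_seconds = time.perf_counter() - start
    correct = sum(predict_nb(model, text) == label for text, label in test)
    return correct / len(test), train_seconds


# Hypothetical toy corpus; a real run would load one of the datasets instead.
TRAIN = [
    ("match goal team league", "sports"),
    ("team player goal coach", "sports"),
    ("market stock bank profit", "finance"),
    ("bank loan market price", "finance"),
]
TEST = [
    ("goal coach league", "sports"),
    ("stock price loan", "finance"),
]

accuracy, train_seconds = evaluate(TRAIN, TEST)
print(f"accuracy={accuracy:.2f} train_time={train_seconds:.6f}s")
```

Repeating `evaluate` for each model and each dataset would give a grid of score and time pairs of the kind the abstract compares; only the training call is timed here, which is one possible choice among several.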
Pages: 187-208
Number of pages: 21