Automated Arabic Text Classification With P-Stemmer, Machine Learning, and a Tailored News Article Taxonomy

Cited by: 27
Authors
Kanan, Tarek [1]
Fox, Edward A. [2]
Affiliations
[1] Al Zaytoonah Univ Jordan, Fac Sci & Informat Technol, Dept Software Engn, Amman, Jordan
[2] Virginia Polytech Inst & State Univ, Virginia Tech, Coll Engn, McBryde Hall Room 114 0106, Blacksburg, VA 24061 USA
Keywords
digital libraries; information retrieval; natural language processing
DOI
10.1002/asi.23609
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Arabic news articles in electronic collections are difficult to study. Browsing by category is rarely supported. Although helpful machine-learning methods have been applied successfully to similar situations for English news articles, limited research has been completed to yield suitable solutions for Arabic news. In connection with a Qatar National Research Fund (QNRF)-funded project to build digital library community and infrastructure in Qatar, we developed software for browsing a collection of about 237,000 Arabic news articles, which should be applicable to other Arabic news collections. We designed a simple taxonomy for Arabic news stories that is suitable for the needs of Qatar and other nations, is compatible with the subject codes of the International Press Telecommunications Council, and was enhanced with the aid of a librarian expert as well as five Arabic-speaking volunteers. We developed tailored stemming (i.e., a new Arabic light stemmer called P-Stemmer) and automatic classification methods (the best being binary Support Vector Machines classifiers) to work with the taxonomy. Using evaluation techniques commonly used in the information retrieval community, including 10-fold cross-validation and the Wilcoxon signed-rank test, we showed that our approach to stemming and classification is superior to state-of-the-art techniques.
Pages: 2667-2683
Page count: 17