ANT Corpus : An Arabic News Text Collection for Textual Classification

Cited by: 23
Authors
Chouigui, Amina [2]
Ben Khiroun, Oussama [1,2]
Elayeb, Bilel [1,3]
Affiliations
[1] Manouba Univ, RIADI Res Lab, ENSI, Manouba 2010, Tunisia
[2] Sousse Univ, Natl Engn Sch Sousse, ENISO, Sousse 4002, Tunisia
[3] Emirates Coll Technol, POB 41009, Abu Dhabi, U Arab Emirates
Keywords
Arabic language; standard Arabic corpus; text classification; RSS crawling; TREC format; SVM; NB; AGREEMENT; KAPPA;
DOI
10.1109/AICCSA.2017.22
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812 ;
Abstract
In this paper, we propose a new online Arabic corpus of news articles, named ANT Corpus, collected from RSS feeds. Each document represents an article structured in the standard XML TREC format. We use the ANT Corpus for Text Classification (TC), applying SVM and Naive Bayes (NB) classifiers to assign each article its predefined category. We also study the contribution of term weighting, stop-word removal, and light stemming to Arabic TC. The experimental results show that text length considerably affects TC accuracy and that title words alone are not sufficiently informative to achieve good classification rates. In conclusion, the SVM method gives the best classification results for both the title and text parts.
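The pipeline summarized in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: the tag names (`DOC`, `DOCNO`, `TITLE`, `TEXT`), the tiny stop-word list, the toy training sentences, and the category labels are all assumptions made for illustration, and light stemming, term weighting, and the SVM classifier are omitted. It parses one TREC-style record and classifies a short text with multinomial Naive Bayes using add-one smoothing.

```python
import math
import re
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

# Hypothetical TREC-style record; the tag names are assumed, not taken from ANT Corpus.
SAMPLE_DOC = """<DOC>
<DOCNO>demo-001</DOCNO>
<TITLE>فوز المنتخب في المباراة</TITLE>
<TEXT>فاز المنتخب في المباراة النهائية</TEXT>
</DOC>"""

# Tiny illustrative stop-word list (assumption); a real list would be much longer.
STOP_WORDS = {"في", "من", "على"}


def parse_trec_doc(xml_text):
    """Return a dict mapping lower-cased child tag names to their text."""
    root = ET.fromstring(xml_text)
    return {child.tag.lower(): (child.text or "").strip() for child in root}


def tokenize(text):
    """Split on word characters and drop stop words (no stemming here)."""
    return [tok for tok in re.findall(r"\w+", text) if tok not in STOP_WORDS]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(tokenize(doc))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for tok in tokenize(doc):
                if tok in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data with made-up category labels.
train = [
    ("فاز المنتخب في المباراة", "sport"),
    ("سجل اللاعب هدفا في المباراة", "sport"),
    ("ارتفع سعر النفط في السوق", "economy"),
    ("تراجع سعر الدولار في البنك", "economy"),
]
model = NaiveBayes().fit([text for text, _ in train], [label for _, label in train])

record = parse_trec_doc(SAMPLE_DOC)
predicted = model.predict(record["title"] + " " + record["text"])
```

Classifying `record["title"]` alone versus the title plus the body is one way to probe the abstract's observation that titles carry less signal than full texts, although a toy example of this size says nothing about real accuracy.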
Pages: 135-142
Number of pages: 8
Related papers
50 in total
  • [1] An Arabic Corpus of Fake News: Collection, Analysis and Classification
    Alkhair, Maysoon
    Meftouh, Karima
    Smaili, Kamel
    Othman, Nouha
    ARABIC LANGUAGE PROCESSING: FROM THEORY TO PRACTICE, ICALP 2019, 2019, 1108 : 292 - 302
  • [2] Spanish news text collection + The 'Spanish News Corpus'
    Raschio, RA
    HISPANIA-A JOURNAL DEVOTED TO THE TEACHING OF SPANISH AND PORTUGUESE, 1996, 79 (03) : 502 - 503
  • [3] Arabic Text Classification of News Articles Using Classical Supervised Classifiers
    Al Qadi, Leen
    El Rifai, Hozayfa
    Obaid, Safa
    Elnagar, Ashraf
    2019 2ND INTERNATIONAL CONFERENCE ON NEW TRENDS IN COMPUTING SCIENCES (ICTCS), 2019, : 238 - 243
  • [4] SMAD: Text Classification of Arabic Social Media Dataset for News Sources
    Gaber, Amira M.
    El-din, Mohamed Nour
    Moussa, Hanan
    INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2021, 12 (10) : 508 - 516
  • [5] SMAD: Text Classification of Arabic Social Media Dataset for News Sources
    Gaber, Amira M.
    El-din, Mohamed Nour
    Moussa, Hanan
    International Journal of Advanced Computer Science and Applications, 2021, 12 (10) : 508 - 516
  • [6] Convolutional Deep Belief Network Based Short Text Classification on Arabic Corpus
    Motwakel A.
    Al-Onazi B.B.
    Alzahrani J.S.
    Marzouk R.
    Aziz A.S.A.
    Zamani A.S.
    Yaseen I.
    Abdelmageed A.A.
    Computer Systems Science and Engineering, 2023, 45 (03) : 3097 - 3113
  • [7] Building semantically annotated corpus for text classification of Indian defence news articles
    Kanekar S.A.
    Sharma A.
    Patkar G.S.
    Tilve A.K.S.
    International Journal of Information Technology, 2021, 13 (4) : 1539 - 1544
  • [8] Classification of Cyberbullying Text in Arabic
    Rachid, Benaissa Azzeddine
    Azza, Harbaoui
    Ben Ghezala, Hajjami Henda
    2020 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2020,
  • [9] Arabic Fake News Detection Based on Textual Analysis
    Hanen Himdi
    George Weir
    Fatmah Assiri
    Hassanin Al-Barhamtoshy
    Arabian Journal for Science and Engineering, 2022, 47 : 10453 - 10469
  • [10] Arabic Fake News Detection Based on Textual Analysis
    Himdi, Hanen
    Weir, George
    Assiri, Fatmah
    Al-Barhamtoshy, Hassanin
    ARABIAN JOURNAL FOR SCIENCE AND ENGINEERING, 2022, 47 (08) : 10453 - 10469