ANT Corpus: An Arabic News Text Collection for Textual Classification

Cited by: 23
Authors
Chouigui, Amina [2]
Ben Khiroun, Oussama [1,2]
Elayeb, Bilel [1,3]
Affiliations
[1] Manouba Univ, RIADI Res Lab, ENSI, Manouba 2010, Tunisia
[2] Sousse Univ, Natl Engn Sch Sousse, ENISO, Sousse 4002, Tunisia
[3] Emirates Coll Technol, POB 41009, Abu Dhabi, U Arab Emirates
Keywords
Arabic language; standard Arabic corpus; text classification; RSS crawling; TREC format; SVM; NB; AGREEMENT; KAPPA
DOI
10.1109/AICCSA.2017.22
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
In this paper we propose a new online Arabic corpus of news articles, named ANT Corpus, collected from RSS feeds. Each document represents one article structured in the standard XML TREC format. We use the ANT Corpus for Text Classification (TC), applying the SVM and Naive Bayes (NB) classifiers to assign each article to its predefined category. We also study the contribution of term weighting, stop-word removal and light stemming to Arabic TC. The experimental results show that text length considerably affects TC accuracy and that title words alone are not sufficiently informative to achieve good classification rates. In conclusion, the SVM method gives the best classification results for both the title and the text parts.
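As a rough illustration of the kind of pipeline the abstract describes, the sketch below parses one TREC-style XML record, removes stop words, and classifies the text with a small multinomial Naive Bayes model written from scratch. It is not the authors' implementation: the tag names (DOCNO, TITLE, CATEGORY, TEXT), the sample record, the stop-word list and the toy training set are all assumptions made for the example, and SVM, term weighting and light stemming are left out.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

# Hypothetical TREC-style record; tag names are assumed, not the corpus schema.
SAMPLE_DOC = """<DOC>
<DOCNO>doc-001</DOCNO>
<TITLE>الفريق يفوز في المباراة</TITLE>
<CATEGORY>sport</CATEGORY>
<TEXT>فاز الفريق في المباراة النهائية</TEXT>
</DOC>"""

# Tiny illustrative stop-word list (not the list used in the paper).
STOP_WORDS = {"في", "من", "على", "إلى"}

# Toy training set (invented): pre-tokenized documents with a category label.
TRAIN = [
    (["الفريق", "المباراة", "الهدف"], "sport"),
    (["اللاعب", "المباراة", "الفوز"], "sport"),
    (["البنك", "الاقتصاد", "السوق"], "economy"),
    (["السوق", "الأسعار", "البنك"], "economy"),
]


def parse_trec(xml_text):
    """Return a dict mapping each lower-cased child tag of <DOC> to its text."""
    root = ET.fromstring(xml_text)
    return {child.tag.lower(): (child.text or "").strip() for child in root}


def tokenize(text):
    """Split on whitespace and drop stop words."""
    return [tok for tok in text.split() if tok not in STOP_WORDS]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in tokens:
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


record = parse_trec(SAMPLE_DOC)
model = NaiveBayes().fit([doc for doc, _ in TRAIN], [lab for _, lab in TRAIN])
predicted = model.predict(tokenize(record["text"]))
print(record["docno"], predicted)
```

Classifying `record["title"]` instead of `record["text"]` gives a simple way to compare title-only and full-text input, which is the comparison the abstract reports on.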
Pages: 135 - 142
Number of pages: 8
Related papers
50 in total
  • [41] A Deep Learning Approach for Arabic Text Classification
    Sundus, Katrina
    Al-Haj, Fatima
    Hammo, Bassam
    2019 2ND INTERNATIONAL CONFERENCE ON NEW TRENDS IN COMPUTING SCIENCES (ICTCS), 2019, : 258 - 264
  • [42] NADA: New Arabic Dataset for Text Classification
    Alalyani, Nada
    Marie-Sainte, Souad Larabi
    INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2018, 9 (09) : 206 - 212
  • [44] Evaluating Various Tokenizers for Arabic Text Classification
    Alyafeai, Zaid
    Al-shaibani, Maged S.
    Ghaleb, Mustafa
    Ahmad, Irfan
    NEURAL PROCESSING LETTERS, 2023, 55 (03) : 2911 - 2933
  • [45] Named entity recognition and classification for text in arabic
    Abuleil, S
    Evens, M
    INTELLIGENT AND ADAPTIVE SYSTEMS AND SOFTWARE ENGINEERING, 2004, : 89 - 94
  • [46] Arabic text classification based on analogical proportions
    Bounhas, Myriam
    Elayeb, Bilel
    Chouigui, Amina
    Hussain, Amir
    Cambria, Erik
    EXPERT SYSTEMS, 2024, 41 (10)
  • [47] Crime Type Document Classification from Arabic Corpus
    Alruily, Meshrif
    Ayesh, Aladdin
    Zedan, Hussein
    2009 SECOND INTERNATIONAL CONFERENCE ON DEVELOPMENTS IN ESYSTEMS ENGINEERING (DESE 2009), 2009, : 153 - 159
  • [48] An Enhanced Twitter Corpus for the Classification of Arabic Speech Acts
    Ahed, Majdi
    Hammo, Bassam H.
    Abushariah, Mohammad A. M.
    INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2020, 11 (03) : 207 - 215
  • [49] Arabic-English Corpus for Cross-Language Textual Similarity Detection
    Aljuaid, Hanan
    INFORMATION SCIENCE AND APPLICATIONS, 2020, 621 : 527 - 536
  • [50] Enhanced Arabic information retrieval system based on Arabic text classification
    Ghwanmeh, Sameh
    Kanaan, Ghassan
    Al-Shalabi, Riyad
    Ababneh, Ahmad
    2007 INNOVATIONS IN INFORMATION TECHNOLOGIES, VOLS 1 AND 2, 2007, : 527 - +