ANT Corpus: An Arabic News Text Collection for Textual Classification

Cited by: 23
Authors
Chouigui, Amina [2]
Ben Khiroun, Oussama [1, 2]
Elayeb, Bilel [1, 3]
Affiliations
[1] Manouba Univ, RIADI Res Lab, ENSI, Manouba 2010, Tunisia
[2] Sousse Univ, Natl Engn Sch Sousse, ENISO, Sousse 4002, Tunisia
[3] Emirates Coll Technol, POB 41009, Abu Dhabi, U Arab Emirates
Keywords
Arabic language; standard Arabic corpus; text classification; RSS crawling; TREC format; SVM; NB; AGREEMENT; KAPPA
DOI
10.1109/AICCSA.2017.22
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Subject classification code
0812
Abstract
In this paper, we propose a new online Arabic corpus of news articles, named ANT Corpus, collected from RSS feeds. Each document represents an article structured in the standard XML TREC format. We use the ANT Corpus for Text Classification (TC), applying SVM and Naive Bayes (NB) classifiers to assign each article its predefined category. We also study the contribution of term weighting, stop-word removal, and light stemming to Arabic TC. The experimental results show that text length considerably affects TC accuracy and that title words alone are not sufficiently significant to achieve good classification rates. In conclusion, the SVM method gives the best classification results for both the title and text parts.
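The preprocessing and classification steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it covers only the Naive Bayes side (not SVM), uses raw term counts rather than any particular weighting scheme, and everything in it is an assumption made for illustration, including the three-word stop list, the toy `light_stem` rule that strips only the definite article, the four training sentences, and the `economy`/`sport` labels.

```python
import math
from collections import Counter, defaultdict

# Hypothetical three-word stop list; a real Arabic list is far longer.
STOP_WORDS = {"في", "من", "على"}


def light_stem(token):
    """Toy light stemmer (assumption): strip only the definite article."""
    if token.startswith("ال") and len(token) > 4:
        return token[2:]
    return token


def preprocess(text):
    """Whitespace-tokenize, drop stop words, then light-stem each token."""
    return [light_stem(t) for t in text.split() if t not in STOP_WORDS]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_docs = Counter(labels)        # documents per class
        self.term_counts = defaultdict(Counter)  # term counts per class
        for doc, label in zip(docs, labels):
            self.term_counts[label].update(preprocess(doc))
        self.vocab = {t for c in self.term_counts.values() for t in c}
        return self

    def predict(self, doc):
        terms = preprocess(doc)
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            counts = self.term_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # log prior + sum of smoothed log likelihoods
            score = math.log(n_docs / total_docs)
            for term in terms:
                score += math.log((counts[term] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy data. Rough meanings, in order: "prices rise in the market",
# "the bank raises interest", "the team wins the match",
# "the player scores a goal".
train_docs = [
    "ارتفاع الأسعار في السوق",
    "البنك يرفع الفائدة",
    "الفريق يفوز في المباراة",
    "اللاعب يسجل هدف",
]
train_labels = ["economy", "economy", "sport", "sport"]

model = NaiveBayes().fit(train_docs, train_labels)
prediction = model.predict("فريق يفوز")  # roughly "a team wins"
print(prediction)
```

Because stemming maps "الفريق" and "فريق" to the same term, the query shares two terms with the sport class and none with the economy class, which is the kind of effect the stemming and stop-word experiments in the abstract are concerned with.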
Pages: 135-142
Number of pages: 8