Stemming versus light stemming as feature selection techniques for Arabic text categorization

Cited by: 0
Authors
Duwairi, Rehab
Al-Refai, Mohammad
Khasawneh, Natheer
Keywords
text categorization; feature selection; Arabic language; stemming; light-stemming; K-nearest neighbors classifier
DOI
Not available
Chinese Library Classification number
T [Industrial Technology]
Subject classification code
08
Abstract
This paper compares two feature selection techniques applied to an Arabic corpus: stemming and light stemming. With stemming, words are reduced to their stems; with light stemming, words are reduced to their light stems. Stemming is aggressive in the sense that it reduces words to their three-letter roots. This affects the semantics, as several words with different meanings might share the same root. Light stemming, by comparison, removes frequently used prefixes and suffixes from Arabic words. It does not produce the root and therefore does not affect the semantics of words; it maps several words that have the same meaning to a common syntactical form. The effectiveness of these two feature selection techniques was assessed in a text categorization exercise on an Arabic corpus of 15,000 documents that fall into three categories. The K-nearest neighbors (KNN) classifier was used in this work. Several experiments were carried out using two representations of the same corpus: the first uses stem vectors and the second uses light-stem vectors as representatives of documents. These two representations were assessed in terms of size, time, and accuracy. The light-stem representation was superior to the stem representation in terms of classifier accuracy.
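The light-stemming and KNN steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the affix lists, the `MIN_STEM_LEN` guard, the raw term-count vectors, the cosine similarity measure, and the four-sentence toy corpus are all assumptions chosen for illustration, and the function names are hypothetical. Root-based stemming is not sketched here, since it generally requires a morphological analyzer.

```python
from collections import Counter
from math import sqrt

# Illustrative affix lists (assumptions, not the lists used in the paper).
# Longest affixes are tried first so that "وال" wins over "و".
PREFIXES = sorted(["وال", "بال", "كال", "فال", "ال", "لل", "و"], key=len, reverse=True)
SUFFIXES = sorted(["ها", "ان", "ات", "ون", "ين", "يه", "ية", "ه", "ة", "ي"], key=len, reverse=True)
MIN_STEM_LEN = 3  # hypothetical guard against over-stripping short words


def light_stem(word):
    """Strip at most one common prefix and one common suffix; no root extraction."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM_LEN:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            word = word[:-len(suffix)]
            break
    return word


def vectorize(text):
    """Represent a document as a bag of light stems (raw counts, an assumption)."""
    return Counter(light_stem(token) for token in text.split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def knn_classify(query, training, k=3):
    """Label the query by majority vote among its k most similar training documents."""
    q = vectorize(query)
    scored = sorted(
        ((cosine(q, vectorize(doc)), label) for doc, label in training),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Toy training set, invented for this example only.
TRAINING = [
    ("فاز الفريق في المباراة", "sports"),
    ("اللاعبون في المباراة", "sports"),
    ("ارتفع سعر النفط", "economy"),
    ("البنك والاسعار", "economy"),
]

if __name__ == "__main__":
    print(light_stem("المعلمون"), light_stem("معلمين"))  # both reduce to معلم
    print(knn_classify("مباراة الفريق", TRAINING, k=3))
```

In this sketch, two inflected forms of the same word collapse to one light stem, so they land on the same vector dimension. That is the sense in which light stemming acts as feature reduction while leaving word meaning intact.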
Pages: 199-203
Number of pages: 5