Stemming versus light stemming as feature selection techniques for Arabic text categorization

Cited by: 0
Authors
Duwairi, Rehab
Al-Refai, Mohammad
Khasawneh, Natheer
Keywords
text categorization; feature selection; Arabic language; stemming; light stemming; K-nearest neighbors classifier
DOI
Not available
Chinese Library Classification number
T [Industrial Technology]
Discipline classification code
08
Abstract
This paper compares two feature selection techniques applied to an Arabic corpus: stemming and light stemming. Stemming reduces words to their stems, whereas light stemming reduces them to their light stems. Stemming is aggressive in the sense that it reduces words to their three-letter roots, which affects semantics because several words with different meanings may share the same root. Light stemming, by comparison, removes only frequently used prefixes and suffixes from Arabic words. It does not produce the root and therefore does not affect the semantics of words; instead, it maps several words that have the same meaning to a common syntactic form. The effectiveness of these two feature selection techniques was assessed in a text categorization exercise on an Arabic corpus of 15,000 documents that fall into three categories, using the K-nearest neighbors (KNN) classifier. Several experiments were carried out on two representations of the same corpus: the first uses stem vectors and the second uses light-stem vectors to represent documents. The two representations were assessed in terms of size, time, and accuracy. The light-stem representation yielded higher classifier accuracy than the stem representation.
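The light-stemming idea described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the affix lists, the minimum stem length, and the function names `light_stem`, `stem_vector`, `cosine`, and `knn_predict` are all assumptions introduced here, and the toy documents are invented for the example. The sketch strips at most one common prefix and one common suffix per word, builds a bag-of-light-stems vector per document, and classifies by a cosine-similarity KNN vote.

```python
import math
from collections import Counter

# Illustrative affix lists (an assumption; the paper's own lists are not given here).
PREFIXES = ("وال", "بال", "كال", "فال", "ال", "لل", "و")
SUFFIXES = ("ها", "ان", "ات", "ون", "ين", "يه", "ية", "ه", "ة", "ي")
MIN_STEM_LEN = 3  # hypothetical guard so short words are not over-stripped


def light_stem(word: str) -> str:
    """Strip at most one common prefix and one common suffix from a word."""
    # Longer affixes are tried first so that "وال" wins over "و".
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM_LEN:
            word = word[len(prefix):]
            break
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            word = word[:-len(suffix)]
            break
    return word


def stem_vector(text: str) -> Counter:
    """Represent a document as counts of its light stems."""
    return Counter(light_stem(token) for token in text.split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(train, text: str, k: int = 3) -> str:
    """Return the majority label among the k training documents most similar to text."""
    query = stem_vector(text)
    ranked = sorted(train, key=lambda item: cosine(stem_vector(item[0]), query),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy training set, invented purely for illustration.
TRAIN = [
    ("المعلمون في المدرسة", "education"),
    ("الطلاب والمعلمات", "education"),
    ("اللاعبون في الملعب", "sports"),
    ("مباراة الفريق", "sports"),
]

if __name__ == "__main__":
    for w in ("المعلمون", "والمعلمات", "معلم"):
        print(w, "->", light_stem(w))
    print(knn_predict(TRAIN, "معلم المدرسة", k=3))
```

With these assumed lists, the inflected forms "المعلمون" and "والمعلمات" both reduce to the light stem "معلم", so documents that use different surface forms of the same word share a feature. A root-based stemmer would go further and merge words with different meanings that share a root, which is the semantic loss the abstract attributes to stemming.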
Pages: 199-203
Number of pages: 5