Stemming versus light stemming as feature selection techniques for Arabic text categorization

Authors
Duwairi, Rehab
Al-Refai, Mohammad
Khasawneh, Natheer
Keywords
text categorization; feature selection; Arabic language; stemming; light stemming; K-nearest neighbors classifier
DOI
Not available
Chinese Library Classification number
T [Industrial Technology]
Discipline classification code
08
Abstract
This paper compares two feature selection techniques applied to an Arabic corpus: stemming and light stemming. With stemming, words are reduced to their stems; with light stemming, words are reduced to their light stems. Stemming is aggressive in the sense that it reduces words to their three-letter roots. This affects the semantics, as several words with different meanings might share the same root. Light stemming, by comparison, removes frequently used prefixes and suffixes from Arabic words. It does not produce the root and therefore does not affect the semantics of words; it maps several words that have the same meaning to a common syntactical form. The effectiveness of these two feature selection techniques was assessed in a text categorization exercise on an Arabic corpus of 15,000 documents that fall into three categories. The K-nearest neighbors (KNN) classifier was used in this work. Several experiments were carried out using two representations of the same corpus: the first uses stem vectors and the second uses light-stem vectors as representatives of documents. These two representations were assessed in terms of size, time, and accuracy. The light-stem representation yielded higher classifier accuracy than the stem representation.
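The light-stemming and KNN steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the affix lists `PREFIXES` and `SUFFIXES`, the `min_len` threshold, the use of cosine similarity over term counts, and the category labels in the usage note are all assumptions made for illustration, since the abstract does not specify them.

```python
import math
from collections import Counter

# Hypothetical affix lists for illustration only; the abstract does not
# list the prefixes and suffixes the paper actually removes.
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "لل", "و"]
SUFFIXES = ["ها", "ان", "ات", "ون", "ين", "يه", "ية", "ه", "ة", "ي"]


def light_stem(word, min_len=3):
    """Strip at most one prefix and one suffix, keeping at least min_len letters."""
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word


def vectorize(text):
    """Represent a document as a bag of light stems (term counts)."""
    return Counter(light_stem(token) for token in text.split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (an assumed metric)."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(train, text, k=3):
    """Return the majority label among the k training vectors most similar to text."""
    query = vectorize(text)
    ranked = sorted(train, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

As a usage example with made-up labels, `train = [(vectorize("الكتاب والمكتبة"), "A"), (vectorize("الملعب والكرة"), "B")]` followed by `knn_predict(train, "كتاب", k=1)` selects label `"A"`, because the definite form "الكتاب" and the bare form "كتاب" map to the same light stem. A root-based stemmer would go further and collapse both to a three-letter root, which is the aggressive behavior the abstract contrasts with light stemming.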
Pages: 199-203
Number of pages: 5