Stemming versus light stemming as feature selection techniques for Arabic text categorization

Cited by: 0
Authors
Duwairi, Rehab
Al-Refai, Mohammad
Khasawneh, Natheer
Keywords
text categorization; feature selection; Arabic language; stemming; light-stemming; K-nearest neighbors classifier
DOI
Not available
Chinese Library Classification (CLC) number
T [Industrial Technology]
Discipline classification code
08
Abstract
This paper compares two feature selection techniques applied to an Arabic corpus: stemming and light stemming. With stemming, words are reduced to their stems; with light stemming, words are reduced to their light stems. Stemming is aggressive in the sense that it reduces words to their three-letter roots. This affects semantics, because several words with different meanings may share the same root. Light stemming, by comparison, removes frequently used prefixes and suffixes from Arabic words. It does not produce the root and therefore does not affect the semantics of words; it maps several words that have the same meaning to a common syntactic form. The effectiveness of these two feature selection techniques was assessed in a text categorization exercise on an Arabic corpus of 15,000 documents that fall into three categories. The K-nearest neighbors (KNN) classifier was used in this work. Several experiments were carried out using two representations of the same corpus: the first uses stem vectors and the second uses light-stem vectors to represent documents. The two representations were assessed in terms of size, time, and accuracy. The light-stem representation gave higher classifier accuracy than the stem representation.
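The light-stemming idea described in the abstract (stripping frequent prefixes and suffixes without reducing a word to its root) can be illustrated with a minimal sketch. The affix lists, the `min_len` threshold, and the function name `light_stem` are assumptions made for illustration only; the abstract does not state which affixes or rules the authors used.

```python
# Hypothetical affix lists, chosen for illustration only; the abstract
# does not specify the affixes the authors actually remove.
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "لل", "و"]
SUFFIXES = ["ها", "ان", "ات", "ون", "ين", "ية", "ه", "ة", "ي"]


def light_stem(word, min_len=3):
    """Strip at most one prefix and one suffix, longest match first.

    An affix is removed only if at least `min_len` letters remain,
    so short words are left untouched. No root extraction is attempted.
    """
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word


# Surface forms of "book" collapse to one light stem.
book_forms = ["الكتاب", "والكتاب", "كتابها", "كتاب"]
book_stems = {light_stem(w) for w in book_forms}

# "library" keeps its own light stem, although it shares a root with "book".
library_stem = light_stem("مكتبة")
```

In this sketch the four forms of "book" map to a single feature, while "library" stays distinct. A root-based stemmer would typically merge both under the shared three-letter root, which is the loss of semantic distinction the abstract attributes to stemming.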
Pages: 199-203
Number of pages: 5