Compensation strategy of unseen feature words in naive Bayes text classification

Cited by: 0
Authors
School of Management, Harbin Institute of Technology, Harbin 150001, China [1]
Unknown [2]
Institutions
Source
Harbin Gongye Daxue Xuebao | 2008 / Vol. 6 / pp. 956-960
Keywords
Compensation strategy - Data smoothing - Feature words - Maximum entropy modeling - Naive Bayes classification - Smoothing algorithms - Statistical language modeling - Text classification;
DOI
Not available
CLC number
Subject classification number
Abstract
When applied to text classification, naive Bayes always suffers from the unseen feature word problem. Moreover, this problem can hardly be solved by expanding the corpora, because the corpora themselves suffer from data sparsity: the distribution of words follows Zipf's law. Inspired by statistical language modeling, a novel approach is proposed that applies smoothing algorithms to naive Bayes for the text classification task to overcome the unseen feature word problem. The experimental corpora come from the National 863 Evaluation on text classification. In the open test with stop words removed, naive Bayes with the Good-Turing algorithm performs 3.05% better than with the Laplace algorithm and 1.00% better than with the Lidstone algorithm. In the experiment using cross entropy to extract feature words, naive Bayes with the Good-Turing algorithm even outperforms the Maximum Entropy model by 1.95%. Smoothing algorithms thus help solve the unseen feature word problem caused by sparse data.
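To make the mechanism concrete, the following is a minimal sketch (not the paper's implementation, and using Lidstone/Laplace rather than the paper's Good-Turing estimator) of how smoothing lets a naive Bayes text classifier assign a small non-zero probability to feature words never seen in a class's training data, instead of zeroing out the whole class score. All function and variable names here are illustrative assumptions.

```python
# Sketch: naive Bayes with Lidstone smoothing (alpha=1 gives Laplace).
# Unseen words receive probability alpha / (total + alpha * |V|) instead of 0.
import math
from collections import Counter, defaultdict

def train(docs, alpha=0.5):
    """docs: list of (token_list, label). Returns log-priors and smoothed
    per-class log-likelihood tables."""
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    V, n = len(vocab), len(docs)
    log_prior = {c: math.log(class_counts[c] / n) for c in class_counts}
    log_like = {}
    for c in class_counts:
        total = sum(word_counts[c].values())
        denom = total + alpha * V
        log_like[c] = {w: math.log((word_counts[c][w] + alpha) / denom)
                       for w in vocab}
        # fallback probability for words unseen in class c's training data
        # (a real system would reserve this key; '<UNSEEN>' is a sketch choice)
        log_like[c]["<UNSEEN>"] = math.log(alpha / denom)
    return log_prior, log_like

def classify(tokens, log_prior, log_like):
    """Pick the class maximizing log P(c) + sum of smoothed log P(w|c)."""
    scores = {}
    for c in log_prior:
        s = log_prior[c]
        for w in tokens:
            s += log_like[c].get(w, log_like[c]["<UNSEEN>"])
        scores[c] = s
    return max(scores, key=scores.get)
```

With alpha = 1 this reduces to Laplace smoothing; the paper's point is that a better-calibrated estimator such as Good-Turing, which reallocates mass based on the counts of rare events, handles the Zipf-distributed tail more accurately than a flat additive constant.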
Related papers
50 in total
  • [21] Techniques for improving the performance of naive Bayes for text classification
    Schneider, KM
    COMPUTATIONAL LINGUISTICS AND INTELLIGENT TEXT PROCESSING, 2005, 3406 : 682 - 693
  • [22] Two feature weighting approaches for naive Bayes text classifiers
    Zhang, Lungan
    Jiang, Liangxiao
    Li, Chaoqun
    Kong, Ganggang
    KNOWLEDGE-BASED SYSTEMS, 2016, 100 : 137 - 144
  • [23] A New Feature Selection Approach to Naive Bayes Text Classifiers
    Zhang, Lungan
    Jiang, Liangxiao
    Li, Chaoqun
    INTERNATIONAL JOURNAL OF PATTERN RECOGNITION AND ARTIFICIAL INTELLIGENCE, 2016, 30 (02)
  • [24] Naive Bayes text categorization using improved feature selection
    Lin, Kunhui
    Kang, Kai
    Huang, Yunping
    Zhou, Changle
    Wang, Beizhan
    Journal of Computational Information Systems, 2007, 3 (03): : 1159 - 1164
  • [25] Toward Optimal Feature Selection in Naive Bayes for Text Categorization
    Tang, Bo
    Kay, Steven
    He, Haibo
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2016, 28 (09) : 2508 - 2521
  • [26] Integrating incremental feature weighting into Naive Bayes text classifier
    Kim, Han Joon
    Chang, Jaeyoung
    PROCEEDINGS OF 2007 INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND CYBERNETICS, VOLS 1-7, 2007, : 1137 - 1143
  • [27] Improved feature size customized fast correlation-based filter for Naive Bayes text classification
    Zhang, Yun
    Zhang, Yude
    He, Wei
    Yu, Shujuan
    Zhao, Shengmei
    JOURNAL OF INTELLIGENT & FUZZY SYSTEMS, 2020, 38 (03) : 3117 - 3127
  • [28] A Scalable Text Classification Using Naive Bayes with Hadoop Framework
    Temesgen, Mulualem Mheretu
    Lemma, Dereje Teferi
    INFORMATION AND COMMUNICATION TECHNOLOGY FOR DEVELOPMENT FOR AFRICA (ICT4DA 2019), 2019, 1026 : 291 - 300
  • [29] Topic document model approach for naive Bayes text classification
    Kim, SB
    Rim, HC
    Kim, JD
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2005, E88D (05): : 1091 - 1094
  • [30] Improved Naive Bayes with optimal correlation factor for text classification
    Chen, Jiangning
    Dai, Zhibo
    Duan, Juntao
    Matzinger, Heinrich
    Popescu, Ionel
    SN APPLIED SCIENCES, 2019, 1 (09):