Compensation strategy of unseen feature words in naive Bayes text classification

Cited by: 0
Authors
School of Management, Harbin Institute of Technology, Harbin 150001, China [1]
Unknown [2]
Institutions
Source
Harbin Gongye Daxue Xuebao | 2008 / Vol. 6 / pp. 956-960
Keywords
Compensation strategy - Data smoothing - Feature words - Maximum entropy modeling - Naive Bayes classification - Smoothing algorithms - Statistical language modeling - Text classification;
DOI
Not available
CLC number
Subject classification number
Abstract
When applied to text classification, naive Bayes always suffers from the unseen feature word problem. Moreover, this problem can hardly be solved by expanding the corpora, because the corpora themselves suffer from data sparsity: the distribution of words follows Zipf's law. Inspired by statistical language modeling, a novel approach is proposed that applies smoothing algorithms to naive Bayes for the text classification task to overcome the unseen feature word problem. The experimental corpora come from the National 863 Evaluation on text classification. In the open test with stop words removed, naive Bayes with the Good-Turing algorithm performs 3.05% better than with the Laplace algorithm and 1.00% better than with the Lidstone algorithm. In the experiment using cross entropy to extract feature words, naive Bayes with the Good-Turing algorithm even outperforms the Maximum Entropy model by 1.95%. Smoothing algorithms thus help solve the unseen feature word problem caused by sparse data.
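To make the mechanism concrete, the following is a minimal sketch (not the paper's implementation, and using Lidstone/Laplace rather than the paper's Good-Turing estimator) of how smoothing lets a naive Bayes text classifier assign a small non-zero probability to feature words never seen in a class's training data, instead of zeroing out the whole class score. All function and variable names here are illustrative assumptions.

```python
# Sketch: naive Bayes with Lidstone smoothing (alpha=1 gives Laplace).
# Unseen words receive probability alpha / (total + alpha * |V|) instead of 0.
import math
from collections import Counter, defaultdict

def train(docs, alpha=0.5):
    """docs: list of (token_list, label). Returns log-priors and smoothed
    per-class log-likelihood tables."""
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    V, n = len(vocab), len(docs)
    log_prior = {c: math.log(class_counts[c] / n) for c in class_counts}
    log_like = {}
    for c in class_counts:
        total = sum(word_counts[c].values())
        denom = total + alpha * V
        log_like[c] = {w: math.log((word_counts[c][w] + alpha) / denom)
                       for w in vocab}
        # fallback probability for words unseen in class c's training data
        # (a real system would reserve this key; '<UNSEEN>' is a sketch choice)
        log_like[c]["<UNSEEN>"] = math.log(alpha / denom)
    return log_prior, log_like

def classify(tokens, log_prior, log_like):
    """Pick the class maximizing log P(c) + sum of smoothed log P(w|c)."""
    scores = {}
    for c in log_prior:
        s = log_prior[c]
        for w in tokens:
            s += log_like[c].get(w, log_like[c]["<UNSEEN>"])
        scores[c] = s
    return max(scores, key=scores.get)
```

With alpha = 1 this reduces to Laplace smoothing; the paper's point is that a better-calibrated estimator such as Good-Turing, which reallocates mass based on the counts of rare events, handles the Zipf-distributed tail more accurately than a flat additive constant.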
Related papers
50 in total
  • [21] Techniques for improving the performance of naive Bayes for text classification
    Schneider, KM
    COMPUTATIONAL LINGUISTICS AND INTELLIGENT TEXT PROCESSING, 2005, 3406 : 682 - 693
  • [22] Two feature weighting approaches for naive Bayes text classifiers
    Zhang, Lungan
    Jiang, Liangxiao
    Li, Chaoqun
    Kong, Ganggang
    KNOWLEDGE-BASED SYSTEMS, 2016, 100 : 137 - 144
  • [23] A New Feature Selection Approach to Naive Bayes Text Classifiers
    Zhang, Lungan
    Jiang, Liangxiao
    Li, Chaoqun
    INTERNATIONAL JOURNAL OF PATTERN RECOGNITION AND ARTIFICIAL INTELLIGENCE, 2016, 30 (02)
  • [24] Naive Bayes text categorization using improved feature selection
    Lin, Kunhui
    Kang, Kai
    Huang, Yunping
    Zhou, Changle
    Wang, Beizhan
    Journal of Computational Information Systems, 2007, 3 (03): : 1159 - 1164
  • [25] Toward Optimal Feature Selection in Naive Bayes for Text Categorization
    Tang, Bo
    Kay, Steven
    He, Haibo
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2016, 28 (09) : 2508 - 2521
  • [26] Integrating incremental feature weighting into Naive Bayes text classifier
    Kim, Han Joon
    Chang, Jaeyoung
    PROCEEDINGS OF 2007 INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND CYBERNETICS, VOLS 1-7, 2007, : 1137 - 1143
  • [27] Improved feature size customized fast correlation-based filter for Naive Bayes text classification
    Zhang, Yun
    Zhang, Yude
    He, Wei
    Yu, Shujuan
    Zhao, Shengmei
    JOURNAL OF INTELLIGENT & FUZZY SYSTEMS, 2020, 38 (03) : 3117 - 3127
  • [28] A Scalable Text Classification Using Naive Bayes with Hadoop Framework
    Temesgen, Mulualem Mheretu
    Lemma, Dereje Teferi
    INFORMATION AND COMMUNICATION TECHNOLOGY FOR DEVELOPMENT FOR AFRICA (ICT4DA 2019), 2019, 1026 : 291 - 300
  • [29] Topic document model approach for naive Bayes text classification
    Kim, SB
    Rim, HC
    Kim, JD
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2005, E88D (05): : 1091 - 1094
  • [30] Improved Naive Bayes with optimal correlation factor for text classification
    Chen, Jiangning
    Dai, Zhibo
    Duan, Juntao
    Matzinger, Heinrich
    Popescu, Ionel
    SN APPLIED SCIENCES, 2019, 1 (09):