A new feature selection algorithm based on binomial hypothesis testing for spam filtering

Cited by: 53
Authors
Yang, Jieming [1,2]
Liu, Yuanning [1]
Liu, Zhen [1,3]
Zhu, Xiaodong [1]
Zhang, Xiaoxu [1]
Affiliations
[1] Jilin Univ, Coll Comp Sci & Technol, Changchun 130023, Jilin, Peoples R China
[2] NE Dianli Univ, Coll Informat Engn, Changchun, Jilin, Peoples R China
[3] Nagasaki Inst Appl Sci, Grad Sch Engn, Nagasaki, Japan
Funding
National Natural Science Foundation of China
Keywords
Feature selection; Binomial hypothesis testing; Spam filtering; Text categorization; Binomial distribution; Statistical comparisons; Classifiers
DOI
10.1016/j.knosys.2011.04.006
CLC Classification
TP18 [Artificial Intelligence Theory]
Discipline Classification Code
081104; 0812; 0835; 1405
Abstract
Content-based spam filtering is a binary text categorization problem, so feature selection, an important and indispensable step in text categorization, also plays a key role in improving filtering performance. We propose a new method, named Bi-Test, which uses binomial hypothesis testing to estimate whether the probability that a feature belongs to the spam class satisfies a given threshold. We evaluated Bi-Test on six benchmark spam corpora (pu1, pu2, pu3, pua, lingspam and CSDMC2010) using two classification algorithms, Naive Bayes (NB) and Support Vector Machines (SVM), and compared it with four well-known feature selection algorithms (information gain, the χ²-statistic, the improved Gini index and the Poisson distribution). The experiments show that, in terms of the F1 measure, Bi-Test performs significantly better than the χ²-statistic and the Poisson distribution and is comparable to information gain and the improved Gini index when the Naive Bayes classifier is used; with the SVM classifier it achieves performance comparable to all four methods. Moreover, Bi-Test executes faster than the other four algorithms. (C) 2011 Elsevier B.V. All rights reserved.
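The abstract describes Bi-Test only at a high level: a binomial hypothesis test decides, per feature, whether its probability of belonging to the spam class exceeds a given threshold. The Python sketch below is one plausible reading of that idea, not the authors' published formulation; the function names, the default threshold p0 = 0.5, the one-sided alternative, and the use of SciPy's binom.sf are all assumptions made for illustration.

from scipy.stats import binom

def bi_test_pvalue(spam_hits, total_hits, p0=0.5):
    """One-sided binomial test of a term's spam bias.

    H0: the term occurs in spam with probability at most p0.
    Returns P(X >= spam_hits) for X ~ Binomial(total_hits, p0);
    a small p-value is evidence that the term is spam-indicative.
    (Hypothetical reconstruction; the paper's exact statistic may differ.)
    """
    # sf(k - 1, n, p) is the upper-tail probability P(X >= k).
    return binom.sf(spam_hits - 1, total_hits, p0)

def select_features(term_counts, k_best=500, p0=0.5):
    """Rank terms by p-value and keep the k_best most spam-biased.

    term_counts maps term -> (spam documents containing the term,
                              all documents containing the term).
    """
    ranked = sorted(term_counts,
                    key=lambda t: bi_test_pvalue(*term_counts[t], p0))
    return ranked[:k_best]

# Toy usage: "viagra" appears in 48 of the 50 documents that contain it
# when those documents are spam, so its upper-tail p-value is tiny.
counts = {"viagra": (48, 50), "meeting": (5, 40), "free": (70, 100)}
print(select_features(counts, k_best=2))  # e.g. ['viagra', 'free']

In the paper's setting, the selected terms would then be fed to an NB or SVM classifier; the toy counts above exist only to exercise the functions.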
Pages: 904-914 (11 pages)