Improving Naive Bayes by Reducing the Importance of Low-Frequency Words Based on Entropy of Words for Spam Email Classification

Cited by: 0
Authors
Trikanjananun, Phaiboon [1]
Numsomran, Arjin [1]
Tipsuwannaporn, Vittaya [1]
Affiliations
[1] King Mongkuts Inst Technol Ladkrabang, Sch Engn, Dept Instrumentat & Control Engn, Bangkok, Thailand
Keywords
Naive bayes; NB algorithm; Spam email classification;
DOI
Not available
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
The Naive Bayes (NB) algorithm is popular for spam email classification because it trains quickly, relies on simple techniques, and achieves high accuracy. One of the many studies that improve the NB algorithm proposes the AWF-NB algorithm; in this paper, we refer to that work as the AWF-NB algorithm for convenience. The NB algorithm treats a word as equally important in every class, which is not always the case. To address this, AWF-NB sharply reduces the importance of a word in the class where it is less important. However, this reduces accuracy when the importance of a word differs only slightly between classes. The goal of this research is therefore to improve the AWF-NB algorithm by reducing the importance of words based on their entropy: we compute the entropy of a word to decide whether its importance should be reduced. Experimental results on ten spam email datasets from the Kaggle website indicate that the proposed RIWE-NB algorithm remarkably increases classification accuracy over both the NB algorithm and the AWF-NB algorithm on the majority of datasets, while execution time is preserved.
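The abstract does not give the entropy formula or the rule that turns entropy into a reduction decision. As a rough illustration only, the sketch below computes the Shannon entropy of a word's distribution over classes from per-class counts. The function name `word_class_entropy`, the additive smoothing constant `alpha`, and the choice of a class distribution as the entropy's argument are all assumptions of this sketch, not details taken from the paper.

```python
import math


def word_class_entropy(counts, alpha=1.0):
    """Shannon entropy (in bits) of one word's smoothed class distribution.

    counts: occurrences of the word in each class, e.g. [spam, ham].
    alpha:  additive smoothing constant (an assumption of this sketch).
    """
    total = sum(counts) + alpha * len(counts)
    probs = [(c + alpha) / total for c in counts]
    # alpha > 0 keeps every probability strictly positive, so log2 is defined.
    return -sum(p * math.log2(p) for p in probs)


# A word seen equally often in both classes has maximal entropy (1 bit for
# two classes); a word concentrated in one class has entropy close to 0.
balanced = word_class_entropy([5, 5])
skewed = word_class_entropy([98, 0])
```

Under these assumptions, a high value means the word is spread almost evenly across classes and a low value means it is concentrated in one class. How RIWE-NB uses that value, and with what threshold, is not stated in the abstract.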
Pages: 10-14
Page count: 5
Related papers
3 items
  • [1] Email Spam Classification using Neighbor Probability based Naive Bayes Algorithm
    Anitha, P. U.
    Rao, C. V. Guru
    Babu, Suresh
    2017 7TH INTERNATIONAL CONFERENCE ON COMMUNICATION SYSTEMS AND NETWORK TECHNOLOGIES (CSNT), 2017, : 350 - 355
  • [2] Text and Image Based Spam Email Classification using KNN, Naive Bayes and Reverse DBSCAN Algorithm
    Harisinghaney, Anirudh
    Dixit, Aman
    Gupta, Saurabh
    Arora, Anuja
    PROCEEDINGS OF THE 2014 INTERNATIONAL CONFERENCE ON RELIABILTY, OPTIMIZATION, & INFORMATION TECHNOLOGY (ICROIT 2014), 2014, : 153 - 155
  • [3] Efficient Estimate of Low-Frequency Words' Embeddings Based on the Dictionary: A Case Study on Chinese
    Liao, Xianwen
    Huang, Yongzhong
    Wei, Changfu
    Zhang, Chenhao
    Deng, Yongqing
    Yi, Ke
    APPLIED SCIENCES-BASEL, 2021, 11 (22):