A Local-Concentration-Based Feature Extraction Approach for Spam Filtering

Cited by: 35
Authors
Zhu, Yuanchun [1 ,2 ]
Tan, Ying [1 ,2 ]
Affiliations
[1] Peking Univ, Sch Elect Engn & Comp Sci, Key Lab Machine Percept, Minist Educ, Beijing 100871, Peoples R China
[2] Peking Univ, Sch Elect Engn & Comp Sci, Dept Machine Intelligence, Beijing 100871, Peoples R China
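Funding
National High Technology Research and Development Program of China (863 Program); National Natural Science Foundation of China;
Keywords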
Artificial immune system (AIS); bag-of-words (BoW); feature extraction; global concentration (GC); local concentration (LC); spam filtering;
DOI
10.1109/TIFS.2010.2103060
Chinese Library Classification number
TP301 [Theory, Methods];
Discipline classification code
081202 ;
Abstract
Inspired by the biological immune system, we propose a local concentration (LC)-based feature extraction approach for anti-spam. The LC approach is considered to be able to effectively extract position-correlated information from messages by transforming each area of a message to a corresponding LC feature. Two implementation strategies of the LC approach are designed, one using a fixed-length sliding window and the other a variable-length sliding window. To incorporate the LC approach into the whole process of spam filtering, a generic LC model is designed. In the LC model, two types of detector sets are first generated by using term selection methods and a well-defined tendency threshold. Then a sliding window is adopted to divide the message into individual areas. After segmentation of the message, the concentration of detectors is calculated and taken as the feature for each local area. Finally, the features of all local areas are combined into a feature vector for the message. To evaluate the proposed LC model, several experiments are conducted on five benchmark corpora using the cross-validation method. It is shown that the LC approach cooperates well with three term selection methods, which gives it flexible applicability in the real world. Compared to the global-concentration-based approach and the prevalent bag-of-words approach, the LC approach has better performance in terms of both accuracy and F1 measure. It is also demonstrated that the LC approach is robust to variation in message length.
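The pipeline described in the abstract (detector sets, window segmentation, per-area concentrations, combined feature vector) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the tendency formula (difference of per-class document frequencies), the definition of concentration (fraction of distinct window terms matched by a detector set), and the equal-split windowing are all hypothetical choices, and the term selection step is omitted.

```python
from collections import Counter


def build_detector_sets(spam_docs, ham_docs, threshold=0.0):
    """Split terms into spam and ham detector sets by class tendency.

    Assumption: tendency = P(term | spam) - P(term | ham), estimated from
    document frequencies. The abstract does not give the exact formula.
    """
    spam_df = Counter(t for doc in spam_docs for t in set(doc))
    ham_df = Counter(t for doc in ham_docs for t in set(doc))
    spam_detectors, ham_detectors = set(), set()
    for term in set(spam_df) | set(ham_df):
        tendency = spam_df[term] / len(spam_docs) - ham_df[term] / len(ham_docs)
        if tendency > threshold:
            spam_detectors.add(term)
        elif tendency < -threshold:
            ham_detectors.add(term)
    return spam_detectors, ham_detectors


def concentration(window, detectors):
    """Fraction of distinct terms in the window matched by the detector set."""
    terms = set(window)
    return len(terms & detectors) / len(terms) if terms else 0.0


def lc_features(tokens, spam_detectors, ham_detectors, n_windows=3):
    """Variable-length-window sketch: split the message into n_windows areas.

    Each area contributes two features: its spam-detector concentration
    and its ham-detector concentration.
    """
    size = -(-len(tokens) // n_windows) if tokens else 0  # ceiling division
    features = []
    for i in range(n_windows):
        window = tokens[i * size:(i + 1) * size]
        features.append(concentration(window, spam_detectors))
        features.append(concentration(window, ham_detectors))
    return features
```

Because the number of windows is fixed here, every message yields a feature vector of the same length (2 × `n_windows`) regardless of how many tokens it contains; a fixed-length window would instead need padding or truncation to reach a constant dimensionality.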
Pages: 486-497
Page count: 12
Related papers
50 in total
  • [41] Anti-spam filtering: A centroid-based classification approach
    Soonthornphisaj, N
    Chaikulseriwat, K
    Tang-On, P
    2002 6TH INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING PROCEEDINGS, VOLS I AND II, 2002, : 1096 - 1099
  • [42] Relaxing feature selection in spam filtering by using case-based reasoning systems
    Mendez, J. R.
    Fdez-Riverola, F.
    Glez-Pena, D.
    Diaz, F.
    Corchado, J. M.
    PROGRESS IN ARTIFICIAL INTELLIGENCE, PROCEEDINGS, 2007, 4874 : 53 - +
  • [43] A Mixed Method for Feature Extraction Based on Resonance Filtering
    Zhang, Xia
    Lu, Wei
    Ding, Youwei
    Song, Yihua
    Xia, Jinyue
    INTELLIGENT AUTOMATION AND SOFT COMPUTING, 2023, 35 (03): : 3141 - 3154
  • [44] Kernel-Based Feature Extraction For Collaborative Filtering
    Sathe, Saket
    Aggarwal, Charu C.
    Kong, Xiangnan
    Liu, Xinyue
    2017 17TH IEEE INTERNATIONAL CONFERENCE ON DATA MINING (ICDM), 2017, : 1057 - 1062
  • [45] A robust approach based on local feature extraction for age invariant face recognition
    Rajesh Kumar Tripathi
    Anand Singh Jalal
    Multimedia Tools and Applications, 2022, 81 : 21223 - 21240
  • [46] A robust approach based on local feature extraction for age invariant face recognition
    Tripathi, Rajesh Kumar
    Jalal, Anand Singh
    MULTIMEDIA TOOLS AND APPLICATIONS, 2022, 81 (15) : 21223 - 21240
  • [47] Local-concentration-based descriptor predicting the stacking fault energy of refractory high-entropy alloys
    Ma, Cong
    Gao, Wang
    PHYSICAL REVIEW MATERIALS, 2023, 7 (11)
  • [48] A fuzzy similarity approach for automated spam filtering
    El-Alfy, El-Sayed M.
    Al-Qunaieer, Fares S.
    2008 IEEE/ACS INTERNATIONAL CONFERENCE ON COMPUTER SYSTEMS AND APPLICATIONS, VOLS 1-3, 2008, : 544 - +
  • [49] Combining SVM with Orthogonal Centroid Feature Selection for Spam Filtering
    Zhou, Hong-liang
    Luo, Chang-yong
    INTERNATIONAL CONFERENCE ON COMPUTER, NETWORK SECURITY AND COMMUNICATION ENGINEERING (CNSCE 2014), 2014, : 290 - 297
  • [50] Applying cost-sensitive multiobjective genetic programming to feature extraction for spam e-mail filtering
    Zhang, Yang
    Li, HongYu
    Niranjan, Mahesan
    Rockett, Peter
    GENETIC PROGRAMMING, PROCEEDINGS, 2008, 4971 : 325 - +