Towards Web Spam Filtering using a Classifier based on the Minimum Description Length Principle

Cited by: 0
Authors
Silva, Renato M. [1 ]
Yamakami, Akebo [1 ]
Almeida, Tiago A. [2 ]
Affiliations
[1] Univ Campinas UNICAMP, Sch Elect & Comp Engn, Sao Paulo, Brazil
[2] Fed Univ Sao Carlos UFSCar, Dept Comp Sci, Sao Paulo, Brazil
Funding
São Paulo Research Foundation (FAPESP), Brazil
DOI
10.1109/ICMLA.2016.170
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The steady growth and popularization of the Web have led spammers to develop techniques that circumvent search engines in order to gain good visibility for their web pages in search results. These techniques are responsible for serious problems such as user dissatisfaction, irritation, exposure to unpleasant or malicious content, and financial loss. Although different machine learning approaches have been used to detect web spam, many of them suffer from the curse of dimensionality or incur a very high computational cost, which impedes their use in real scenarios. Consequently, considerable effort is still devoted to developing more advanced methods that both prevent overfitting and learn quickly. To fill this gap, we present MDLClass, a classification technique based on the minimum description length principle, applied to web spam filtering. The proposed method is efficient, lightweight, multi-class, and fast. We also evaluate a new approach to detecting web spam that combines the predictions obtained by classifiers using content-based, link-based, and transformed link-based features. In our experiments, we employed two real, public, and large datasets: WEBSPAM-UK2006 and WEBSPAM-UK2007. The results indicate that the proposed MDLClass and the ensemble of predictions using different types of features are promising for the task of web spam filtering.
Pages: 470-475
Page count: 6
Related papers
50 in total
  • [1] Histograms based on the minimum description length principle
    Hai Wang
    Kenneth C. Sevcik
    The VLDB Journal, 2008, 17 : 419 - 442
  • [2] Histograms based on the minimum description length principle
    Wang, Hai
    Sevcik, Kenneth C.
    VLDB JOURNAL, 2008, 17 (03) : 419 - 442
  • [3] Towards Web Spam Filtering with Neural-Based Approaches
    Silva, Renato Moraes
    Almeida, Tiago A.
    Yamakami, Akebo
    ADVANCES IN ARTIFICIAL INTELLIGENCE - IBERAMIA 2012, 2012, 7637 : 199 - 209
  • [4] Model selection using the minimum description length principle
    Bryant, PG
    Cordero-Braña, OI
    AMERICAN STATISTICIAN, 2000, 54 (04) : 257 - 268
  • [5] Introducing the minimum description length principle
    Grünwald, P
    ADVANCES IN MINIMUM DESCRIPTION LENGTH THEORY AND APPLICATIONS, 2005 : 3 - 21
  • [6] A minimum description length principle for perception
    Chater, N
    ADVANCES IN MINIMUM DESCRIPTION LENGTH THEORY AND APPLICATIONS, 2005 : 385 - 409
  • [7] Bloat control and generalization pressure using the minimum description length principle for a Pittsburgh approach learning classifier system
    Bacardit, Jaume
    Garrell, Josep Maria
    LEARNING CLASSIFIER SYSTEMS, 2007, 4399 : 59 - 79
  • [8] INFERRING DECISION TREES USING THE MINIMUM DESCRIPTION LENGTH PRINCIPLE
    QUINLAN, JR
    RIVEST, RL
    INFORMATION AND COMPUTATION, 1989, 80 (03) : 227 - 248
  • [9] Regression spline smoothing using the minimum description length principle
    Lee, TCM
    STATISTICS & PROBABILITY LETTERS, 2000, 48 (01) : 71 - 82
  • [10] Cluster Validity Measures Based on the Minimum Description Length Principle
    Georgieva, Olga
    Tschumitschew, Katharina
    Klawonn, Frank
    KNOWLEDGE-BASED AND INTELLIGENT INFORMATION AND ENGINEERING SYSTEMS, PT I: 15TH INTERNATIONAL CONFERENCE, KES 2011, 2011, 6881 : 82 - 89