The hybrid representation model for web document classification

Cited by: 10
Authors
Markov, A. [1 ]
Last, M. [1 ]
Kandel, A. [2 ]
Affiliations
[1] Ben Gurion Univ Negev, Dept Informat Syst Engn, IL-84105 Beer Sheva, Israel
[2] Univ S Florida, Dept Comp Sci & Engn, Tampa, FL 33620 USA
DOI
10.1002/int.20290
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Most Web content categorization methods are based on the vector space model of information retrieval. One of the most important advantages of this representation model is that it can be used by both instance-based and model-based classifiers. However, this popular method of document representation does not capture important structural information, such as the order and proximity of word occurrence or the location of a word within the document. It also makes no use of the markup information that can easily be extracted from the Web document HTML tags. A recently developed graph-based Web document representation model can preserve Web document structural information. It was shown to outperform the traditional vector representation using the k-Nearest Neighbor (k-NN) classification algorithm. The problem, however, is that the eager (model-based) classifiers cannot work with this representation directly. In this article, three new hybrid approaches to Web document classification are presented, built upon both graph and vector space representations, thus preserving the benefits and overcoming the limitations of each. The hybrid methods presented here are compared to vector-based models using the C4.5 decision tree and the probabilistic Naive Bayes classifiers on several benchmark Web document collections. The results demonstrate that the hybrid methods presented in this article outperform, in most cases, existing approaches in terms of classification accuracy, and in addition, achieve a significant reduction in the classification time. (c) 2008 Wiley Periodicals, Inc.
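The contrast drawn in the abstract between vector and graph representations can be illustrated with a small sketch. The abstract does not describe how the three hybrid approaches are constructed, so everything below is an assumption made for illustration only: the helper names (`bag_of_words`, `word_graph`, `hybrid_vector`) are hypothetical, the graph is reduced to directed edges between adjacent words, and the "hybrid" step simply turns edge presence into a boolean vector that a model-based classifier could consume.

```python
from collections import Counter

def bag_of_words(tokens):
    """Vector space view: term counts only, word order is lost."""
    return Counter(tokens)

def word_graph(tokens):
    """Graph view (simplified assumption): one directed edge per pair of adjacent words."""
    return set(zip(tokens, tokens[1:]))

def hybrid_vector(graph, feature_edges):
    """Hypothetical hybrid step: mark which chosen edges occur in the document graph."""
    return [1 if edge in graph else 0 for edge in feature_edges]

doc_a = "web document classification".split()
doc_b = "classification document web".split()

# The two documents are indistinguishable as bags of words...
same_bags = bag_of_words(doc_a) == bag_of_words(doc_b)

# ...but their word-order graphs differ.
graph_a, graph_b = word_graph(doc_a), word_graph(doc_b)

# Use every observed edge as a boolean feature (an illustrative choice).
features = sorted(graph_a | graph_b)
vec_a = hybrid_vector(graph_a, features)
vec_b = hybrid_vector(graph_b, features)
```

Under these assumptions, `vec_a` and `vec_b` are fixed-length vectors that differ even though the term counts are identical, which is the kind of structural information the abstract says the plain vector space model discards.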
Pages: 654-679
Page count: 26