The hybrid representation model for web document classification

Cited by: 10
Authors
Markov, A. [1]
Last, M. [1]
Kandel, A. [2]
Affiliations
[1] Ben Gurion Univ Negev, Dept Informat Syst Engn, IL-84105 Beer Sheva, Israel
[2] Univ S Florida, Dept Comp Sci & Engn, Tampa, FL 33620 USA
DOI
10.1002/int.20290
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Most Web content categorization methods are based on the vector space model of information retrieval. One of the most important advantages of this representation model is that it can be used by both instance-based and model-based classifiers. However, this popular method of document representation does not capture important structural information, such as the order and proximity of word occurrence or the location of a word within the document. It also makes no use of the markup information that can easily be extracted from the HTML tags of a Web document. A recently developed graph-based Web document representation model can preserve this structural information. It was shown to outperform the traditional vector representation using the k-Nearest Neighbor (k-NN) classification algorithm. The problem, however, is that eager (model-based) classifiers cannot work with this representation directly. In this article, three new hybrid approaches to Web document classification are presented, built upon both graph and vector space representations, thus preserving the benefits and overcoming the limitations of each. The hybrid methods presented here are compared to vector-based models using the C4.5 decision tree and the probabilistic Naive Bayes classifiers on several benchmark Web document collections. The results demonstrate that the hybrid methods presented in this article outperform, in most cases, existing approaches in terms of classification accuracy and, in addition, achieve a significant reduction in classification time. © 2008 Wiley Periodicals, Inc.
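The abstract does not describe how the three hybrid representations are constructed, so the following is only a toy illustration of the general idea and not the authors' method. It assumes a simplified word-adjacency graph (each word linked to its successor) and hypothetical helper names (`bag_of_words`, `adjacency_graph`, `hybrid_vector`). It shows that two documents can share the same bag-of-words vector while differing in graph structure, and that graph edges can be flattened into a fixed-length vector of the kind a model-based classifier could consume.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(text):
    """Vector space view: term -> frequency; word order is discarded."""
    return Counter(tokenize(text))


def adjacency_graph(text):
    """Graph view (simplifying assumption): nodes are unique words and
    directed edges link each word to the word that follows it."""
    tokens = tokenize(text)
    nodes = set(tokens)
    edges = set(zip(tokens, tokens[1:]))
    return nodes, edges


def hybrid_vector(text, edge_vocabulary):
    """Hybrid view (illustrative only): a binary vector with one slot per
    edge in edge_vocabulary, set to 1 if the document graph has that edge."""
    _, edges = adjacency_graph(text)
    return [1 if edge in edges else 0 for edge in edge_vocabulary]


doc_a = "web document classification"
doc_b = "classification document web"

# Both documents contain the same words with the same frequencies.
same_bow = bag_of_words(doc_a) == bag_of_words(doc_b)

# Their adjacency graphs differ because the word order differs.
_, edges_a = adjacency_graph(doc_a)
_, edges_b = adjacency_graph(doc_b)
edge_vocabulary = sorted(edges_a | edges_b)

vec_a = hybrid_vector(doc_a, edge_vocabulary)
vec_b = hybrid_vector(doc_b, edge_vocabulary)

print(same_bow)  # True
print(vec_a)     # [0, 1, 0, 1]
print(vec_b)     # [1, 0, 1, 0]
```

In this sketch the edge-based vectors separate the two documents even though their term-frequency vectors are identical. A real system would need to select which subgraphs to use as features, a step the abstract does not detail.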
Pages: 654-679
Number of pages: 26
Related papers
50 in total
  • [41] A PSO-based web document classification algorithm
    Ziqiang Wang
    Qingzhou Zhang
    Dexian Zhang
    SNPD 2007: EIGHTH ACIS INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING, ARTIFICIAL INTELLIGENCE, NETWORKING, AND PARALLEL/DISTRIBUTED COMPUTING, VOL 3, PROCEEDINGS, 2007, : 659 - +
  • [42] PCCS: a fast clustering and classification method for Web document
    Wang, A.H.
    Zhang, M.
    Yang, D.Q.
    Tang, S.W.
    Jisuanji Yanjiu yu Fazhan/Computer Research and Development, 2001, 38 (04):
  • [43] Web Document Classification using Support Vector Machine
    Shinde, Sharmila
    Joeg, Prasanna
    Vanjale, Sandeep
    2017 INTERNATIONAL CONFERENCE ON CURRENT TRENDS IN COMPUTER, ELECTRICAL, ELECTRONICS AND COMMUNICATION (CTCEEC), 2017, : 688 - 691
  • [44] Annotation based classification of the PDF document for semantic web
    Shukla, Archana
    ICECT 2011 - 2011 3rd International Conference on Electronics Computer Technology, 2011, 1 : 370 - 376
  • [45] An application of the nearest correlation matrix on web document classification
    Qi, Houduo
    Xia, Zhonghang
    Xing, Guangming
    JOURNAL OF INDUSTRIAL AND MANAGEMENT OPTIMIZATION, 2007, 3 (04) : 701 - 713
  • [46] Improving SVM on Web Content Classification by Document Formulation
    Xia, Tian
    Chai, Yanmei
    Wang, Tong
    PROCEEDINGS OF 2012 7TH INTERNATIONAL CONFERENCE ON COMPUTER SCIENCE & EDUCATION, VOLS I-VI, 2012, : 110 - 113
  • [47] A MA-Based Web Document Classification Algorithm
    Sun, Xia
    Wang, Ziqiang
    Zhang, Dexian
    2008 IEEE INTERNATIONAL SYMPOSIUM ON IT IN MEDICINE AND EDUCATION, VOLS 1 AND 2, PROCEEDINGS, 2008, : 952 - 955
  • [48] Web document classification based on extended rough set
    Yi, GX
    Hu, HP
    Lu, ZD
    PDCAT 2005: SIXTH INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED COMPUTING, APPLICATIONS AND TECHNOLOGIES, PROCEEDINGS, 2005, : 916 - 918
  • [49] Clonal selection algorithm based web document classification
    Hu, Xuanzi
    He, Dingxiu
    Journal of Information and Computational Science, 2010, 7 (02) : 551 - 557
  • [50] Statistical Methods for Performance Evaluation of WEB Document Classification
    Volovici, Daniel
    Breazu, Macarie
    Curea, Gabriel Dacian
    Morariu, Daniel Ionel
    STUDIES IN INFORMATICS AND CONTROL, 2010, 19 (02) : 169 - 176