The hybrid representation model for web document classification

Cited by: 10
Authors
Markov, A. [1 ]
Last, M. [1 ]
Kandel, A. [2 ]
Affiliations
[1] Ben Gurion Univ Negev, Dept Informat Syst Engn, IL-84105 Beer Sheva, Israel
[2] Univ S Florida, Dept Comp Sci & Engn, Tampa, FL 33620 USA
DOI
10.1002/int.20290
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Most Web content categorization methods are based on the vector space model of information retrieval. One of the most important advantages of this representation model is that it can be used by both instance-based and model-based classifiers. However, this popular method of document representation does not capture important structural information, such as the order and proximity of word occurrence or the location of a word within the document. It also makes no use of the markup information that can easily be extracted from the Web document HTML tags. A recently developed graph-based Web document representation model can preserve Web document structural information. It was shown to outperform the traditional vector representation using the k-Nearest Neighbor (k-NN) classification algorithm. The problem, however, is that eager (model-based) classifiers cannot work with this representation directly. In this article, three new hybrid approaches to Web document classification are presented, built upon both graph and vector space representations, thus preserving the benefits and overcoming the limitations of each. The hybrid methods presented here are compared to vector-based models using the C4.5 decision tree and the probabilistic Naive Bayes classifiers on several benchmark Web document collections. The results demonstrate that the hybrid methods presented in this article outperform, in most cases, existing approaches in terms of classification accuracy and, in addition, achieve a significant reduction in classification time. (c) 2008 Wiley Periodicals, Inc.
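The contrast drawn in the abstract can be illustrated with a minimal Python sketch. It is not the authors' method: the abstract does not specify the three hybrid approaches, so the adjacency-edge graph and the Boolean edge features below are assumptions chosen only to show why word order is lost in a term vector, kept in a graph, and recoverable in a fixed-length vector that an eager classifier could consume. All function names (`bag_of_words`, `word_graph`, `hybrid_vector`) and the sample documents are hypothetical.

```python
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; real Web documents would need HTML parsing.
    return text.lower().split()


def bag_of_words(text):
    """Vector space view: term counts only, so word order is discarded."""
    return Counter(tokenize(text))


def word_graph(text):
    """Graph view (assumed form): nodes are unique terms,
    directed edges link each term to the term that follows it."""
    tokens = tokenize(text)
    nodes = set(tokens)
    edges = set(zip(tokens, tokens[1:]))
    return nodes, edges


def hybrid_vector(text, feature_edges):
    """Hybrid view (assumed scheme): one Boolean attribute per selected edge,
    giving a fixed-length vector usable by model-based classifiers."""
    _, edges = word_graph(text)
    return [1 if edge in edges else 0 for edge in feature_edges]


# Two hypothetical documents with the same words in a different order.
doc_a = "web document classification"
doc_b = "classification document web"

same_bow = bag_of_words(doc_a) == bag_of_words(doc_b)        # order is invisible
same_graph = word_graph(doc_a)[1] == word_graph(doc_b)[1]    # order is visible

# Hypothetical set of edges chosen as features.
feature_edges = [
    ("web", "document"),
    ("document", "classification"),
    ("document", "web"),
]
vec_a = hybrid_vector(doc_a, feature_edges)
vec_b = hybrid_vector(doc_b, feature_edges)
```

In this sketch the two documents are indistinguishable as term vectors but differ as graphs, and the Boolean edge features carry that difference into an ordinary attribute vector of the kind a decision tree or Naive Bayes classifier expects.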
Pages: 654-679
Page count: 26