The hybrid representation model for web document classification

Cited by: 10

Authors:

Markov, A. [1]

Last, M. [1]

Kandel, A. [2]

Affiliations:

[1] Ben Gurion Univ Negev, Dept Informat Syst Engn, IL-84105 Beer Sheva, Israel

[2] Univ S Florida, Dept Comp Sci & Engn, Tampa, FL 33620 USA

Source:

International Journal of Intelligent Systems | 2008 / Vol. 23 / Issue 06

DOI:

10.1002/int.20290

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Most Web content categorization methods are based on the vector space model of information retrieval. One of the most important advantages of this representation model is that it can be used by both instance-based and model-based classifiers. However, this popular method of document representation does not capture important structural information, such as the order and proximity of word occurrence or the location of a word within the document. It also makes no use of the markup information that can easily be extracted from the Web document HTML tags. A recently developed graph-based Web document representation model can preserve Web document structural information. It was shown to outperform the traditional vector representation using the k-Nearest Neighbor (k-NN) classification algorithm. The problem, however, is that eager (model-based) classifiers cannot work with this representation directly. In this article, three new hybrid approaches to Web document classification are presented, built upon both graph and vector space representations, thus preserving the benefits and overcoming the limitations of each. The hybrid methods presented here are compared to vector-based models using the C4.5 decision tree and the probabilistic Naive Bayes classifiers on several benchmark Web document collections. The results demonstrate that the hybrid methods presented in this article outperform, in most cases, existing approaches in terms of classification accuracy and, in addition, achieve a significant reduction in classification time. (c) 2008 Wiley Periodicals, Inc.
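The contrast drawn in the abstract between vector and graph representations can be illustrated with a small sketch. This is not the authors' method: the abstract does not describe the three hybrid approaches in detail, so the code below only assumes the general idea of turning graph substructures into vector features. The function names (`bag_of_words`, `word_graph_edges`, `hybrid_vectors`) and the choice of single directed edges between adjacent words as the substructures are hypothetical.

```python
from collections import Counter


def bag_of_words(text):
    """Vector space view: term counts only, so word order is discarded."""
    return Counter(text.lower().split())


def word_graph_edges(text):
    """Graph view (assumed): one directed edge (a, b) per pair of adjacent words."""
    words = text.lower().split()
    return set(zip(words, words[1:]))


def hybrid_vectors(docs):
    """Hypothetical hybrid step: encode each document as a binary vector
    over all edges seen in the collection, so that a model-based
    classifier can consume structure-aware features."""
    graphs = [word_graph_edges(d) for d in docs]
    vocabulary = sorted(set().union(*graphs))
    vectors = [[1 if edge in g else 0 for edge in vocabulary] for g in graphs]
    return vocabulary, vectors


docs = ["web page ranking", "ranking page web"]

# Same words in a different order: the bag-of-words views are identical.
same_bag = bag_of_words(docs[0]) == bag_of_words(docs[1])

# The edge-based vectors differ, because word order is preserved.
vocabulary, vectors = hybrid_vectors(docs)
```

In this toy case the two documents are indistinguishable as term-count vectors but receive different edge-based vectors, which is the kind of structural information the abstract says the plain vector space model loses. The resulting fixed-length vectors could then be passed to an eager classifier such as a decision tree or Naive Bayes.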

Pages: 654-679

Page count: 26
