Clustering-based topical Web crawling using CFu-tree guided by link-context

Cited by: 0

Authors:

Lu Liu

Tao Peng

Affiliations:

[1] Jilin University, College of Computer Science and Technology

[2] University of Illinois at Urbana-Champaign, Department of Computer Science

Source:

Frontiers of Computer Science | 2014 / Vol. 8

Keywords:

topical Web crawling; comparison variation (CV); cluster impurity (CIP); CFu-tree; link-context; clustering

DOI:

Not available

Chinese Library Classification number:

Subject classification number:

Abstract:

Topical Web crawling is an established technique for domain-specific information retrieval. However, almost all conventional topical Web crawlers are built on classifiers, which require large amounts of labeled training data that are very difficult to label manually. This paper presents a novel approach called clustering-based topical Web crawling, which retrieves information on a specific domain based on link-context and does not require any labeled training data. To collect domain-specific content units, a novel bottom-up hierarchical clustering method is used, in which a new data structure, a linked list combined with a CFu-tree, is implemented to store the cluster label, feature vector, and content unit. Four metrics are presented for the clustering. First, comparison variation (CV) is defined to judge whether the closest pair of clusters can be merged. Second, cluster impurity (CIP) evaluates the cluster error. Then, the precision and recall of clustering are presented to evaluate the accuracy and comprehensiveness of the whole clustering process. A link-context extraction technique is used to expand the feature vector of anchor text, which greatly improves clustering accuracy. Experimental results show that the proposed method outperforms conventional focused Web crawlers in both harvest rate and target recall.
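
The abstract reports results in terms of harvest rate and target recall but does not define them. As a rough illustration, the Python sketch below uses the definitions commonly found in the focused-crawling literature: harvest rate as the fraction of crawled pages that are relevant, and target recall as the fraction of a known target set that the crawl reached. The paper's exact definitions may differ. The function names and the use of plain URL sets to stand in for relevance judgments are assumptions made for this example.

```python
from typing import Iterable, Set


def harvest_rate(crawled: Iterable[str], relevant: Set[str]) -> float:
    """Fraction of crawled pages that are relevant to the topic.

    `relevant` is a hypothetical set of URLs judged on-topic; in practice
    relevance would come from some external judgment, not a fixed set.
    """
    pages = list(crawled)
    if not pages:
        return 0.0
    return sum(1 for url in pages if url in relevant) / len(pages)


def target_recall(crawled: Iterable[str], targets: Set[str]) -> float:
    """Fraction of a known target set that the crawl has reached."""
    if not targets:
        return 0.0
    return len(set(crawled) & targets) / len(targets)


if __name__ == "__main__":
    crawled_pages = ["a", "b", "c", "d"]  # placeholder page identifiers
    print(harvest_rate(crawled_pages, {"a", "b", "x"}))
    print(target_recall(crawled_pages, {"a", "x"}))
```

Under these definitions a crawler can score well on one metric and poorly on the other, which may be why the abstract reports both.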

Pages: 581-595

Page count: 14
