Clustering-based topical Web crawling using CFu-tree guided by link-context

Cited by: 0
Authors
Lu Liu
Tao Peng
Affiliations
[1] Jilin University, College of Computer Science and Technology
[2] University of Illinois at Urbana-Champaign, Department of Computer Science
Source
Keywords
topical Web crawling; comparison variation (CV); cluster impurity (CIP); CFu-tree; link-context; clustering
DOI
Not available
Chinese Library Classification number
Subject classification number
Abstract
Topical Web crawling is an established technique for domain-specific information retrieval. However, almost all conventional topical Web crawlers are built on classifiers, which require a large amount of labeled training data that is very difficult to label manually. This paper presents a novel approach, clustering-based topical Web crawling, which retrieves information on a specific domain based on link-context and does not require any labeled training data. To collect domain-specific content units, a novel bottom-up hierarchical clustering method is used, in which a new data structure, a linked list combined with a CFu-tree, stores the cluster label, feature vector, and content unit. Four metrics are presented for the clustering process. First, comparison variation (CV) is defined to judge whether the closest pair of clusters can be merged. Second, cluster impurity (CIP) evaluates the cluster error. Third and fourth, the precision and recall of clustering evaluate the accuracy and completeness of the whole clustering process. A link-context extraction technique expands the feature vector of the anchor text, which greatly improves clustering accuracy. Experimental results show that the proposed method outperforms conventional focused Web crawlers in both harvest rate and target recall.
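The bottom-up clustering and the two crawl-level measures named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not define CV or CIP, so a fixed cosine-similarity threshold (`merge_threshold`, a hypothetical parameter) stands in for the CV merge test, and a plain list replaces the linked list with CFu-tree. Whitespace tokenization and the helper names are likewise assumptions. Harvest rate and target recall follow their usual definitions in the focused-crawling literature, which may differ in detail from the paper's.

```python
import math
from collections import Counter


def vectorize(text):
    """Term-frequency vector for an anchor text joined with its link-context."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def bottom_up_cluster(texts, merge_threshold=0.3):
    """Agglomerative clustering: repeatedly merge the closest pair of clusters.

    Assumption: the similarity threshold is a stand-in for the paper's
    comparison variation (CV) test, whose definition is not in the abstract.
    """
    # Each cluster is (summed feature vector, member indices).
    clusters = [(vectorize(t), [i]) for i, t in enumerate(texts)]
    while len(clusters) > 1:
        best_sim, best_pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = cosine(clusters[i][0], clusters[j][0])
                if sim > best_sim:
                    best_sim, best_pair = sim, (i, j)
        if best_sim < merge_threshold:
            break  # the closest pair is too dissimilar to merge
        i, j = best_pair
        merged = (clusters[i][0] + clusters[j][0], clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return [members for _, members in clusters]


def harvest_rate(crawled, relevant):
    """Fraction of crawled pages that are relevant to the topic."""
    crawled = list(crawled)
    if not crawled:
        return 0.0
    return sum(1 for page in crawled if page in relevant) / len(crawled)


def target_recall(crawled, targets):
    """Fraction of a known target set that the crawler actually reached."""
    targets = set(targets)
    if not targets:
        return 0.0
    return len(set(crawled) & targets) / len(targets)


if __name__ == "__main__":
    contexts = [
        "python web crawler tutorial",
        "focused web crawler python",
        "chocolate cake recipe",
        "easy chocolate cake baking",
    ]
    print(bottom_up_cluster(contexts))
    print(harvest_rate(["a", "b", "c", "d"], {"a", "b", "x"}))
    print(target_recall(["a", "b", "c", "d"], {"a", "b", "x"}))
```

Summing the member vectors gives each merged cluster a single aggregate feature vector, which is loosely in the spirit of a clustering-feature summary. The pairwise search is quadratic per merge, so the sketch is only suitable for small inputs.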
Pages: 581 / 595
Page count: 14
Related papers
24 in total
  • [1] Clustering-based topical Web crawling using CFu-tree guided by link-context
    Liu, Lu
    Peng, Tao
    FRONTIERS OF COMPUTER SCIENCE, 2014, 8 (04) : 581 - 595
  • [2] Adaptive topical web crawling for domain-specific resource discovery guided by link-context
    Peng, Tao
    He, Fengling
    Zuo, Wanli
    Zhang, Changli
    MICAI 2006: ADVANCES IN ARTIFICIAL INTELLIGENCE, PROCEEDINGS, 2006, 4293 : 963 - +
  • [3] Topical Web Crawling for Domain-Specific Resource Discovery Enhanced by Selectively using Link-Context
    Liu, Lu
    Peng, Tao
    Zuo, Wanli
    INTERNATIONAL ARAB JOURNAL OF INFORMATION TECHNOLOGY, 2015, 12 (02) : 196 - 204
  • [4] Clustering-Based Topical Web Crawling for Topic-Specific Information Retrieval Guided by Incremental Classifier
    Peng, Tao
    Liu, Lu
    INTERNATIONAL JOURNAL OF SOFTWARE ENGINEERING AND KNOWLEDGE ENGINEERING, 2015, 25 (01) : 147 - 168
  • [5] Clustering-Based Incremental Web Crawling
    Tan, Qingzhao
    Mitra, Prasenjit
    ACM TRANSACTIONS ON INFORMATION SYSTEMS, 2010, 28 (04)
  • [6] A novel incremental conceptual hierarchical text clustering method using CFu-tree
    Peng, Tao
    Liu, Lu
    APPLIED SOFT COMPUTING, 2015, 27 : 269 - 278
  • [7] A Clustering-based Approach to Web Image Context Extraction
    Alcic, Sadet
    Conrad, Stefan
    PROCEEDINGS OF THE THIRD INTERNATIONAL CONFERENCES ON ADVANCES IN MULTIMEDIA (MMEDIA 2011), 2011, : 74 - 79
  • [8] Using a joint link similarity evaluation based method for crawling the resources on Web
    Zhang N.-Z.
    Li S.-J.
    Yu W.
    Zhang Z.
    Jisuanji Xuebao/Chinese Journal of Computers, 2010, 33 (12) : 2266 - 2280
  • [9] Improving link prediction in social networks using local and global features: a clustering-based approach
    S. Ghasemi
    A. Zarei
    Progress in Artificial Intelligence, 2022, 11 : 79 - 92
  • [10] Improving link prediction in social networks using local and global features: a clustering-based approach
    Ghasemi, S.
    Zarei, A.
    PROGRESS IN ARTIFICIAL INTELLIGENCE, 2022, 11 (01) : 79 - 92