Accelerating the process of web page segmentation via template clustering

Authors
Zeleny J. [1]
Burget R. [2]
Affiliations
[1] Faculty of Information Technology, Brno University of Technology, Brno
[2] Faculty of Information Technology, Brno University of Technology, IT4Innovations Centre of Excellence, Brno
Keywords
Clustering; Page segmentation; Segmentation performance; Template; Template detection; VIPS; Vision-based page segmentation; Web page preprocessing; Web page segmentation
DOI
10.1504/IJIIDS.2016.075424
Abstract
Page segmentation is often one of the initial steps in data mining on a web page. In recent years, several page segmentation methods based on the visual perception of the web page have been developed. In this paper, we propose a generic method for improving the efficiency of virtually all vision-based segmentation algorithms. Our method, called cluster-based page segmentation, exploits the widespread concept of web templates: it clusters web pages and performs the segmentation on the clusters rather than on each page in a cluster. To demonstrate the efficiency of our algorithm, we present experimental results gathered using three different vision-based segmentation algorithms. Copyright © 2016 Inderscience Enterprises Ltd.
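The idea summarised in the abstract, clustering pages that share a template and segmenting once per cluster, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: the tag-path signature, the Jaccard similarity, the greedy clustering, the `threshold` value, and all function names are hypothetical, and `segment` stands in for any expensive vision-based segmenter (such as VIPS) supplied by the caller.

```python
from html.parser import HTMLParser


class TagPathCollector(HTMLParser):
    """Collects the set of root-to-element tag paths of an HTML page."""

    # Elements that never have a closing tag, so they are not pushed on the stack.
    VOID = {"br", "img", "hr", "input", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack.pop() != tag:
                pass


def signature(html):
    """Structural signature of a page (assumption: tag paths approximate the template)."""
    collector = TagPathCollector()
    collector.feed(html)
    return frozenset(collector.paths)


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_pages(pages, threshold=0.8):
    """Greedy clustering: a page joins the first cluster whose
    representative signature is at least `threshold` similar."""
    clusters = []  # list of (representative_signature, [page_ids])
    for page_id, html in pages.items():
        sig = signature(html)
        for rep_sig, members in clusters:
            if jaccard(sig, rep_sig) >= threshold:
                members.append(page_id)
                break
        else:
            clusters.append((sig, [page_id]))
    return [members for _, members in clusters]


def segment_by_cluster(pages, segment, threshold=0.8):
    """Run the expensive `segment` callable once per cluster and
    share its result among all pages of that cluster."""
    results = {}
    for members in cluster_pages(pages, threshold):
        shared = segment(pages[members[0]])
        for page_id in members:
            results[page_id] = shared
    return results
```

In this sketch the saving comes from calling `segment` once per cluster instead of once per page. How a cluster-level segmentation is mapped back onto the individual pages is not described in the abstract; the sketch simply reuses the representative page's result.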
Pages: 134-154
Number of pages: 20