Automatic template detection for structured web pages

Authors
Lo, Lawrence [1 ]
Ng, Vincent To-Yee [1 ]
Ng, Patrick [1 ]
Chan, Stephen C. F. [1 ]
Affiliation
[1] Hong Kong Polytech Univ, Dept Comp, Hong Kong, Hong Kong, Peoples R China
Source
2006 10TH INTERNATIONAL CONFERENCE ON COMPUTER SUPPORTED COOPERATIVE WORK IN DESIGN, PROCEEDINGS, VOLS 1 AND 2 | 2006
Keywords
collaborative system; webpage template construction; XML
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Similar web pages on a website are usually encoded from an underlying structured source and generated dynamically from a pre-defined template, such as the book information pages on Amazon.com. Given a set of web pages from a common website, it is possible to extract the template by analyzing the patterns the pages share. In our work, we developed CF-EXALG (Collaborative Finer-EXALG), based on EXALG, to decompose web pages and find their common structures. In our system, the templates used to generate web pages are discovered automatically and stored in XML format. Hence, data encoded in the web pages can be easily extracted, and the template can be stored for future manipulation. In our preliminary experiments, CF-EXALG was shown to be more accurate and efficient than other similar systems.
Pages: 708-713
Page count: 6