Automatic template detection for structured web pages

Authors
Lo, Lawrence [1 ]
Ng, Vincent To-Yee [1 ]
Ng, Patrick [1 ]
Chan, Stephen C. F. [1 ]
Affiliations
[1] Hong Kong Polytech Univ, Dept Comp, Hong Kong, Hong Kong, Peoples R China
Keywords
collaborative system; webpage template construction; XML
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Similar web pages on a web site are usually encoded from an underlying structured source and generated dynamically from a pre-defined template, such as the book information pages on Amazon.com. Given a set of web pages from a common website, it is possible to extract the template by analyzing the patterns the pages share. In our work, we developed CF-EXALG (Collaborative Finer-EXALG), based on EXALG, to decompose web pages and find their common structures. In our system, the templates used to generate web pages can be discovered automatically and stored in XML format. Hence, data encoded in web pages can be easily extracted, and the template can be stored for future manipulation. In our preliminary experiments, CF-EXALG has been shown to be more accurate and efficient than other similar systems.
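The idea of recovering a template from the patterns shared by several pages can be illustrated with a small Python sketch. This is not CF-EXALG or EXALG, whose details the abstract does not give; it is a deliberately simplified token alignment, and every name in it (`tokenize`, `common_tokens`, `infer_template`, `to_xml`, and the `template`/`token`/`slot` element names) is hypothetical. Tokens that appear in the same order on every page are treated as template, and each stretch where pages differ becomes a data slot.

```python
import re
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher
from functools import reduce

# A token is either a whole tag or a whitespace-delimited word (assumed granularity).
TOKEN_RE = re.compile(r"<[^>]+>|[^<\s]+")


def tokenize(html):
    """Split a page into tag tokens and word tokens."""
    return TOKEN_RE.findall(html)


def common_tokens(a, b):
    """Keep the tokens a and b share, in order; None marks a span where they differ."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    out = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "equal":
            out.extend(a[i1:i2])
        elif not out or out[-1] is not None:
            out.append(None)  # collapse each differing span into one slot
    return out


def infer_template(pages):
    """Fold the pairwise alignment over all pages of the same site."""
    return reduce(common_tokens, [tokenize(p) for p in pages])


def to_xml(template):
    """Serialize the template as XML: <token> for fixed text, <slot/> for data."""
    root = ET.Element("template")
    for tok in template:
        if tok is None:
            ET.SubElement(root, "slot")
        else:
            ET.SubElement(root, "token").text = tok
    return ET.tostring(root, encoding="unicode")


pages = [
    "<html><body><h1>Book A</h1><p>Price: 10</p></body></html>",
    "<html><body><h1>Book B</h1><p>Price: 25</p></body></html>",
    "<html><body><h1>Book C</h1><p>Price: 7</p></body></html>",
]
template = infer_template(pages)
template_xml = to_xml(template)
```

On these three made-up pages the varying title letter and price each collapse into one slot, while the tags and the words "Book" and "Price:" remain as template tokens. A real system would need to handle optional and repeated structures, which this sketch ignores.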
Pages: 708-713
Number of pages: 6
Related papers
  • [1] Tree-structured template generation for web pages
    Chuang, SL
    Hsu, JYJ
    IEEE/WIC/ACM INTERNATIONAL CONFERENCE ON WEB INTELLIGENCE (WI 2004), PROCEEDINGS, 2004, : 327 - +
  • [2] TEXT: Automatic Template Extraction from Heterogeneous Web Pages
    Kim, Chulyun
    Shim, Kyuseok
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2011, 23 (04) : 612 - 626
  • [3] Automatic data extraction from template generated web pages
    Ma, L
    Goharian, N
    Chowdhury, A
    PDPTA'03: PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED PROCESSING TECHNIQUES AND APPLICATIONS, VOLS 1-4, 2003, : 642 - 648
  • [4] Automatic data record detection in Web Pages
    Gao, Xiaoying
    Vuong, Le Phong Bao
    Zhang, Mengjie
    KNOWLEDGE SCIENCE, ENGINEERING AND MANAGEMENT, 2007, 4798 : 349 - +
  • [5] Unsupervised Structured Data Extraction from Template-generated Web Pages
    Grigalis, Tomas
    Cenys, Antanas
    JOURNAL OF UNIVERSAL COMPUTER SCIENCE, 2014, 20 (02) : 169 - 192
  • [6] Automatic Role Detection of Visual Elements of Web Pages for Automatic Accessibility Evaluation
    Duarte, Carlos
    Salvado, Ana
    Akpinar, M. Elgin
    Yesilada, Yeliz
    Carrico, Luis
    15TH INTERNATIONAL WEB FOR ALL CONFERENCE (W4A) 2018, 2018,
  • [7] Template-driven Web pages
    Johansen, J
    DR DOBBS JOURNAL, 1997, 22 (11): : 74 - +
  • [8] Template-Driven Web Pages
    Holden, S
    DR DOBBS JOURNAL, 1999, 24 (01): : 12 - 12
  • [9] Automatic Detection of Webpages that Share the Same Web Template
    Alarte, Julian
    Insa, David
    Silva, Josep
    Tamarit, Salvador
    ELECTRONIC PROCEEDINGS IN THEORETICAL COMPUTER SCIENCE, 2014, (163): : 2 - 15
  • [10] Automatic information extraction from semi-structured Web pages by pattern discovery
    Chang, CH
    Hsu, CN
    Lui, SC
    DECISION SUPPORT SYSTEMS, 2003, 35 (01) : 129 - 147