Automatic template detection for structured web pages

Authors
Lo, Lawrence [1]
Ng, Vincent To-Yee [1]
Ng, Patrick [1]
Chan, Stephen C. F. [1]
Affiliations
[1] Hong Kong Polytech Univ, Dept Comp, Hong Kong, Hong Kong, Peoples R China
Source
2006 10TH INTERNATIONAL CONFERENCE ON COMPUTER SUPPORTED COOPERATIVE WORK IN DESIGN, PROCEEDINGS, VOLS 1 AND 2 | 2006
Keywords
collaborative system; webpage template construction; XML
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Similar web pages on a website are usually encoded from an underlying structured source and generated dynamically from a predefined template, such as the book information pages on Amazon.com. Given a set of web pages from a common website, it is possible to extract the template by analyzing the patterns the pages share. In our work, we developed CF-EXALG (Collaborative Finer-EXALG), based on EXALG, to decompose web pages and find their common structures. In our system, the templates used to generate web pages are discovered automatically and stored in XML format. Hence, data encoded in the web pages can be easily extracted, and the template can be stored for future manipulation. In our preliminary experiments, CF-EXALG was shown to be more accurate and efficient than other similar systems.
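The idea of recovering a template from the patterns shared by several pages can be illustrated with a small sketch. This is not CF-EXALG or EXALG, whose details the abstract does not give; it swaps in a plain longest-common-subsequence comparison over page tokens. The sample pages and the function names (tokenize, lcs, infer_template, extract_data) are hypothetical and chosen only for illustration.

```python
import re


def tokenize(html):
    """Split a page into tag tokens and text tokens, dropping blank text."""
    return [t.strip() for t in re.findall(r"<[^>]+>|[^<]+", html) if t.strip()]


def lcs(a, b):
    """Longest common subsequence of two token lists (classic dynamic programming)."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Walk the table to recover one longest common subsequence.
    out, i, j = [], 0, 0
    while i < m and j < n:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return out


def infer_template(pages):
    """Approximate the template as the tokens shared, in order, by all pages."""
    token_lists = [tokenize(p) for p in pages]
    template = token_lists[0]
    for tokens in token_lists[1:]:
        template = lcs(template, tokens)
    return template


def extract_data(page, template):
    """Return the page tokens that the template does not account for."""
    data, k = [], 0
    for tok in tokenize(page):
        if k < len(template) and tok == template[k]:
            k += 1  # token belongs to the template
        else:
            data.append(tok)  # token is page-specific data
    return data


# Hypothetical sample pages generated from the same layout.
page_a = "<html><body><h1>Book</h1><p>Title: <b>Dune</b></p><p>Price: <b>9.99</b></p></body></html>"
page_b = "<html><body><h1>Book</h1><p>Title: <b>Emma</b></p><p>Price: <b>4.50</b></p></body></html>"

template = infer_template([page_a, page_b])
print(extract_data(page_a, template))  # ['Dune', '9.99']
print(extract_data(page_b, template))  # ['Emma', '4.50']
```

With only two pages, any value that happens to be identical on both would be absorbed into the template, which is one reason a larger page set is needed in practice.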
Pages: 708-713
Number of pages: 6