Schema Inference and Data Extraction from Templatized Web Pages

Authors
Krishna, Shinde Santaji [1]
Dattatraya, Joshi Shashank [2]
Affiliations
[1] Shri Jagdish Prasad Jhabarmal Tibrewala Univ, Dept Comp Engn, Jhunjhunu, Rajasthan, India
[2] Bharati Vidyapeeth Deemed Univ, Coll Engn, Dept Comp Engn, Pune, Maharashtra, India
Keywords
Data Extraction; Multiple Tree Merging; Schema; Vision-based Page Segmentation; Web page;
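DOI
Not available
Chinese Library Classification (CLC) number
TM [Electrical engineering]; TN [Electronics and communication technology];
Discipline classification code
0808; 0809;
Abstract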
The World Wide Web is a vast and rapidly growing source of information. A web data extraction system extracts data from web pages automatically. Many websites consist largely of pages that contain structured data, so extracting information from their Web documents is an important step in Web information integration. This paper presents an unsupervised approach to page-level data extraction that automatically detects the schema of web pages. Web pages are compared using visual cues to determine whether they share a fixed template or use variant templates. Data regions are then extracted from the pages; if the pages belong to a fixed template, the schema is recognized by applying tree merging, tree alignment, and mining techniques. For heterogeneous template pages, a variant tree matching algorithm is used.
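The abstract does not specify the tree matching algorithm the authors use. As a rough illustration of how two page trees might be compared to decide whether they share a template, the sketch below uses Simple Tree Matching, a standard technique swapped in here as an assumption. The names Node, simple_tree_matching, and similarity are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A simplified DOM node: a tag name plus ordered children (hypothetical type)."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def size(node: Node) -> int:
    """Number of nodes in the subtree rooted at `node`."""
    return 1 + sum(size(child) for child in node.children)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Size of the maximum order-preserving matching between trees a and b.

    This is Simple Tree Matching, used here as a stand-in; it is not
    claimed to be the paper's algorithm.
    """
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    # table[i][j]: best matching between the first i children of a
    # and the first j children of b.
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            table[i][j] = max(
                table[i][j - 1],
                table[i - 1][j],
                table[i - 1][j - 1]
                + simple_tree_matching(a.children[i - 1], b.children[j - 1]),
            )
    return table[m][n] + 1  # +1 for the matched roots


def similarity(a: Node, b: Node) -> float:
    """Matching size normalized by the average tree size, in [0, 1]."""
    return 2 * simple_tree_matching(a, b) / (size(a) + size(b))


page1 = Node("html", [Node("body", [Node("div", [Node("span"), Node("span")])])])
page2 = Node("html", [Node("body", [Node("div", [Node("span")]), Node("p")])])
score = similarity(page1, page2)
```

A score near 1 would suggest that the two pages come from the same fixed template, while a lower score would suggest variant templates. Any threshold for that decision is an assumption that would need tuning on real pages.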
Pages: 6