Schema Inference and Data Extraction from Templatized Web Pages

Authors
Krishna, Shinde Santaji [1]
Dattatraya, Joshi Shashank [2]
Affiliations
[1] Shri Jagdish Prasad Jhabarmal Tibrewala Univ, Dept Comp Engn, Jhunjhunu, Rajasthan, India
[2] Bharati Vidyapeeth Deemed Univ, Coll Engn, Dept Comp Engn, Pune, Maharashtra, India
Keywords
Data Extraction; Multiple Tree Merging; Schema; Vision-based Page Segmentation; Web page;
DOI
Not available
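Chinese Library Classification number
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology];
Discipline classification code
0808; 0809;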
Abstract
The World Wide Web is a vast and rapidly growing source of information. A web data extraction system extracts data from web pages automatically. Many websites consist largely of pages that contain structured data, so extracting information from their Web documents is an important step in Web information integration. This paper presents an unsupervised approach to page-level data extraction that automatically detects the schema of web pages. Web pages are compared based on visual cues to determine whether they follow a fixed or a variant template. Data regions are then extracted from the pages; if the pages belong to a fixed template, the schema is recognized by applying tree merging, tree alignment, and mining techniques. For heterogeneous template pages, a variant tree matching algorithm is used.
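As a rough illustration of the template comparison step described in the abstract, the sketch below scores two page trees with Simple Tree Matching, a standard tree-comparison technique that is swapped in here for illustration and is not necessarily the authors' algorithm. The abstract compares pages on visual cues, whereas this sketch uses tag structure only. The names `Node`, `simple_tree_matching`, `similarity`, `same_template`, and the 0.8 threshold are all assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical minimal DOM-like node: a tag name plus ordered children."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def size(node: Node) -> int:
    """Number of nodes in the subtree rooted at `node`."""
    return 1 + sum(size(child) for child in node.children)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Size of the largest order-preserving, top-down match between two trees."""
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    # dp[i][j]: best match count using the first i children of a and first j of b
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = max(
                dp[i - 1][j],
                dp[i][j - 1],
                dp[i - 1][j - 1]
                + simple_tree_matching(a.children[i - 1], b.children[j - 1]),
            )
    return dp[m][n] + 1  # +1 for the matched roots


def similarity(a: Node, b: Node) -> float:
    """Normalized score in [0, 1]; 1.0 means structurally identical trees."""
    return 2 * simple_tree_matching(a, b) / (size(a) + size(b))


def same_template(a: Node, b: Node, threshold: float = 0.8) -> bool:
    """Assumed decision rule: pages above the threshold share a fixed template."""
    return similarity(a, b) >= threshold
```

Two pages that differ only in the number of repeated list items would score close to 1 under this measure, while pages with unrelated layouts would score low and would fall under the heterogeneous (variant) template case.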
Pages: 6