Schema Inference and Data Extraction from Templatized Web Pages

Cited by: 0
Authors
Krishna, Shinde Santaji [1]
Dattatraya, Joshi Shashank [2]
Affiliations
[1] Shri Jagdish Prasad Jhabarmal Tibrewala Univ, Dept Comp Engn, Jhunjhunu, Rajasthan, India
[2] Bharati Vidyapeeth Deemed Univ, Coll Engn, Dept Comp Engn, Pune, Maharashtra, India
Keywords
Data Extraction; Multiple Tree Merging; Schema; Vision-based Page Segmentation; Web page
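DOI
Not available
Chinese Library Classification number
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology]
Discipline classification code
0808; 0809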
Abstract
The World Wide Web is a vast and rapidly growing source of information. A web data extraction system extracts data from web pages automatically. Many websites consist mostly of pages that contain structured data. Thus, for Web information integration, an important step is to extract information from the Web documents of these websites. This paper presents an unsupervised approach to the page-level data extraction task. It automatically detects the schema of web pages. Web pages are compared based on visual cues to identify fixed-template and variant-template pages. Data regions are then extracted from the web pages; if the pages belong to a fixed template, the schema is recognized by applying tree merging, tree alignment, and mining techniques. For heterogeneous template pages, a variant tree matching algorithm is used.
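The idea of recognizing a schema by merging page trees can be illustrated with a minimal sketch. It is not the paper's multiple tree merging or variant tree matching algorithm, and it omits the vision-based comparison entirely. It assumes two pages already share a fixed template, aligns their trees position by position, and marks text that differs between pages as a data slot. The names `Node`, `merge`, `data_fields`, and the `#DATA` marker are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional

DATA = "#DATA"  # hypothetical marker for a variable (data) slot in the schema


@dataclass
class Node:
    tag: str
    text: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def merge(a: Node, b: Node) -> Optional[Node]:
    """Merge two page trees into a schema tree.

    Returns None when the trees do not align position-wise, which this
    sketch treats as "not the same fixed template".
    """
    if a.tag != b.tag or len(a.children) != len(b.children):
        return None
    # Identical text is template content; differing text is a data slot.
    text = a.text if a.text == b.text else DATA
    merged_children = []
    for child_a, child_b in zip(a.children, b.children):
        merged = merge(child_a, child_b)
        if merged is None:
            return None
        merged_children.append(merged)
    return Node(a.tag, text, merged_children)


def data_fields(schema: Node, page: Node) -> List[Optional[str]]:
    """Collect the page's text at every position the schema marks as data."""
    out: List[Optional[str]] = [page.text] if schema.text == DATA else []
    for schema_child, page_child in zip(schema.children, page.children):
        out.extend(data_fields(schema_child, page_child))
    return out


# Two toy pages generated from the same template.
page1 = Node("div", children=[Node("span", "Price:"), Node("span", "10")])
page2 = Node("div", children=[Node("span", "Price:"), Node("span", "25")])

schema = merge(page1, page2)
print(data_fields(schema, page1))
print(data_fields(schema, page2))
```

In this toy case the label "Price:" is identical on both pages and stays in the schema as template text, while the second `span` becomes a data slot. Pages whose trees differ in shape make `merge` return `None`; handling those would need a more tolerant matching step, such as the variant tree matching the abstract mentions.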
Pages: 6
Related papers
50 items in total
  • [21] Wrapper inference for ambiguous web pages
    Crescenzi, Valter
    Merialdo, Paolo
    APPLIED ARTIFICIAL INTELLIGENCE, 2008, 22 (1-2) : 21 - 52
  • [22] Data Engineered Content Extraction Studies for Indian Web Pages
    Kolla, Bhanu Prakash
    Raman, Arun Raja
    COMPUTATIONAL INTELLIGENCE IN DATA MINING, 2019, 711 : 505 - 512
  • [23] A Novel Approach for Extraction and Representation of Main Data from Web Pages to Android Application
    Veeraiah, D.
    Ramanjaneyulu, Y. V.
    Yakobu, D.
    Sahithi, T.
    2016 IEEE INTERNATIONAL CONFERENCE ON RECENT TRENDS IN ELECTRONICS, INFORMATION & COMMUNICATION TECHNOLOGY (RTEICT), 2016, : 1126 - 1130
  • [24] The mining and extraction of primary informative blocks and data objects from systematic web pages
    Tseng, Yi-Feng
    Kao, Hung-Yu
    2006 IEEE/WIC/ACM INTERNATIONAL CONFERENCE ON WEB INTELLIGENCE, (WI 2006 MAIN CONFERENCE PROCEEDINGS), 2006, : 370 - +
  • [25] Towards XML Schema Extraction from Deep Web
    Saissi, Yasser
    Zellou, Ahmed
    Idri, Ali
    2016 4TH IEEE INTERNATIONAL COLLOQUIUM ON INFORMATION SCIENCE AND TECHNOLOGY (CIST), 2016, : 94 - 99
  • [26] Information extraction from Web pages using semi-structured data alignment
    Kuboyama, Tetsuji
    Miyahara, Tetsuhiro
    Hirokawa, Sachio
    Itou, Eisuke
    WMSCI 2005: 9th World Multi-Conference on Systemics, Cybernetics and Informatics, Vol 1, 2005, : 42 - 47
  • [27] Turkish Keyphrase Extraction from Web Pages with BERT
    Ayan, Emre Tolga
    Arslan, Rabia
    Zengin, Muhammed Said
    Duru, Haci Ali
    Salman, Sedat
    Bardak, Batuhan
    29TH IEEE CONFERENCE ON SIGNAL PROCESSING AND COMMUNICATIONS APPLICATIONS (SIU 2021), 2021,
  • [28] A Novel Approach for Content Extraction from Web Pages
    Bhardwaj, Aanshi
    Mangat, Veenu
    2014 RECENT ADVANCES IN ENGINEERING AND COMPUTATIONAL SCIENCES (RAECS), 2014,
  • [29] On-line Versioned Schema Inference for Large Semantic Web Data Sources
    Kellou-Menouer, Kenza
    Kedad, Zoubida
    SSDBM 2017: 29TH INTERNATIONAL CONFERENCE ON SCIENTIFIC AND STATISTICAL DATABASE MANAGEMENT, 2017,
  • [30] An effective method supporting data extraction and schema recognition on deep web
    Liu, Wei
    Shen, Derong
    Nie, Tiezheng
    PROGRESS IN WWW RESEARCH AND DEVELOPMENT, PROCEEDINGS, 2008, 4976 : 419 - 431