Schema Inference and Data Extraction from Templatized Web Pages

Cited by: 0

Authors:

Krishna, Shinde Santaji [1]

Dattatraya, Joshi Shashank [2]

Affiliations:

[1] Shri Jagdish Prasad Jhabarmal Tibrewala Univ, Dept Comp Engn, Jhunjhunu, Rajasthan, India

[2] Bharati Vidyapeeth Deemed Univ, Coll Engn, Dept Comp Engn, Pune, Maharashtra, India

Source:

2015 INTERNATIONAL CONFERENCE ON PERVASIVE COMPUTING (ICPC) | 2015

Keywords:

Data Extraction; Multiple Tree Merging; Schema; Vision-based Page Segmentation; Web page;

DOI:

Not available

Chinese Library Classification (CLC) number:

TM [Electrical engineering]; TN [Electronic and communication technology];

Subject classification codes:

0808 ; 0809 ;

Abstract:

The World Wide Web is a vast and rapidly growing source of information. A web data extraction system extracts data from web pages automatically. Many web sites consist largely of pages that contain structured data, so extracting information from the Web documents of these sites is an important step in Web information integration. This paper presents an unsupervised approach to page-level data extraction that automatically detects the schema of web pages. Web pages are compared based on visual clues to determine whether they share a fixed template or follow variant templates. Data regions are then extracted from the pages; if the pages belong to a fixed template, the schema is recognized by applying tree merging, tree alignment, and mining techniques. For heterogeneous template pages, a variant tree matching algorithm is used.
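The abstract does not specify how pages are compared or how trees are matched, so the following is only an illustrative sketch of the general idea of deciding whether two pages share a template. It swaps in the classic Simple Tree Matching algorithm over tag trees as a stand-in for the paper's visual-clue comparison. The `Node` class, the normalization by average tree size, and the 0.8 threshold are all assumptions made for illustration, not details taken from the paper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical minimal DOM node: a tag name plus ordered children."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def tree_size(node: Node) -> int:
    """Count the nodes in the subtree rooted at `node`."""
    return 1 + sum(tree_size(child) for child in node.children)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Size of the maximum order-preserving matching between two trees.

    This is Simple Tree Matching, used here as a stand-in; the paper's
    own matching procedure is not described in the abstract.
    """
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    # best[i][j]: best matching between the first i children of a
    # and the first j children of b.
    best = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best[i][j] = max(
                best[i - 1][j],
                best[i][j - 1],
                best[i - 1][j - 1]
                + simple_tree_matching(a.children[i - 1], b.children[j - 1]),
            )
    return best[m][n] + 1  # +1 for the matched roots


def similarity(a: Node, b: Node) -> float:
    """Matching size normalized by the average tree size (assumed metric)."""
    return simple_tree_matching(a, b) / ((tree_size(a) + tree_size(b)) / 2)


def same_template(a: Node, b: Node, threshold: float = 0.8) -> bool:
    """Treat two pages as fixed-template if similarity reaches the
    threshold. The value 0.8 is an arbitrary illustrative choice."""
    return similarity(a, b) >= threshold


# Two toy page trees with different layouts.
page_a = Node("html", [Node("body", [
    Node("div", [Node("p"), Node("p")]),
    Node("ul", [Node("li")]),
])])
page_c = Node("html", [Node("body", [
    Node("div", [Node("p")]),
    Node("table", [Node("tr")]),
])])

print(similarity(page_a, page_a))     # identical trees
print(same_template(page_a, page_c))  # differing layouts
```

In this sketch, pages whose similarity falls below the threshold would be routed to a separate path for heterogeneous templates, which mirrors the fixed/variant split the abstract describes without reproducing its actual algorithm.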

Pages: 6
