Schema Inference and Data Extraction from Templatized Web Pages

Cited by: 0

Authors:

Krishna, Shinde Santaji [1]

Dattatraya, Joshi Shashank [2]

Affiliations:

[1] Shri Jagdish Prasad Jhabarmal Tibrewala Univ, Dept Comp Engn, Jhunjhunu, Rajasthan, India

[2] Bharati Vidyapeeth Deemed Univ, Coll Engn, Dept Comp Engn, Pune, Maharashtra, India

Source:

2015 INTERNATIONAL CONFERENCE ON PERVASIVE COMPUTING (ICPC) | 2015

Keywords:

Data Extraction; Multiple Tree Merging; Schema; Vision-based Page Segmentation; Web page;

DOI:

Not available

Chinese Library Classification (CLC):

TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology];

Discipline classification codes:

0808 ; 0809 ;

Abstract:

The World Wide Web is a vast and rapidly growing source of information. A web data extraction system extracts data from web pages automatically. Many web sites consist largely of pages that contain structured data. Thus, for Web information integration, an important step is to extract information from the Web documents of such sites. This paper presents an unsupervised approach to the page-level data extraction task. It automatically detects the schema of web pages. Web pages are compared based on visual clues to identify fixed-template and variant-template pages. Data regions are then extracted from the web pages; if the pages belong to a fixed template, the schema is recognized by applying tree merging, tree alignment, and mining techniques. For heterogeneous template pages, a variant tree matching algorithm is used.
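The abstract does not say how pages are compared or which tree matching algorithm is applied. As an illustration only, the sketch below uses the classic Simple Tree Matching algorithm on tag trees as a stand-in for deciding whether two pages share a fixed template. The `Node` class, the similarity normalization, and the 0.8 threshold are assumptions made for this example and are not taken from the paper, which also relies on visual clues that this sketch ignores.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical minimal DOM node: a tag name plus ordered children."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def tree_size(node: Node) -> int:
    """Count the nodes in the subtree rooted at `node`."""
    return 1 + sum(tree_size(child) for child in node.children)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Return the size of the maximum order-preserving match of two trees.

    Roots must carry the same tag; children are aligned with a dynamic
    program similar to longest common subsequence.
    """
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    best = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            matched = simple_tree_matching(a.children[i - 1], b.children[j - 1])
            best[i][j] = max(best[i - 1][j], best[i][j - 1],
                             best[i - 1][j - 1] + matched)
    return best[m][n] + 1  # +1 for the matched roots


def template_similarity(a: Node, b: Node) -> float:
    """Normalize the match size to [0, 1] (assumed normalization)."""
    return 2 * simple_tree_matching(a, b) / (tree_size(a) + tree_size(b))


def same_template(a: Node, b: Node, threshold: float = 0.8) -> bool:
    """Treat two pages as fixed-template if similarity passes the assumed threshold."""
    return template_similarity(a, b) >= threshold
```

Under these assumptions, two pages that differ only in the number of repeated list items score close to 1 and would be routed to the fixed-template path, while pages with unrelated layouts score low and would fall to the heterogeneous-template path.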


Pages: 6

