Information extraction from Web pages using presentation regularities and domain knowledge

Cited by: 11

Authors:

Vadrevu, Srinivas [1]

Gelgi, Fatih [1]

Davulcu, Hasan [1]

Affiliations:

[1] Arizona State Univ, Dept Comp Sci & Engn, Tempe, AZ 85287 USA

Source:

WORLD WIDE WEB-INTERNET AND WEB INFORMATION SYSTEMS | 2007 / Vol. 10 / Issue 02

Keywords:

information extraction; web; page segmentation; grammar induction; pattern mining; semantic partitioner; metadata; domain knowledge; statistical domain model;

DOI:

10.1007/s11280-007-0021-1

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812 ;

Abstract:

The World Wide Web is transforming itself into the largest information resource, making information extraction (IE) from the Web an important and challenging problem. In this paper, we present an automated, domain-independent IE system that transforms a given Web page into a semi-structured hierarchical document using presentation regularities. The resulting documents are weakly annotated in the sense that they may contain many incorrect annotations and missing labels. We also describe how to improve the quality of weakly annotated data by using domain knowledge in the form of a statistical domain model. We demonstrate that such a system can recover from ambiguities in the presentation and boost the overall accuracy of a base information extractor by up to 20%. Our experimental evaluations with TAP data, computer science department Web sites, and RoadRunner document sets indicate that our algorithms can scale up to very large data sets.

Pages: 157-179

Page count: 23

50 records in total

[21] Shallow Information Extraction for the Knowledge Web
Barbosa, Denilson
Wang, Haixun
Yu, Cong
2013 IEEE 29TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE), 2013, : 1264 - 1267
[22] Data extraction from Deep Web pages
Yang, Jufeng
Shi, Guangshun
Zheng, Yan
Wang, Qingren
CIS: 2007 INTERNATIONAL CONFERENCE ON COMPUTATIONAL INTELLIGENCE AND SECURITY, PROCEEDINGS, 2007, : 237 - 241
[23] Extraction of Informative Blocks from Web Pages
Cao, YuJuan
Niu, ZhenDong
Dai, LiuLing
Zhao, YuMing
ALPIT 2008: SEVENTH INTERNATIONAL CONFERENCE ON ADVANCED LANGUAGE PROCESSING AND WEB INFORMATION TECHNOLOGY, PROCEEDINGS, 2008, : 544 - 549
[24] Advertising Keywords Extraction from Web Pages
Liu, Jianyi
Wang, Cong
Liu, Zhengyang
Yao, Wenbin
WEB INFORMATION SYSTEMS AND MINING, 2010, 6318 : 336 - 343
[25] Extraction of hidden semantics from web pages
Carchiolo, V
Longheu, A
Malgeri, M
INTELLIGENT DATA ENGINEERING AND AUTOMATED LEARNING - IDEAL 2002, 2002, 2412 : 117 - 122
[26] Grouping web pages about persons and organizations for information extraction
Ye, SR
Chua, TS
Liu, JM
Kei, JR
DIGITAL LIBRARIES: PEOPLE, KNOWLEDGE, AND TECHNOLOGY, PROCEEDINGS, 2002, 2555 : 241 - 251
[27] Learning (k, l)-contextual tree languages for information extraction from web pages
Raeymaekers, Stefan
Bruynooghe, Maurice
Van den Bussche, Jan
MACHINE LEARNING, 2008, 71 (2-3) : 155 - 183
[28] Learning information extraction patterns from tabular web pages without manual labelling
Gao, XY
Zhang, MJ
Andreae, P
IEEE/WIC INTERNATIONAL CONFERENCE ON WEB INTELLIGENCE, PROCEEDINGS, 2003, : 495 - 498
[29] Learning (k,l)-contextual tree languages for information extraction from web pages
Stefan Raeymaekers
Maurice Bruynooghe
Jan Van den Bussche
Machine Learning, 2008, 71 : 155 - 183
[30] An open platform for collecting domain specific web pages and extracting information from them
Karkaletsis, V
Spyropoulos, CD
Knowledge Mining, 2005, 185 : 147 - 157