Information extraction from Web pages using presentation regularities and domain knowledge

Cited by: 11

Authors:

Vadrevu, Srinivas [1]

Gelgi, Fatih [1]

Davulcu, Hasan [1]

Affiliations:

[1] Arizona State Univ, Dept Comp Sci & Engn, Tempe, AZ 85287 USA

Source:

WORLD WIDE WEB-INTERNET AND WEB INFORMATION SYSTEMS | 2007 / Vol. 10 / Issue 02

Keywords:

information extraction; web; page segmentation; grammar induction; pattern mining; semantic partitioner; metadata; domain knowledge; statistical domain model

DOI:

10.1007/s11280-007-0021-1

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Discipline classification code:

0812;

Abstract:

The World Wide Web is transforming itself into the largest information resource, making information extraction (IE) from the Web an important and challenging problem. In this paper, we present an automated, domain-independent IE system that transforms a given Web page into a semi-structured hierarchical document using presentation regularities. The resulting documents are weakly annotated in the sense that they might contain many incorrect annotations and missing labels. We also describe how to improve the quality of weakly annotated data by using domain knowledge in the form of a statistical domain model. We demonstrate that such a system can recover from ambiguities in the presentation and boost the overall accuracy of a base information extractor by up to 20%. Our experimental evaluations with TAP data, computer science department Web sites, and RoadRunner document sets indicate that our algorithms can scale up to very large data sets.

Pages: 157-179

Number of pages: 23
