Information extraction from Web pages using presentation regularities and domain knowledge

Cited by: 11
Authors
Vadrevu, Srinivas [1]
Gelgi, Fatih [1]
Davulcu, Hasan [1]
Affiliations
[1] Arizona State Univ, Dept Comp Sci & Engn, Tempe, AZ 85287 USA
Keywords
information extraction; web; page segmentation; grammar induction; pattern mining; semantic partitioner; metadata; domain knowledge; statistical domain model
DOI
10.1007/s11280-007-0021-1
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
The World Wide Web is transforming itself into the largest information resource, making information extraction (IE) from the Web an important and challenging problem. In this paper, we present an automated, domain-independent IE system that transforms a given Web page into a semi-structured hierarchical document using presentation regularities. The resulting documents are weakly annotated in the sense that they might contain many incorrect annotations and missing labels. We also describe how to improve the quality of weakly annotated data by using domain knowledge in the form of a statistical domain model. We demonstrate that such a system can recover from ambiguities in the presentation and boost the overall accuracy of a base information extractor by up to 20%. Our experimental evaluations with TAP data, computer science department Web sites, and RoadRunner document sets indicate that our algorithms can scale up to very large data sets.
Pages: 157-179
Page count: 23