Detecting tables in web documents

Cited by: 14
Authors
Kim, YS [1]
Lee, KH [1]
Affiliations
[1] Yonsei Univ, Dept Comp Sci, Seodaemun Ku, Seoul 120749, South Korea
Keywords
table detection; HTML document; web document analysis; attribute-value relations extraction; information extraction
DOI
10.1016/j.engappai.2005.01.009
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
TABLE tags in HTML (Hypertext Markup Language) documents are widely used both for formatting the layout of Web documents and for describing genuine tables with relational information. As a prerequisite for information extraction from the Web, this paper presents an efficient method for sophisticated table detection. The proposed method consists of two phases: preprocessing and attribute-value relations extraction. During preprocessing, some genuine and non-genuine tables are filtered out using a set of rules devised from a careful examination of the general characteristics of various HTML tables. The remaining tables are classified in the attribute-value relations extraction phase. Specifically, a value area is extracted and checked for syntactic coherency. Furthermore, the method looks for semantic coherency between the attribute area and the value area of a table. Experimental results with 11,477 TABLE tags from 1,393 HTML documents show that the method performs better than previous work, with a precision of 97.54% and a recall of 99.22%. (c) 2005 Elsevier Ltd. All rights reserved.
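The two-phase idea in the abstract can be illustrated with a minimal Python sketch. The abstract does not state the actual rules, so everything here is an assumption made for illustration: the size rule used for preprocessing, the treatment of the first row as the attribute area, and the number-versus-text check standing in for syntactic coherency. The names `TableCollector`, `cell_kind`, `is_genuine`, and `detect_tables` are hypothetical, and the semantic coherency check is not modeled at all.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect the cell text of every TABLE element as a list of rows."""

    def __init__(self):
        super().__init__()
        self.tables = []   # finished tables, each a list of rows
        self._stack = []   # tables currently open (simplistic nesting support)
        self._cell = None  # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._stack.append([])
        elif tag == "tr" and self._stack:
            self._stack[-1].append([])
        elif tag in ("td", "th") and self._stack and self._stack[-1]:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            if self._stack and self._stack[-1]:
                self._stack[-1][-1].append("".join(self._cell).strip())
            self._cell = None
        elif tag == "table" and self._stack:
            self.tables.append(self._stack.pop())


def cell_kind(text):
    """Classify a cell as 'number' or 'text' (assumed coherency feature)."""
    try:
        float(text.replace(",", ""))
        return "number"
    except ValueError:
        return "text"


def is_genuine(rows):
    """Apply the assumed rules to one table given as a list of rows."""
    # Phase 1 (assumed rule): tables that are too small or ragged are
    # treated as layout tables and filtered out early.
    if len(rows) < 3 or len(rows[0]) < 2:
        return False
    if any(len(row) != len(rows[0]) for row in rows):
        return False
    # Phase 2 (assumed rule): the first row is the attribute area, the
    # rest is the value area; every value column must hold one kind of data.
    for column in zip(*rows[1:]):
        if len({cell_kind(cell) for cell in column}) != 1:
            return False
    return True


def detect_tables(html):
    """Return one genuine/non-genuine flag per TABLE, in closing order."""
    parser = TableCollector()
    parser.feed(html)
    return [is_genuine(table) for table in parser.tables]
```

As a usage note, calling `detect_tables` on a page with a one-cell layout table followed by a small item/price table flags only the second as genuine under these assumed rules. A real detector would need a far richer rule set than this sketch.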
Pages: 745-757
Number of pages: 13