Detecting tables in web documents

Cited by: 14
Authors
Kim, YS [1]
Lee, KH [1]
Affiliations
[1] Yonsei Univ, Dept Comp Sci, Seodaemun Ku, Seoul 120749, South Korea
Keywords
table detection; HTML document; web document analysis; attribute-value relations extraction; information extraction
DOI
10.1016/j.engappai.2005.01.009
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
TABLE tags in HTML (Hypertext Markup Language) documents are widely used both for laying out Web pages and for describing genuine tables that carry relational information. As a prerequisite for information extraction from the Web, this paper presents an efficient method for table detection. The proposed method consists of two phases: preprocessing and attribute-value relations extraction. During preprocessing, a portion of the tables, genuine or non-genuine, is filtered out by a set of rules derived from careful examination of the general characteristics of various HTML tables. The remaining tables are classified in the attribute-value relations extraction phase. Specifically, a value area is extracted and checked for syntactic coherency. The method then looks for semantic coherency between the attribute area and the value area of a table. Experimental results with 11,477 TABLE tags from 1,393 HTML documents show that the method performs better than previous work, with a precision of 97.54% and a recall of 99.22%. (c) 2005 Elsevier Ltd. All rights reserved.
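The two-phase structure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not list the preprocessing rules or define the coherency tests, so the rules, the `threshold` parameter, the choice of the first row as the attribute area, and all function names below are hypothetical stand-ins. The semantic coherency check is omitted because the abstract gives no detail on it.

```python
import re
from typing import List, Optional

# A table is modelled as rows of cell text; a stand-in for a parsed TABLE tag.
Table = List[List[str]]


def preprocess(table: Table) -> Optional[bool]:
    """Phase 1 (assumed rules): decide obvious cases early.

    Returns True/False when a rule fires, or None to defer to phase 2.
    Both rules are illustrative guesses, not the paper's rule set.
    """
    cells = [c for row in table for c in row]
    if len(table) < 2 or max((len(r) for r in table), default=0) < 2:
        return False  # single row or single column: treated as layout
    if all(not c.strip() for c in cells):
        return False  # no textual content at all
    return None


def cell_type(text: str) -> str:
    """Coarse syntactic type of a cell: 'empty', 'number' or 'text'."""
    t = text.strip()
    if not t:
        return "empty"
    if re.fullmatch(r"[-+]?[\d,]*\.?\d+%?", t):
        return "number"
    return "text"


def syntactically_coherent(table: Table, threshold: float = 0.8) -> bool:
    """Phase 2 (assumed test): every column of the value area must be
    dominated by one cell type. The first row is assumed to be the
    attribute area and the remaining rows the value area."""
    header, values = table[0], table[1:]
    for col in range(len(header)):
        types = [cell_type(row[col]) for row in values if col < len(row)]
        if not types:
            return False
        dominant = max(set(types), key=types.count)
        if types.count(dominant) / len(types) < threshold:
            return False
    return True


def is_genuine(table: Table) -> bool:
    """Run phase 1, then fall back to phase 2 for undecided tables."""
    decided = preprocess(table)
    if decided is not None:
        return decided
    return syntactically_coherent(table)


data = [["City", "Population"], ["Seoul", "9,700,000"], ["Busan", "3,400,000"]]
layout = [["Home | About | Contact"]]
mixed = [["a", "b"], ["x", "1"], ["2", "y"], ["z", "3"]]
```

Under these assumptions, `is_genuine(data)` accepts the table because each value column holds a single cell type, while `layout` is rejected in phase 1 and `mixed` fails the phase 2 coherency test.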
Pages: 745-757
Page count: 13