Detecting tables in web documents

Cited by: 14
Authors
Kim, YS [1]
Lee, KH [1]
Affiliations
[1] Yonsei Univ, Dept Comp Sci, Seodaemun Ku, Seoul 120749, South Korea
Keywords
table detection; HTML document; web document analysis; attribute-value relations extraction; information extraction
DOI
10.1016/j.engappai.2005.01.009
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
TABLE tags in HTML (Hypertext Markup Language) documents are widely used both to format the layout of Web documents and to describe genuine tables that carry relational information. As a prerequisite for information extraction from the Web, this paper presents an efficient method for sophisticated table detection. The proposed method consists of two phases: preprocessing and attribute-value relations extraction. During preprocessing, some genuine and non-genuine tables are filtered out using a set of rules devised from a careful examination of the general characteristics of various HTML tables. The remaining tables are classified in the attribute-value relations extraction phase. Specifically, a value area is extracted and checked for syntactic coherency. The method then looks for semantic coherency between the attribute area and the value area of a table. Experimental results with 11,477 TABLE tags from 1,393 HTML documents show that the method performs better than previous work, with a precision of 97.54% and a recall of 99.22%. © 2005 Elsevier Ltd. All rights reserved.
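The two-phase idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not list the preprocessing rules or the coherency measures, so the rule in `prefilter`, the cell types in `cell_type`, the 0.8 threshold, and the assumption that the first row is the attribute area are all hypothetical choices made for illustration. The semantic coherency check between the attribute area and the value area is not modeled, and the parser assumes reasonably well-formed markup.

```python
from collections import Counter
from html.parser import HTMLParser
import re


class TableCollector(HTMLParser):
    """Collect each top-level TABLE as a list of rows of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []   # finished top-level tables
        self._depth = 0    # current nesting depth of <table> tags
        self._rows = None  # rows of the table being read
        self._cell = None  # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._depth += 1
            if self._depth == 1:
                self._rows = []
        elif self._depth == 1 and tag == "tr":
            self._rows.append([])
        elif self._depth == 1 and tag in ("td", "th"):
            if not self._rows:  # cell without an enclosing <tr>
                self._rows.append([])
            self._cell = []

    def handle_endtag(self, tag):
        if tag == "table" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.tables.append(self._rows)
                self._rows = None
        elif self._depth == 1 and tag in ("td", "th") and self._cell is not None:
            self._rows[-1].append(" ".join("".join(self._cell).split()))
            self._cell = None

    def handle_data(self, data):
        # Text inside nested tables is ignored in this sketch.
        if self._depth == 1 and self._cell is not None:
            self._cell.append(data)


def cell_type(text):
    """Hypothetical coarse syntactic type of a cell."""
    if not text:
        return "empty"
    if re.fullmatch(r"[-+]?[\d,]*\.?\d+%?", text):
        return "number"
    return "text"


def prefilter(rows):
    """Phase 1 (hypothetical rule): a table needs at least 2 rows and 2 columns."""
    if len(rows) < 2 or min(len(r) for r in rows) < 2:
        return "non-genuine"
    return None  # undecided, pass to phase 2


def column_coherency(rows):
    """Phase 2 (assumed measure): mean share of the dominant cell type per column.

    Assumes the first row is the attribute area and the rest is the value area.
    """
    body = rows[1:]
    width = min(len(r) for r in body)
    scores = []
    for c in range(width):
        types = [cell_type(r[c]) for r in body]
        scores.append(Counter(types).most_common(1)[0][1] / len(types))
    return sum(scores) / len(scores)


def detect_tables(html, threshold=0.8):
    """Return one boolean per top-level TABLE: True if judged genuine."""
    parser = TableCollector()
    parser.feed(html)
    parser.close()
    results = []
    for rows in parser.tables:
        if prefilter(rows) == "non-genuine":
            results.append(False)
        else:
            results.append(column_coherency(rows) >= threshold)
    return results
```

Calling `detect_tables(page_source)` on a page's HTML yields one verdict per top-level TABLE tag. A type-based check like this accepts any multi-column table whose columns are uniformly text, which is one reason a real detector would need further rules and a semantic check of the kind the abstract describes.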
Pages: 745-757
Number of pages: 13