Detecting tables in web documents

Cited by: 14
Authors
Kim, YS [1 ]
Lee, KH [1 ]
Affiliation
[1] Yonsei Univ, Dept Comp Sci, Seodaemun Ku, Seoul 120749, South Korea
Keywords
table detection; HTML document; web document analysis; attribute-value relations extraction; information extraction
DOI
10.1016/j.engappai.2005.01.009
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
The TABLE tags in HTML (Hypertext Markup Language) documents are widely used both for formatting the layout of Web documents and for describing genuine tables with relational information. As a prerequisite for information extraction from the Web, this paper presents an efficient method for sophisticated table detection. The proposed method consists of two phases: preprocessing and attribute-value relations extraction. During preprocessing, some genuine and non-genuine tables are filtered out using a set of rules devised from a careful examination of the general characteristics of various HTML tables. The remaining tables are detected in the attribute-value relations extraction phase. Specifically, a value area is extracted and checked for syntactic coherency. Furthermore, the method looks for semantic coherency between the attribute area and the value area of a table. Experimental results with 11,477 TABLE tags from 1393 HTML documents show that the method performs better than previous work, with a precision of 97.54% and a recall of 99.22%. (c) 2005 Elsevier Ltd. All rights reserved.
Pages: 745-757
Number of pages: 13
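
The two-phase idea in the abstract (rule-based preprocessing, then a coherency check on the value area) can be illustrated with a minimal Python sketch. It is not the authors' method: the abstract does not list the actual rules, so the size rule, the coarse cell types, the choice of the first row as the attribute area, and the names `TableCollector`, `coarse_type`, `looks_genuine`, and `detect_tables` are all assumptions made for illustration. The semantic coherency check is omitted.

```python
from html.parser import HTMLParser
import re


class TableCollector(HTMLParser):
    """Collect each TABLE as a list of rows; each row is a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []   # finished tables, in the order their end tags appear
        self._stack = []   # tables currently open (allows nested TABLE tags)
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._stack.append([])
            self._in_cell = False
        elif tag == "tr" and self._stack:
            self._stack[-1].append([])
        elif tag in ("td", "th") and self._stack and self._stack[-1]:
            self._stack[-1][-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "table" and self._stack:
            self.tables.append(self._stack.pop())
            self._in_cell = False
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        # Append text only while inside a cell of the innermost open table.
        if self._in_cell and self._stack and self._stack[-1] and self._stack[-1][-1]:
            self._stack[-1][-1][-1] += data.strip()


def coarse_type(cell):
    """Hypothetical coarse cell type used for the syntactic check."""
    if re.fullmatch(r"[-+]?\d[\d.,]*%?", cell):
        return "number"
    return "text" if cell else "empty"


def looks_genuine(rows):
    # Phase 1 (assumed rule): tables smaller than 2x2 or with ragged rows
    # are treated as layout tables.
    if len(rows) < 2 or min(len(r) for r in rows) < 2:
        return False
    width = len(rows[0])
    if any(len(r) != width for r in rows):
        return False
    # Phase 2 (assumed check): the first row is taken as the attribute area;
    # every column of the remaining value area must have one coarse type.
    value_rows = rows[1:]
    return all(
        len({coarse_type(r[col]) for r in value_rows}) == 1
        for col in range(width)
    )


def detect_tables(html):
    """Return one boolean per TABLE tag: True if it looks like a genuine table."""
    collector = TableCollector()
    collector.feed(html)
    return [looks_genuine(rows) for rows in collector.tables]
```

Calling `detect_tables` on a page returns one verdict per TABLE tag, so a single-cell layout table is rejected by the size rule, while a header row followed by consistently typed value rows is accepted.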