Detecting tables in web documents

Cited by: 14

Authors:

Kim, YS ^{[1]}

Lee, KH ^{[1]}

Affiliations:

[1] Yonsei Univ, Dept Comp Sci, Seodaemun Ku, Seoul 120749, South Korea

Source:

ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE | 2005 / Vol. 18 / Issue 06

Keywords:

table detection; HTML document; web document analysis; attribute-value relations extraction; information extraction

DOI:

10.1016/j.engappai.2005.01.009

Chinese Library Classification:

TP [Automation technology, computer technology]

Discipline classification code:

0812

Abstract:

The TABLE tags in HTML (Hypertext Markup Language) documents are widely used for formatting layout of Web documents as well as for describing genuine tables with relational information. As a prerequisite for information extraction from the Web, this paper presents an efficient method for sophisticated table detection. The proposed method consists of two phases: preprocessing and attribute-value relations extraction. During preprocessing, a part of genuine or non-genuine tables are filtered out using a set of rules, which are devised based on careful examination of general characteristics of various HTML tables. The remaining tables are detected at the attribute-value relations extraction phase. Specifically, a value area is extracted and checked out whether there is syntactic coherency. Furthermore, the method looks for semantic coherency between an attribute area and a value area of a table. Experimental results with 11,477 TABLE tags from 1393 HTML documents show that the method has performed better compared with previous works, resulting in a precision of 97.54% and a recall of 99.22%. (c) 2005 Elsevier Ltd. All rights reserved.

Pages: 745-757

Page count: 13
