Detecting tables in web documents

Cited by: 14

Authors:

Kim, YS [1]

Lee, KH [1]

Affiliations:

[1] Yonsei Univ, Dept Comp Sci, Seodaemun Ku, Seoul 120749, South Korea

Source:

ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE | 2005 / Vol. 18 / No. 06

Keywords:

table detection; HTML document; web document analysis; attribute-value relations extraction; information extraction

DOI:

10.1016/j.engappai.2005.01.009

Chinese Library Classification number:

TP [Automation Technology, Computer Technology]

Discipline classification code:

0812

Abstract:

TABLE tags in HTML (Hypertext Markup Language) documents are widely used both for formatting the layout of Web documents and for describing genuine tables that carry relational information. As a prerequisite for information extraction from the Web, this paper presents an efficient method for sophisticated table detection. The proposed method consists of two phases: preprocessing and attribute-value relations extraction. During preprocessing, some genuine and non-genuine tables are filtered out using a set of rules devised from careful examination of the general characteristics of various HTML tables. The remaining tables are classified in the attribute-value relations extraction phase. Specifically, a value area is extracted and checked for syntactic coherency. Furthermore, the method looks for semantic coherency between the attribute area and the value area of a table. Experimental results with 11,477 TABLE tags from 1,393 HTML documents show that the method performs better than previous work, achieving a precision of 97.54% and a recall of 99.22%. © 2005 Elsevier Ltd. All rights reserved.

Pages: 745-757

Page count: 13
