Ducky : A Data Extraction System for Various Structured Web Documents

Cited by: 2

Authors:

Kanaoka, Kei [1]

Fujii, Yotaro [1]

Toyama, Motomichi [1]

Affiliation:

[1] Keio Univ, Dept Comp Sci, Yokohama, Kanagawa, Japan

Source:

PROCEEDINGS OF THE 18TH INTERNATIONAL DATABASE ENGINEERING AND APPLICATIONS SYMPOSIUM (IDEAS14) | 2014

Keywords:

Data Extraction; Web scraping; Web Wrapper; CSS selector;

DOI:

10.1145/2628194.2628244

Chinese Library Classification (CLC) number:

TP301 [Theory, Methods];

Discipline classification code:

081202 ;

Abstract:

The World Wide Web has become a primary source of information, so extracting data from Web sources has become a key technology. In this paper, we introduce Ducky, a semi-automatic system that includes a Web wrapper which extracts data from Web sources and translates it into structured data. In Ducky, users define a configuration file consisting of several parameters (the URL of the Web page, the CSS selectors that locate the data to retrieve, and so on), so they do not need to write Web scraping programs at all. The definition is simple, yet it can extract data flexibly from various structured Web pages. Additionally, Ducky provides a Web API and several output data formats: XML, JSON, and CSV. Finally, experiments confirmed that Ducky can accurately extract data from 22 different structured Web sources.
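The configuration-driven approach described in the abstract can be illustrated with a minimal sketch. This is not Ducky's implementation or configuration format: the key names (`url`, `record`, `fields`), the sample page, and the helper functions are all hypothetical. The sketch also supports only a tiny subset of CSS selectors (`tag`, `tag.class`, and descendant combinations), which it translates into the limited XPath dialect of Python's standard `xml.etree.ElementTree`, and it assumes a well-formed page supplied inline instead of fetched from the URL.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical configuration; the key names are illustrative, not Ducky's format.
CONFIG = {
    "url": "http://example.com/books",  # recorded only; this sketch does not fetch it
    "record": "li.book",                # selector matching each repeated record
    "fields": {"title": "span.title", "price": "span.price"},
}

# Inline, well-formed sample page standing in for the fetched document.
SAMPLE_PAGE = """
<html><body><ul>
  <li class="book"><span class="title">Web Wrappers</span><span class="price">30</span></li>
  <li class="book"><span class="title">CSS Selectors</span><span class="price">25</span></li>
</ul></body></html>
"""


def css_to_path(selector):
    """Translate a tiny CSS subset into an ElementTree path expression.

    Supports `tag`, `tag.class`, `.class`, and whitespace (descendant) combinators.
    The class test is an exact match on the whole attribute value, so elements
    carrying several classes are not matched.
    """
    steps = []
    for part in selector.split():
        tag, _, cls = part.partition(".")
        tag = tag or "*"
        steps.append(f"{tag}[@class='{cls}']" if cls else tag)
    return ".//" + "//".join(steps)


def extract(config, page):
    """Return one dict per record, with one entry per configured field."""
    root = ET.fromstring(page)
    rows = []
    for record in root.findall(css_to_path(config["record"])):
        row = {}
        for name, selector in config["fields"].items():
            node = record.find(css_to_path(selector))
            has_text = node is not None and node.text
            row[name] = node.text.strip() if has_text else None
        rows.append(row)
    return rows


def to_csv(rows, fieldnames):
    """Serialize the extracted rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = extract(CONFIG, SAMPLE_PAGE)
print(json.dumps(rows, indent=2))              # JSON output
print(to_csv(rows, list(CONFIG["fields"])))    # CSV output
```

Changing what is extracted only requires editing `CONFIG`; the extraction code itself stays untouched, which is the property the abstract attributes to a configuration-file design.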


Pages: 342-347

Page count: 6

50 items in total

[31] AUTOMATIC WRAPPER SYSTEM FOR SEMI-STRUCTURED DOCUMENTS BASED ON DATA MINING
Rancea, Irina
Sgarciu, Valentin
UNIVERSITY POLITEHNICA OF BUCHAREST SCIENTIFIC BULLETIN SERIES C-ELECTRICAL ENGINEERING AND COMPUTER SCIENCE, 2012, 74 (04): 55-66
[32] Structured Data in Web Search
Halevy, Alon
PROCEEDINGS OF THE 22ND ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT (CIKM'13), 2013: 7-7
[33] An Analysis of Structured Data on the Web
Dalvi, Nilesh
Machanavajjhala, Ashwin
Pang, Bo
PROCEEDINGS OF THE VLDB ENDOWMENT, 2012, 5 (07): 680-691
[34] Temporal and spatial attribute extraction from web documents and time-specific regional web search system
Tezuka, T
Tanaka, K
WEB AND WIRELESS GEOGRAPHICAL INFORMATION SYSTEMS, 2005, 3428: 14-25
[35] Information extraction from Web pages using semi-structured data alignment
Kuboyama, Tetsuji
Miyahara, Tetsuhiro
Hirokawa, Sachio
Itou, Eisuke
WMSCI 2005: 9th World Multi-Conference on Systemics, Cybernetics and Informatics, Vol 1, 2005: 42-47
[36] Opinion Extraction & Classification of Reviews from Web Documents
Shandilya, Shishir K.
Jain, Suresh
2009 IEEE INTERNATIONAL ADVANCE COMPUTING CONFERENCE, VOLS 1-3, 2009: 924-927
[37] Cultural Heritage: knowledge extraction from web documents
Sassolini, Eva
Cinini, Alessandra
LREC 2010 - SEVENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2010: 3363-3367
[38] Learning from similarity and information extraction from structured documents
Holecek, Martin
INTERNATIONAL JOURNAL ON DOCUMENT ANALYSIS AND RECOGNITION, 2021
[39] Cell Extraction and Horizontal-Scale Correction in Structured Documents
Srivastava, Divya
Harit, Gaurav
PROCEEDINGS OF 3RD INTERNATIONAL CONFERENCE ON COMPUTER VISION AND IMAGE PROCESSING, CVIP 2018, VOL 2, 2020, 1024: 53-64
[40] Learning from similarity and information extraction from structured documents
Martin Holeček
International Journal on Document Analysis and Recognition (IJDAR), 2021, 24: 149-165
