Ducky: A Data Extraction System for Various Structured Web Documents

Cited by: 2

Authors:

Kanaoka, Kei [1]

Fujii, Yotaro [1]

Toyama, Motomichi [1]

Affiliation:

[1] Keio Univ, Dept Comp Sci, Yokohama, Kanagawa, Japan

Source:

PROCEEDINGS OF THE 18TH INTERNATIONAL DATABASE ENGINEERING AND APPLICATIONS SYMPOSIUM (IDEAS14) | 2014

Keywords:

Data Extraction; Web scraping; Web Wrapper; CSS selector;

DOI:

10.1145/2628194.2628244

Chinese Library Classification (CLC) number:

TP301 [Theory, Methods];

Subject classification code:

081202 ;

Abstract:

The World Wide Web has become a primary source of information, so extracting data from Web sources has become a key technology. In this paper, we introduce Ducky, a semi-automatic system that includes a Web wrapper which extracts data from Web sources and translates it into structured data. In Ducky, users define a configuration file consisting of several parameters (the URL of the Web page, the CSS selectors that locate the data to retrieve, and so on), so they do not need to write Web scraping programs at all. The definition is simple, yet it can extract data flexibly from various structured Web pages. Additionally, Ducky provides a Web API and several output data formats: XML, JSON, and CSV. Finally, experiments confirmed that Ducky can accurately extract data from 22 different structured Web sources.
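
The configuration-driven approach described in the abstract can be illustrated with a minimal sketch. It is not Ducky's implementation or configuration format, which this record does not show. The names `CONFIG`, `SimpleSelectorParser`, `extract`, and `to_csv` are hypothetical, the page is an inline sample rather than a fetched URL, and the parser handles only a tiny subset of CSS selectors (`tag`, `.class`, `tag.class`).

```python
import csv
import io
import json
from html.parser import HTMLParser

# Void elements have no closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class SimpleSelectorParser(HTMLParser):
    """Collects the text of elements matching 'tag', '.class', or 'tag.class'."""

    def __init__(self, selector):
        super().__init__()
        tag, _, cls = selector.partition(".")
        self.tag = tag or None
        self.cls = cls or None
        self.depth = 0      # > 0 while inside a matching element
        self.buffer = []    # text pieces of the element being collected
        self.results = []   # one string per matched element

    def _matches(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        tag_ok = self.tag is None or tag == self.tag
        cls_ok = self.cls is None or self.cls in classes
        return tag_ok and cls_ok

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif self._matches(tag, attrs):
            self.depth = 1
            self.buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.results.append("".join(self.buffer).strip())

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)


def extract(html, config):
    """Apply each field's selector and zip the columns into records."""
    columns = {}
    for name, selector in config["fields"].items():
        parser = SimpleSelectorParser(selector)
        parser.feed(html)
        parser.close()
        columns[name] = parser.results
    return [dict(zip(columns, row)) for row in zip(*columns.values())]


def to_csv(records):
    """Serialize records to CSV text with a header row."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(records[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()


# Hypothetical configuration: a page URL plus one selector per output field.
CONFIG = {
    "url": "http://example.com/books",  # placeholder, not fetched in this sketch
    "fields": {"title": "span.title", "price": "span.price"},
}

SAMPLE_HTML = """
<ul>
  <li class="book"><span class="title">Dune</span><span class="price">9.99</span></li>
  <li class="book"><span class="title">Emma</span><span class="price">4.50</span></li>
</ul>
"""

records = extract(SAMPLE_HTML, CONFIG)
print(json.dumps(records, indent=2))
print(to_csv(records))
```

The point of the sketch is that the extraction logic stays fixed while only the configuration changes per site, which is the property the abstract attributes to Ducky. A real wrapper would need a full CSS selector engine and error handling for pages where the selected columns differ in length.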


Pages: 342-347

Page count: 6
