Ducky: A Data Extraction System for Various Structured Web Documents

Cited by: 2
Authors
Kanaoka, Kei [1]
Fujii, Yotaro [1]
Toyama, Motomichi [1]
Affiliations
[1] Keio Univ, Dept Comp Sci, Yokohama, Kanagawa, Japan
Keywords
Data Extraction; Web scraping; Web Wrapper; CSS selector
DOI
10.1145/2628194.2628244
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
The World Wide Web has become a primary source of information, so extracting data from Web sources has become a key technology. In this paper, we introduce Ducky, a semi-automatic system that includes a Web wrapper which extracts data from Web sources and translates it into structured data. In Ducky, users define a configuration file consisting of several parameters (the URL of the Web page, the CSS selectors that locate the data to retrieve, and so on) and do not need to write Web scraping programs at all. The definition is simple, yet it can extract data flexibly from various structured Web pages. Additionally, Ducky provides a Web API and several output data formats: XML, JSON, and CSV. Finally, experiments confirmed that Ducky can accurately extract data from 22 different structured Web sources.
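The abstract does not show Ducky's actual configuration syntax, so the following is only a minimal sketch of the general idea it describes: a declarative configuration (a URL plus CSS selectors) drives extraction, and the result is serialized to JSON or CSV. The key names `url` and `fields`, the helper functions, and the sample HTML are all hypothetical. The selector support is limited to the `tag.class` form and does not handle nested elements with the same tag; the URL is recorded but not fetched.

```python
import csv
import io
import json
from html.parser import HTMLParser

# Hypothetical configuration; the key names are illustrative, not Ducky's real format.
CONFIG = {
    "url": "http://example.com/books",  # recorded only; this sketch does not fetch it
    "fields": {"title": "h2.title", "price": "span.price"},
}

# Inline stand-in for the page that the URL would return.
SAMPLE_HTML = """
<div class="book"><h2 class="title">Dune</h2><span class="price">9.99</span></div>
<div class="book"><h2 class="title">Emma</h2><span class="price">4.50</span></div>
"""


class SimpleSelectorParser(HTMLParser):
    """Collects the text of elements matching 'tag.class' selectors (a tiny CSS subset)."""

    def __init__(self, fields):
        super().__init__()
        # Split each 'tag.class' selector into a (tag, class) pair.
        self.targets = {name: tuple(sel.split(".", 1)) for name, sel in fields.items()}
        self.values = {name: [] for name in fields}
        self.current = None  # name of the field whose element is currently open

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for name, (want_tag, want_class) in self.targets.items():
            if tag == want_tag and want_class in classes:
                self.current = name
                self.values[name].append("")

    def handle_endtag(self, tag):
        if self.current and tag == self.targets[self.current][0]:
            self.current = None

    def handle_data(self, data):
        if self.current:
            self.values[self.current][-1] += data


def extract(config, html):
    """Return one dict per record by pairing up the i-th match of every field."""
    parser = SimpleSelectorParser(config["fields"])
    parser.feed(html)
    names = list(config["fields"])
    columns = [[v.strip() for v in parser.values[n]] for n in names]
    return [dict(zip(names, row)) for row in zip(*columns)]


def to_json(records):
    return json.dumps(records)


def to_csv(records):
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(records[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


records = extract(CONFIG, SAMPLE_HTML)
print(to_json(records))
print(to_csv(records))
```

In this sketch, changing what is extracted only requires editing `CONFIG`, which mirrors the abstract's point that users do not write scraping programs themselves. A real system would need a full CSS selector engine and an HTTP client.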
Pages: 342-347
Number of pages: 6