Ducky: A Data Extraction System for Various Structured Web Documents

Cited by: 2
Authors
Kanaoka, Kei [1]
Fujii, Yotaro [1]
Toyama, Motomichi [1]
Affiliation
[1] Keio Univ, Dept Comp Sci, Yokohama, Kanagawa, Japan
Keywords
Data Extraction; Web scraping; Web Wrapper; CSS selector
DOI
10.1145/2628194.2628244
Chinese Library Classification
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
The World Wide Web has become a primary source of information, so extracting data from Web sources has become a key technology. In this paper, we introduce Ducky, a semi-automatic system that includes a Web wrapper which extracts data from Web sources and translates it into structured data. In Ducky, users define a configuration file consisting of several parameters (the URL of the Web page, the CSS selectors that locate the data to retrieve, and so on) and do not need to write Web scraping programs at all. The definition is simple, yet it can extract data flexibly from various structured Web pages. Additionally, Ducky provides a Web API and several output data formats: XML, JSON, and CSV. Finally, experiments confirmed that Ducky can accurately extract data from 22 different structured Web sources.
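To make the configuration-driven idea in the abstract concrete, the following is a minimal sketch, not Ducky's actual implementation or file format. The `CONFIG` layout, the field names, the sample HTML, and the helper names (`SelectorCollector`, `select`, `extract`, `to_csv`) are all assumptions introduced for illustration. The sketch supports only `tag` and `tag.class` selectors, a small subset of CSS, and it parses an inline HTML string instead of fetching the configured URL.

```python
import csv
import io
import json
from html.parser import HTMLParser

# Hypothetical configuration: the keys and layout are assumptions,
# not Ducky's real configuration format.
CONFIG = {
    "url": "http://example.com/books",  # placeholder; not fetched in this sketch
    "fields": {"title": "h2.title", "price": "span.price"},
}


class SelectorCollector(HTMLParser):
    """Collects the text of elements matching a 'tag' or 'tag.class' selector.

    Simplification: capture stops at the first closing tag with the same name,
    so nested elements of the same tag are not handled.
    """

    def __init__(self, selector):
        super().__init__()
        self.tag, _, self.cls = selector.partition(".")
        self.capturing = False
        self.results = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and (not self.cls or self.cls in classes):
            self.capturing = True
            self.results.append("")

    def handle_endtag(self, tag):
        if tag == self.tag:
            self.capturing = False

    def handle_data(self, data):
        if self.capturing:
            self.results[-1] += data


def select(html, selector):
    """Return the stripped text of every element matching the selector."""
    parser = SelectorCollector(selector)
    parser.feed(html)
    return [text.strip() for text in parser.results]


def extract(html, config):
    """Build one record per row by zipping the per-field matches together."""
    columns = {name: select(html, sel) for name, sel in config["fields"].items()}
    return [dict(zip(columns, values)) for values in zip(*columns.values())]


def to_csv(records):
    """Serialize the records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0]))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


SAMPLE_HTML = """
<div><h2 class="title">Book A</h2><span class="price">10</span></div>
<div><h2 class="title">Book B</h2><span class="price">12</span></div>
"""

records = extract(SAMPLE_HTML, CONFIG)
print(json.dumps(records, indent=2))  # JSON output
print(to_csv(records))                # CSV output
```

In this sketch, changing what is extracted only requires editing `CONFIG`, which mirrors the abstract's claim that users do not write scraping code themselves. A real system would likely rely on a full CSS selector engine and an HTTP client, both omitted here.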
Pages: 342-347
Number of pages: 6