Web scraping technologies in an API world

Cited by: 80
Authors
Glez-Pena, Daniel [1]
Lourenco, Analia [1, 2]
Lopez-Fernandez, Hugo [1]
Reboiro-Jato, Miguel [1]
Fdez-Riverola, Florentino [3]
Affiliations
[1] Univ Vigo, Dept Comp Sci, Vigo, Spain
[2] Univ Minho, Ctr Biol Engn, P-4719 Braga, Portugal
[3] Univ Vigo, Next Generat Comp Syst Grp, Vigo, Spain
Keywords
Web scraping; data integration; interoperability; database interfaces; SET ENRICHMENT ANALYSIS; RESOURCE; DATABASE; BIOINFORMATICS; INFORMATION; INTEGRATION; SERVICES; COLLECTION; DISEASE; NATION
DOI
10.1093/bib/bbt026
Chinese Library Classification
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Web services are the de facto standard in biomedical data integration. However, there are data integration scenarios that cannot be fully covered by Web services. A number of Web databases and tools do not support Web services, and existing Web services do not cover all possible user data demands. As a consequence, Web data scraping, one of the oldest techniques for extracting Web contents, is still in a position to offer a valid and valuable service to a wide range of bioinformatics applications, ranging from simple extraction robots to online meta-servers. This article reviews existing scraping frameworks and tools, identifying their strengths and limitations in terms of extraction capabilities. The main focus is on showing how straightforward it is today to set up a data scraping pipeline, with minimal programming effort, to meet a number of practical needs. As an example, we introduce a biomedical data extraction scenario where the desired data sources, well-known in clinical microbiology and similar domains, do not yet offer programmatic interfaces. Moreover, we describe the operation of WhichGenes and PathJam, two bioinformatics meta-servers that use scraping as a means to cope with gene set enrichment analysis.
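The abstract's point about minimal programming effort can be illustrated with a small sketch. The following is not taken from the article: it assumes Python with only the standard library, and the HTML fragment, column names, and the `TableScraper` and `scrape_table` names are all hypothetical stand-ins for a page fetched from a source that offers no programmatic interface.

```python
from html.parser import HTMLParser

# Hypothetical fragment standing in for a fetched results page (made-up data).
SAMPLE_HTML = """
<table id="results">
  <tr><th>Organism</th><th>Drug</th></tr>
  <tr><td>E. coli</td><td>ampicillin</td></tr>
  <tr><td>S. aureus</td><td>vancomycin</td></tr>
</table>
"""


class TableScraper(HTMLParser):
    """Collects the text of every table row as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows
        self._row = None   # cells of the row currently being read
        self._cell = None  # text chunks of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_table(html):
    """Return one dict per data row, keyed by the header row's cell text."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *data_rows = parser.rows
    return [dict(zip(header, row)) for row in data_rows]


records = scrape_table(SAMPLE_HTML)
for record in records:
    print(record)
```

In a real pipeline the inline fragment would be replaced by a fetched page, and the extraction rules would need to follow the target site's actual markup, which this sketch does not attempt to model.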
Pages: 788-797
Page count: 10