Web scraping technologies in an API world

Cited by: 80
Authors
Glez-Pena, Daniel [1]
Lourenco, Analia [1, 2]
Lopez-Fernandez, Hugo [1]
Reboiro-Jato, Miguel [1]
Fdez-Riverola, Florentino [3]
Affiliations
[1] Univ Vigo, Dept Comp Sci, Vigo, Spain
[2] Univ Minho, Ctr Biol Engn, P-4719 Braga, Portugal
[3] Univ Vigo, Next Generat Comp Syst Grp, Vigo, Spain
Keywords
Web scraping; data integration; interoperability; database interfaces; SET ENRICHMENT ANALYSIS; RESOURCE; DATABASE; BIOINFORMATICS; INFORMATION; INTEGRATION; SERVICES; COLLECTION; DISEASE; NATION
DOI
10.1093/bib/bbt026
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Web services are the de facto standard in biomedical data integration. However, there are data integration scenarios that cannot be fully covered by Web services. A number of Web databases and tools do not support Web services, and existing Web services do not cover for all possible user data demands. As a consequence, Web data scraping, one of the oldest techniques for extracting Web contents, is still in position to offer a valid and valuable service to a wide range of bioinformatics applications, ranging from simple extraction robots to online meta-servers. This article reviews existing scraping frameworks and tools, identifying their strengths and limitations in terms of extraction capabilities. The main focus is set on showing how straightforward it is today to set up a data scraping pipeline, with minimal programming effort, and answer a number of practical needs. For exemplification purposes, we introduce a biomedical data extraction scenario where the desired data sources, well-known in clinical microbiology and similar domains, do not offer programmatic interfaces yet. Moreover, we describe the operation of WhichGenes and PathJam, two bioinformatics meta-servers that use scraping as means to cope with gene set enrichment analysis.
Pages: 788-797
Number of pages: 10
Related papers
50 in total
  • [1] Web Scraping versus Twitter API: A Comparison for a Credibility Analysis
    Dongo, Irvin
    Cardinale, Yudith
    Aguilera, Ana
    Martinez, Fabiola
    Quintero, Yuni
    Barrios, Sergio
    22ND INTERNATIONAL CONFERENCE ON INFORMATION INTEGRATION AND WEB-BASED APPLICATIONS & SERVICES (IIWAS2020), 2020, : 263 - 273
  • [2] Social Media Web Scraping using Social Media Developers API and Regex
    Dewi, Lusiana Citra
    Meiliana
    Chandra, Alvin
    4TH INTERNATIONAL CONFERENCE ON COMPUTER SCIENCE AND COMPUTATIONAL INTELLIGENCE (ICCSCI 2019) : ENABLING COLLABORATION TO ESCALATE IMPACT OF RESEARCH RESULTS FOR SOCIETY, 2019, 157 : 444 - 449
  • [3] Optimization of Convenience Stores' Distribution System with Web Scraping and Google API Service
    Le, Thai Quang
    Pishva, Davar
    2015 17TH INTERNATIONAL CONFERENCE ON ADVANCED COMMUNICATION TECHNOLOGY (ICACT), 2015,
  • [4] Application of Web Scraping and Google API Service to Optimize Convenience Stores' Distribution
    Quang Thai Le
    Pishva, Davar
    2015 17TH INTERNATIONAL CONFERENCE ON ADVANCED COMMUNICATION TECHNOLOGY (ICACT), 2015, : 478 - 482
  • [5] A qualitative and quantitative comparison between Web scraping and API methods for Twitter credibility analysis
    Dongo, Irvin
    Cardinale, Yudith
    Aguilera, Ana
    Martinez, Fabiola
    Quintero, Yuni
    Robayo, German
    Cabeza, David
    INTERNATIONAL JOURNAL OF WEB INFORMATION SYSTEMS, 2021, 17 (06) : 580 - 606
  • [6] Web scraping proxy
    Katseff, HP
    DR DOBBS JOURNAL, 2003, 28 (06) : 46 - +
  • [7] Web Scraping for Astronomy
    Derriere, S.
    Boch, T.
    ASTRONOMICAL DATA ANALYSIS SOFTWARE AND SYSTEMS XXI, 2012, 461 : 319 - 322
  • [8] Anwendungen des Web Scraping in der amtlichen Statistik [Applications for web scraping in official statistics]
    Heidi Kühnemann
    AStA Wirtschafts- und Sozialstatistisches Archiv, 2021, 15 (1) : 5 - 25
  • [9] Web Scraping Using R
    Bradley, Alex
    James, Richard J. E.
    ADVANCES IN METHODS AND PRACTICES IN PSYCHOLOGICAL SCIENCE, 2019, 2 (03) : 264 - 270
  • [10] Scraping the demos. Digitalization, web scraping and the democratic project
    Ulbricht, Lena
    DEMOCRATIZATION, 2020, 27 (03) : 426 - 442