Web scraping technologies in an API world

Cited by: 80
Authors
Glez-Pena, Daniel [1]
Lourenco, Analia [1, 2]
Lopez-Fernandez, Hugo [1]
Reboiro-Jato, Miguel [1]
Fdez-Riverola, Florentino [3]
Affiliations
[1] Univ Vigo, Dept Comp Sci, Vigo, Spain
[2] Univ Minho, Ctr Biol Engn, P-4719 Braga, Portugal
[3] Univ Vigo, Next Generat Comp Syst Grp, Vigo, Spain
Keywords
Web scraping; data integration; interoperability; database interfaces; SET ENRICHMENT ANALYSIS; RESOURCE; DATABASE; BIOINFORMATICS; INFORMATION; INTEGRATION; SERVICES; COLLECTION; DISEASE; NATION;
DOI
10.1093/bib/bbt026
Chinese Library Classification (CLC)
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Web services are the de facto standard in biomedical data integration. However, there are data integration scenarios that cannot be fully covered by Web services. A number of Web databases and tools do not support Web services, and existing Web services do not cover all possible user data demands. As a consequence, Web data scraping, one of the oldest techniques for extracting Web contents, is still in a position to offer a valid and valuable service to a wide range of bioinformatics applications, ranging from simple extraction robots to online meta-servers. This article reviews existing scraping frameworks and tools, identifying their strengths and limitations in terms of extraction capabilities. The main focus is on showing how straightforward it is today to set up a data scraping pipeline with minimal programming effort and to answer a number of practical needs. As an example, we introduce a biomedical data extraction scenario in which the desired data sources, well known in clinical microbiology and similar domains, do not yet offer programmatic interfaces. Moreover, we describe the operation of WhichGenes and PathJam, two bioinformatics meta-servers that use scraping as a means to cope with gene set enrichment analysis.
Pages: 788-797
Page count: 10