Domain-Specific Pretraining for Vertical Search: Case Study on Biomedical Literature

Cited by: 5
Authors
Wang, Yu [1]
Li, Jinchao [1]
Naumann, Tristan [1]
Xiong, Chenyan [1]
Cheng, Hao [1]
Tinn, Robert [1]
Wong, Cliff [1]
Usuyama, Naoto [1]
Rogahn, Richard [1]
Shen, Zhihong [1]
Qin, Yang [1]
Horvitz, Eric [1]
Bennett, Paul N. [1]
Gao, Jianfeng [1]
Poon, Hoifung [1]
Affiliations
[1] Microsoft Res, Redmond, WA 98052 USA
Keywords
Domain-specific pretraining; Search; Biomedical; NLP; COVID-19
DOI
10.1145/3447548.3469053
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Information overload is a prevalent challenge in many high-value domains. A prominent case in point is the explosion of the biomedical literature on COVID-19, which swelled to hundreds of thousands of papers in a matter of months. In general, the biomedical literature expands by two papers every minute, totalling over a million new papers every year. Search in the biomedical realm, and in many other vertical domains, is challenging due to the scarcity of direct supervision from click logs. Self-supervised learning has emerged as a promising direction to overcome the annotation bottleneck. We propose a general approach for vertical search based on domain-specific pretraining and present a case study for the biomedical domain. Despite being substantially simpler and not using any relevance labels for training or development, our method performs comparably to or better than the best systems in the official TREC-COVID evaluation, a COVID-related biomedical search competition. Using distributed computing in modern cloud infrastructure, our system can scale to tens of millions of articles on PubMed and has been deployed as Microsoft Biomedical Search, a new search experience for biomedical literature: https://aka.ms/biomedsearch.
Pages: 3717-3725
Page count: 9