Towards a unified search: Improving PubMed retrieval with full text

Cited by: 3

Authors:

Kim W. [1]

Yeganova L. [1]

Comeau D.C. [1]

Wilbur W.J. [1]

Lu Z. [1]

Affiliation:

[1] National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894

Source:

Journal of Biomedical Informatics | 2022 / Vol. 134

Funding:

National Institutes of Health (NIH)

Keywords:

Combining abstract with full text; Full text search; Information retrieval; PubMed search engine; Search relevance gold standard;

DOI:

10.1016/j.jbi.2022.104211

Abstract:

Objective: A significant number of recent articles in PubMed have full text available in PubMed Central®, and the availability of full texts has been consistently growing. However, it is not currently possible for a user to simultaneously query the contents of both databases and receive a single integrated search result. In this study, we investigate how to score full text articles given a multitoken query, and how to combine those full text article scores with scores originating from abstracts to achieve an overall improved retrieval performance.

Materials and methods: For scoring full text articles, we propose a method to combine information coming from different sections by converting the traditionally used BM25 scores into log odds ratio scores, which can be treated uniformly. We further propose a method that successfully combines scores from two heterogeneous retrieval sources (full text articles and abstract-only articles) by balancing the contributions of their respective scores through a probabilistic transformation. We use PubMed click data, which consists of queries sampled from PubMed user logs along with a subset of retrieved and clicked documents, to train the probabilistic functions and to evaluate retrieval effectiveness.

Results and conclusions: Random ranking achieves a MAP score of 0.579 on our PubMed click data. BM25 ranking on PubMed abstracts improves the MAP by 10.6%. For full text documents, experiments confirm that BM25 section scores are of different value depending on the section type and are not directly comparable. Naïvely using the body text of articles along with abstract text degrades the overall quality of the search. The proposed log odds ratio scores normalize and combine the contributions of occurrences of query tokens in different sections. By including full text where available, we gain another 0.67%, or 7% relative improvement over abstract alone. We find an advantage in the more accurate estimate of the value of BM25 scores depending on the section from which they were produced. Taking the sum of the top three section scores performs the best. © 2022
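
The scoring idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not say how the probabilistic functions are estimated, so the binned click-rate estimate, the add-constant smoothing, and all names here (`fit_log_odds`, `section_log_odds`, `document_score`, the bin `edges`) are assumptions made for illustration. Each section type gets its own mapping from raw BM25 score to a log odds ratio of being clicked, and a document is scored by the sum of its top three section values.

```python
import math
from bisect import bisect_right


def fit_log_odds(scores, clicked, edges, smoothing=0.5):
    """Estimate one log odds ratio per BM25 score bin for a single section type.

    scores  : raw BM25 scores observed in training click data (hypothetical)
    clicked : 1 if the document was clicked, else 0
    edges   : ascending bin boundaries (assumed, not from the paper)
    The value for a bin is log-odds(click | bin) minus the overall log-odds.
    """
    n_bins = len(edges) + 1
    pos = [0.0] * n_bins
    tot = [0.0] * n_bins
    for s, c in zip(scores, clicked):
        b = bisect_right(edges, s)
        tot[b] += 1
        pos[b] += c
    # Smoothed overall click probability and its log odds.
    p_all = (sum(pos) + smoothing) / (sum(tot) + 2 * smoothing)
    prior = math.log(p_all / (1 - p_all))
    table = []
    for b in range(n_bins):
        p = (pos[b] + smoothing) / (tot[b] + 2 * smoothing)
        table.append(math.log(p / (1 - p)) - prior)
    return table


def section_log_odds(score, edges, table):
    """Look up the log odds ratio for a raw BM25 section score."""
    return table[bisect_right(edges, score)]


def document_score(section_scores, models, top_k=3):
    """Sum the top_k section log odds ratio values for one document.

    section_scores : {section name: raw BM25 score}
    models         : {section name: (edges, table)} fitted per section type
    """
    values = sorted(
        (section_log_odds(s, *models[name]) for name, s in section_scores.items()),
        reverse=True,
    )
    return sum(values[:top_k])
```

Because every section is mapped onto the same log odds scale, an abstract-only record (with fewer than three sections) and a full text article can be ranked in one list, which is the kind of uniform treatment the abstract describes.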
