Human-Comparable Sensitivity of Large Language Models in Identifying Eligible Studies Through Title and Abstract Screening: 3-Layer Strategy Using GPT-3.5 and GPT-4 for Systematic Reviews
Citations: 0
Authors:
Matsui, Kentaro [1,2]
Utsumi, Tomohiro [2,3]
Aoki, Yumi [4]
Maruki, Taku [5]
Takeshima, Masahiro [6]
Takaesu, Yoshikazu [7]
Affiliations:
[1] Natl Ctr Hosp, Natl Ctr Neurol & Psychiat, Dept Clin Lab, Kodaira, Japan
[2] Natl Inst Mental Hlth, Natl Ctr Neurol & Psychiat, Dept Sleep Wake Disorders, Kodaira, Japan
[3] Jikei Univ, Sch Med, Dept Psychiat, Tokyo, Japan
[4] St Lukes Int Univ, Grad Sch Nursing Sci, Tokyo, Japan
[5] Kyorin Univ, Sch Med, Dept Neuropsychiat, Tokyo, Japan
[6] Akita Univ, Grad Sch Med, Dept Neuropsychiat, Akita, Japan
[7] Univ Ryukyus, Grad Sch Med, Dept Neuropsychiat, 207 Uehara, Nishihara, Okinawa 9030215, Japan
Keywords:
systematic review;
screening;
GPT-3.5;
GPT-4;
language model;
information science;
library science;
artificial intelligence;
prompt engineering;
meta-analysis;
DOI:
10.2196/52758
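Chinese Library Classification number:
R19 [Health care organizations and services (health services administration)];
Discipline classification code: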
Abstract:
Background: The screening process for systematic reviews is resource-intensive. Although previous machine learning solutions have reported reductions in workload, they risked excluding relevant papers.
Objective: We evaluated the performance of a 3-layer screening method using GPT-3.5 and GPT-4 to streamline the title and abstract-screening process for systematic reviews. Our goal is to develop a screening method that maximizes sensitivity for identifying relevant records.
Methods: We conducted screenings on 2 of our previous systematic reviews related to the treatment of bipolar disorder, with 1381 records from the first review and 3146 from the second. Screenings were conducted using GPT-3.5 (gpt-3.5-turbo-0125) and GPT-4 (gpt-4-0125-preview) across three layers: (1) research design, (2) target patients, and (3) interventions and controls. The 3-layer screening was conducted using prompts tailored to each study. During this process, information extraction according to each study's inclusion criteria and optimization for screening were carried out using a GPT-4-based flow without manual adjustments. Records were evaluated at each layer, and those meeting the inclusion criteria at all layers were subsequently judged as included.
Results: On each layer, both GPT-3.5 and GPT-4 were able to process about 110 records per minute, and the total time required for screening the first and second studies was approximately 1 hour and 2 hours, respectively. In the first study, the sensitivities/specificities of GPT-3.5 and GPT-4 were 0.900/0.709 and 0.806/0.996, respectively. Both screenings by GPT-3.5 and GPT-4 judged all 6 records used for the meta-analysis as included. In the second study, the sensitivities/specificities of GPT-3.5 and GPT-4 were 0.958/0.116 and 0.875/0.855, respectively. The sensitivities for the relevant records align with those of human evaluators: 0.867-1.000 for the first study and 0.776-0.979 for the second study. Both screenings by GPT-3.5 and GPT-4 judged all 9 records used for the meta-analysis as included. After accounting for records justifiably excluded by GPT-4, the sensitivities/specificities of the GPT-4 screening were 0.962/0.996 in the first study and 0.943/0.855 in the second study. Further investigation indicated that the cases incorrectly excluded by GPT-3.5 were due to a lack of domain knowledge, while the cases incorrectly excluded by GPT-4 were due to misinterpretations of the inclusion criteria.
Conclusions: Our 3-layer screening method with GPT-4 demonstrated an acceptable level of sensitivity and specificity that supports its practical application in systematic review screenings. Future research should aim to generalize this approach and explore its effectiveness in diverse settings, both medical and nonmedical, to fully establish its use and operational feasibility.
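Illustrative note: the inclusion rule in the Methods (a record is included only if it passes all three layers) and the sensitivity/specificity figures in the Results can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' code or prompts: the keyword checks are hypothetical stand-ins for the GPT-based judgments at each layer, and the four toy records and their reference labels are invented for illustration only.

```python
from typing import Callable, Dict, Sequence, Tuple

Record = Dict[str, str]
LayerCheck = Callable[[Record], bool]


# Hypothetical stand-ins for the per-layer LLM judgments (assumption:
# the real method sends a tailored prompt to GPT-3.5 or GPT-4 instead).
def design_layer(record: Record) -> bool:
    return "randomized" in record["title"].lower()


def patient_layer(record: Record) -> bool:
    return "bipolar" in record["title"].lower()


def intervention_layer(record: Record) -> bool:
    title = record["title"].lower()
    return any(drug in title for drug in ("lithium", "quetiapine"))


LAYERS: Tuple[LayerCheck, ...] = (design_layer, patient_layer, intervention_layer)


def screen_record(record: Record, layers: Sequence[LayerCheck] = LAYERS) -> bool:
    """Include a record only if it meets the criteria at every layer."""
    return all(check(record) for check in layers)


def sensitivity_specificity(
    predicted: Sequence[bool], actual: Sequence[bool]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) against reference labels.

    Assumes at least one truly relevant and one truly irrelevant record,
    otherwise a denominator would be zero.
    """
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fn = sum(1 for p, a in pairs if not p and a)
    tn = sum(1 for p, a in pairs if not p and not a)
    fp = sum(1 for p, a in pairs if p and not a)
    return tp / (tp + fn), tn / (tn + fp)


# Invented toy records with invented human reference labels.
records = [
    {"title": "Randomized trial of lithium in bipolar disorder"},
    {"title": "Case report of lithium toxicity in bipolar disorder"},
    {"title": "Randomized trial of lithium in major depression"},
    {"title": "Randomised trial of quetiapine in bipolar disorder"},
]
actual = [True, False, False, True]

predicted = [screen_record(r) for r in records]
sensitivity, specificity = sensitivity_specificity(predicted, actual)
print(f"sensitivity={sensitivity:.3f} specificity={specificity:.3f}")
```

In this toy run the fourth record is relevant but fails the design check because of its British spelling, so it counts as a false negative. Because every layer must pass, a single overly strict layer is enough to lose a relevant record, which is why the abstract emphasizes sensitivity.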
Pages: 15