Exploring the performance of large language models on hepatitis B infection-related questions: A comparative study

Authors
Li, Yu [1, 2]
Huang, Chen-Kai [1 ]
Hu, Yi [1 ]
Zhou, Xiao-Dong [1 ]
He, Cong [1 ]
Zhong, Jia-Wei [1 ]
Affiliations
[1] Nanchang Univ, Digest Dis Hosp,Jiangxi Med Coll, Jiangxi Clin Res Ctr Gastroenterol,Affiliated Hosp, Jiangxi Prov Key Lab Digest Dis,Dept Gastroenterol, 17 Yong Waizheng St, Nanchang 330006, Jiangxi, Peoples R China
[2] Nanchang Univ, Huankui Acad, Nanchang 330006, Jiangxi, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
ChatGPT-3.5; ChatGPT-4.0; Google Gemini; Hepatitis B infection; Accuracy;
DOI
10.3748/wjg.v31.i3.101092
Chinese Library Classification number
R57 [Diseases of the digestive system and abdomen];
Abstract
BACKGROUND: Patients with hepatitis B virus (HBV) infection require chronic and personalized care to improve outcomes. Large language models (LLMs) can potentially provide medical information for patients.
AIM: To examine the performance of three LLMs, ChatGPT-3.5, ChatGPT-4.0, and Google Gemini, in answering HBV-related questions.
METHODS: The LLMs' responses to HBV-related questions were independently graded by two medical professionals using a four-point accuracy scale, and disagreements were resolved by a third reviewer. Each question was run three times on each of the three LLMs. Readability was assessed via the Gunning Fog index and Flesch-Kincaid grade level.
RESULTS: Overall, all three LLM chatbots achieved high average accuracy scores for subjective questions (ChatGPT-3.5: 3.50; ChatGPT-4.0: 3.69; Google Gemini: 3.53, out of a maximum score of 4). With respect to objective questions, ChatGPT-4.0 achieved an 80.8% accuracy rate, compared with 62.9% for ChatGPT-3.5 and 73.1% for Google Gemini. Across the six domains, ChatGPT-4.0 performed better on diagnosis, whereas Google Gemini excelled on clinical manifestations. Notably, in the readability analysis, the mean Gunning Fog index and Flesch-Kincaid grade level scores of the three LLM chatbots were significantly higher than the standard eighth-grade level, far exceeding the reading level of the general population.
CONCLUSION: Our results highlight the potential of LLMs, especially ChatGPT-4.0, for delivering responses to HBV-related questions. LLMs may be an adjunctive informational tool for patients and physicians to improve outcomes. Nevertheless, current LLMs should not replace personalized treatment recommendations from physicians in the management of HBV infection.
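The abstract names two readability measures without defining them. As a rough illustration only, the sketch below applies the commonly published formulas for the Flesch-Kincaid grade level and the Gunning Fog index to pre-counted text statistics. The abstract does not say which tool or counting rules the authors used, so the function names, the input counts, and the definition of a "complex" word (three or more syllables) are assumptions made here for illustration.

```python
def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid grade level from raw counts (standard published formula)."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def gunning_fog(words: int, sentences: int, complex_words: int) -> float:
    """Gunning Fog index from raw counts.

    complex_words is assumed to be the number of words with three or
    more syllables; counting conventions vary between tools.
    """
    return 0.4 * ((words / sentences) + 100 * (complex_words / words))


if __name__ == "__main__":
    # Hypothetical counts for a single chatbot response, not data from the study.
    words, sentences, syllables, complex_words = 100, 5, 150, 10
    print(round(flesch_kincaid_grade(words, sentences, syllables), 2))
    print(round(gunning_fog(words, sentences, complex_words), 2))
```

Both scores approximate a United States school grade, so a value well above 8 indicates text harder than the eighth-grade level that the abstract uses as its benchmark.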
Pages: 11