Exploring the performance of large language models on hepatitis B infection-related questions: A comparative study

Cited: 0
Authors
Li, Yu [1 ,2 ]
Huang, Chen-Kai [1 ]
Hu, Yi [1 ]
Zhou, Xiao-Dong [1 ]
He, Cong [1 ]
Zhong, Jia-Wei [1 ]
Affiliations
[1] Nanchang Univ, Digest Dis Hosp, Jiangxi Med Coll, Jiangxi Clin Res Ctr Gastroenterol, Affiliated Hosp, Jiangxi Prov Key Lab Digest Dis, Dept Gastroenterol, 17 Yong Waizheng St, Nanchang 330006, Jiangxi, Peoples R China
[2] Nanchang Univ, Huankui Acad, Nanchang 330006, Jiangxi, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
ChatGPT-3.5; ChatGPT-4.0; Google Gemini; Hepatitis B infection; Accuracy;
DOI
10.3748/wjg.v31.i3.101092
CLC Number
R57 [Diseases of the digestive system and abdomen];
Abstract
BACKGROUND Patients with hepatitis B virus (HBV) infection require chronic, personalized care to improve outcomes. Large language models (LLMs) can potentially provide medical information for such patients.

AIM To examine the performance of three LLMs, ChatGPT-3.5, ChatGPT-4.0, and Google Gemini, in answering HBV-related questions.

METHODS Each question was submitted three times to each of the three LLMs. The responses were independently graded by two medical professionals using a four-point accuracy scale, and disagreements were resolved by a third reviewer. Readability was assessed with the Gunning Fog index and the Flesch-Kincaid grade level.

RESULTS Overall, all three LLM chatbots achieved high average accuracy scores on subjective questions (ChatGPT-3.5: 3.50; ChatGPT-4.0: 3.69; Google Gemini: 3.53, out of a maximum score of 4). On objective questions, ChatGPT-4.0 achieved an accuracy of 80.8%, compared with 62.9% for ChatGPT-3.5 and 73.1% for Google Gemini. Across the six domains, ChatGPT-4.0 performed best in diagnosis, whereas Google Gemini excelled in clinical manifestations. Notably, in the readability analysis, the mean Gunning Fog index and Flesch-Kincaid grade level scores of all three LLM chatbots were significantly higher than the recommended eighth-grade standard, far exceeding the reading level of the general population.

CONCLUSION Our results highlight the potential of LLMs, especially ChatGPT-4.0, for delivering responses to HBV-related questions. LLMs may serve as an adjunctive informational tool for patients and physicians to improve outcomes. Nevertheless, current LLMs should not replace personalized treatment recommendations from physicians in the management of HBV infection.
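For reference, the two readability metrics named in the abstract have standard published formulations; the record does not specify the authors' exact implementation, so the conventional definitions are given here:

\text{Gunning Fog index} = 0.4\left(\frac{\text{words}}{\text{sentences}} + 100\cdot\frac{\text{complex words}}{\text{words}}\right)

\text{Flesch-Kincaid grade level} = 0.39\cdot\frac{\text{words}}{\text{sentences}} + 11.8\cdot\frac{\text{syllables}}{\text{words}} - 15.59

Here "complex words" are words of three or more syllables. Both metrics estimate the U.S. school grade needed to comprehend a text, so mean scores above 8 indicate material harder to read than the eighth-grade level commonly recommended for patient-facing health information.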
Pages: 11