Exploring the performance of large language models on hepatitis B infection-related questions: A comparative study

Cited: 0
Authors
Li, Yu [1 ,2 ]
Huang, Chen-Kai [1 ]
Hu, Yi [1 ]
Zhou, Xiao-Dong [1 ]
He, Cong [1 ]
Zhong, Jia-Wei [1 ]
Affiliations
[1] Nanchang Univ, Digest Dis Hosp,Jiangxi Med Coll, Jiangxi Clin Res Ctr Gastroenterol,Affiliated Hosp, Jiangxi Prov Key Lab Digest Dis,Dept Gastroenterol, 17 Yong Waizheng St, Nanchang 330006, Jiangxi, Peoples R China
[2] Nanchang Univ, Huankui Acad, Nanchang 330006, Jiangxi, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
ChatGPT-3.5; ChatGPT-4.0; Google Gemini; Hepatitis B infection; Accuracy;
DOI
10.3748/wjg.v31.i3.101092
Chinese Library Classification (CLC)
R57 [Diseases of the digestive system and abdomen];
Subject Classification Code
Abstract
BACKGROUND
Patients with hepatitis B virus (HBV) infection require chronic, personalized care to improve outcomes. Large language models (LLMs) can potentially provide medical information for patients.
AIM
To examine the performance of three LLMs, ChatGPT-3.5, ChatGPT-4.0, and Google Gemini, in answering HBV-related questions.
METHODS
The LLMs' responses to HBV-related questions were independently graded by two medical professionals using a four-point accuracy scale, and disagreements were resolved by a third reviewer. Each question was posed three times to each of the three LLMs. Readability was assessed via the Gunning Fog index and the Flesch-Kincaid grade level.
RESULTS
Overall, all three LLM chatbots achieved high average accuracy scores for subjective questions (ChatGPT-3.5: 3.50; ChatGPT-4.0: 3.69; Google Gemini: 3.53, out of a maximum score of 4). For objective questions, ChatGPT-4.0 achieved an 80.8% accuracy rate, compared with 62.9% for ChatGPT-3.5 and 73.1% for Google Gemini. Across the six domains, ChatGPT-4.0 performed best in diagnosis, whereas Google Gemini excelled in clinical manifestations. Notably, in the readability analysis, the mean Gunning Fog index and Flesch-Kincaid grade level scores of the three LLM chatbots were significantly higher than the recommended eighth-grade standard, far exceeding the reading level of the general population.
CONCLUSION
Our results highlight the potential of LLMs, especially ChatGPT-4.0, for delivering responses to HBV-related questions. LLMs may serve as an adjunctive informational tool for patients and physicians to improve outcomes. Nevertheless, current LLMs should not replace personalized treatment recommendations from physicians in the management of HBV infection.
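
The abstract reports readability via the Gunning Fog index and the Flesch-Kincaid grade level. The sketch below shows how these two standard formulas are conventionally computed in Python; the tokenization, the syllable-counting heuristic, and the sample sentence are simplifying assumptions for illustration and do not reflect the authors' exact tooling.

    import re

    def count_syllables(word: str) -> int:
        """Rough syllable count via vowel groups (a common approximation)."""
        word = word.lower()
        count = len(re.findall(r"[aeiouy]+", word))
        if word.endswith("e") and count > 1:  # crude silent-e adjustment
            count -= 1
        return max(count, 1)

    def readability(text: str) -> dict:
        """Gunning Fog index and Flesch-Kincaid grade level.

        Uses the standard published formulas; sentence/word splitting and the
        'complex word' rule (>= 3 syllables) are simplified here.
        """
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        words = re.findall(r"[A-Za-z']+", text)
        if not sentences or not words:
            raise ValueError("text must contain at least one sentence and one word")
        syllables = sum(count_syllables(w) for w in words)
        complex_words = sum(1 for w in words if count_syllables(w) >= 3)

        words_per_sentence = len(words) / len(sentences)
        fog = 0.4 * (words_per_sentence + 100 * complex_words / len(words))
        fkgl = 0.39 * words_per_sentence + 11.8 * syllables / len(words) - 15.59
        return {"gunning_fog": round(fog, 2), "flesch_kincaid_grade": round(fkgl, 2)}

    if __name__ == "__main__":
        sample = ("Chronic hepatitis B infection requires long-term antiviral "
                  "therapy and regular monitoring of liver function.")
        print(readability(sample))

Both scores approximate the years of schooling needed to understand a passage, which is why the paper compares chatbot outputs against an eighth-grade benchmark.
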
Pages: 11
Related Papers
50 records in total
  • [31] Animal models for the study of hepatitis B virus infection
    Wei-Na Guo
    Bin Zhu
    Ling Ai
    Dong-Liang Yang
    Bao-Ju Wang
    Zoological Research, 2018, 39 (01) : 25 - 31
  • [32] Infection-related hypoglycemia in institutionalized demented patients: A comparative study of diabetic and nondiabetic patients
    Arinzon, Zeev
    Fidelman, Zeev
    Berner, Yitshal N.
    Adunsky, Abraham
    ARCHIVES OF GERONTOLOGY AND GERIATRICS, 2007, 45 (02) : 191 - 200
  • [33] Parvovirus B19 infection-related acute hepatitis after rituximab-containing regimen for treatment of diffuse large B-cell lymphoma
    Yang, Shih-Hung
    Lin, Long-Wei
    Fang, Yu-Jen
    Cheng, Ann-Lii
    Kuo, Sung-Hsin
    ANNALS OF HEMATOLOGY, 2012, 91 (02) : 291 - 294
  • [34] Parvovirus B19 infection-related acute hepatitis after rituximab-containing regimen for treatment of diffuse large B-cell lymphoma
    Shih-Hung Yang
    Long-Wei Lin
    Yu-Jen Fang
    Ann-Lii Cheng
    Sung-Hsin Kuo
    Annals of Hematology, 2012, 91 : 291 - 294
  • [35] Accuracy of Large Language Models in ACR Manual on Contrast Media-Related Questions
    Gunes, Yasin Celal
    Cesur, Turay
    ACADEMIC RADIOLOGY, 2024, 31 (07)
  • [36] Large Language Models Take on Cardiothoracic Surgery: A Comparative Analysis of the Performance of Four Models on American Board of Thoracic Surgery Exam Questions in 2023
    Khalpey, Zain
    Kumar, Ujjawal
    King, Nicholas
    Abraham, Alyssa
    Khalpey, Amina H.
    CUREUS JOURNAL OF MEDICAL SCIENCE, 2024, 16 (07)
  • [37] Comparative performance of risk prediction models for hepatitis B-related hepatocellular carcinoma in the United States
    Kim, Hyun-seok
    Yu, Xian
    Kramer, Jennifer
    Thrift, Aaron P.
    Richardson, Pete
    Hsu, Yao-Chun
    Flores, Avegail
    El-Serag, Hashem B.
    Kanwal, Fasiha
    JOURNAL OF HEPATOLOGY, 2022, 76 (02) : 294 - 301
  • [38] A Comparative Analysis of the Performance of Large Language Models and Human Respondents in Dermatology
    Murthy, Aravind Baskar
    Palaniappan, Vijayasankar
    Radhakrishnan, Suganya
    Rajaa, Sathish
    Karthikeyan, Kaliaperumal
    INDIAN DERMATOLOGY ONLINE JOURNAL, 2025, 16 (02) : 241 - 247
  • [39] Are Large Language Models Ready for Healthcare? A Comparative Study on Clinical Language Understanding
    Wang, Yuqing
    Zhao, Yun
    Petzold, Linda
    MACHINE LEARNING FOR HEALTHCARE CONFERENCE, VOL 219, 2023, 219
  • [40] A Comparative Study of Large Language Models for Goal Model Extraction
    Siddeshwar, Vaishali
    Alwidian, Sanaa
    Makrehchi, Masoud
    ACM/IEEE 27TH INTERNATIONAL CONFERENCE ON MODEL DRIVEN ENGINEERING LANGUAGES AND SYSTEMS: COMPANION PROCEEDINGS, MODELS 2024, 2024, : 253 - 263