Assessing the performance of large language models (GPT-3.5 and GPT-4) in providing accurate clinical information for pediatric nephrology

Cited by: 0
Authors
Sav, Nadide Melike [1]
Affiliations
[1] Duzce Univ, Dept Pediat Nephrol, Duzce, Turkiye
Keywords
Artificial intelligence; ChatGPT; Clinical decision support systems; Cohen's d; Cronbach's alpha; Pediatric nephrology
DOI
10.1007/s00467-025-06723-3
Chinese Library Classification (CLC)
R72 [Pediatrics]
Subject classification code
100202
Abstract
Background: Artificial intelligence (AI) has emerged as a transformative tool in healthcare, offering significant advances in providing accurate clinical information. However, the performance and applicability of AI models in specialized fields such as pediatric nephrology remain underexplored. This study aimed to evaluate the ability of two AI-based language models, GPT-3.5 and GPT-4, to provide accurate and reliable clinical information in pediatric nephrology. The models were evaluated on four criteria: accuracy, scope, patient friendliness, and clinical applicability.
Methods: Forty pediatric nephrology specialists with at least 5 years of experience rated GPT-3.5 and GPT-4 responses to 10 clinical questions on a 1-5 scale via Google Forms. Ethical approval was obtained, and informed consent was secured from all participants.
Results: GPT-3.5 and GPT-4 performed comparably across all criteria, with no statistically significant differences (p > 0.05). GPT-4 achieved slightly higher mean scores on every parameter, but the differences were negligible (Cohen's d < 0.1 for all criteria). Reliability analysis revealed low internal consistency for both models (Cronbach's alpha between 0.019 and 0.162). Correlation analysis indicated no significant relationship between participants' years of professional experience and their evaluations of GPT-3.5 (correlation coefficients from -0.026 to 0.074).
Conclusions: While GPT-3.5 and GPT-4 provided a foundational level of clinical information support, neither model showed superior performance in addressing the unique challenges of pediatric nephrology. The findings highlight the need for domain-specific training and integration of up-to-date clinical guidelines to improve the applicability and reliability of AI models in specialized fields. This study underscores the potential of AI in pediatric nephrology while emphasizing the importance of human oversight and the need for further refinement of AI applications.
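The abstract's key statistics (Cohen's d and Cronbach's alpha) follow standard formulas. The Python sketch below illustrates how they could be computed for rating data of the shape described (40 raters, 10 questions, 1-5 scale). It is a minimal sketch, not the study's actual analysis: the data are randomly generated placeholders, and comparing per-rater mean scores between models is an assumption.

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d for two independent groups, using the pooled SD."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    n1, n2 = len(a), len(b)
    pooled_sd = np.sqrt(((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1))
                        / (n1 + n2 - 2))
    return (a.mean() - b.mean()) / pooled_sd

def cronbach_alpha(scores):
    """Cronbach's alpha for an (n_raters x n_items) score matrix."""
    scores = np.asarray(scores, float)
    k = scores.shape[1]                          # number of items
    item_var = scores.var(axis=0, ddof=1).sum()  # sum of per-item variances
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of summed scores
    return (k / (k - 1)) * (1 - item_var / total_var)

# Hypothetical data: 40 raters x 10 questions, 1-5 Likert scale per model.
rng = np.random.default_rng(seed=0)
gpt35 = rng.integers(1, 6, size=(40, 10))
gpt4 = rng.integers(1, 6, size=(40, 10))

print(f"Cohen's d (GPT-4 vs GPT-3.5): {cohens_d(gpt4.mean(axis=1), gpt35.mean(axis=1)):+.3f}")
print(f"Cronbach's alpha (GPT-3.5):   {cronbach_alpha(gpt35):.3f}")
```

Note that with |d| < 0.1, even a statistically significant test would indicate a practically trivial difference between models, which matches the abstract's reading of the results.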
Pages: 7
Related articles
50 records in total
  • [21] Large language models and bariatric surgery patient education: a comparative readability analysis of GPT-3.5, GPT-4, Bard, and online institutional resources
    Srinivasan, Nitin; Samaan, Jamil S.; Rajeev, Nithya D.; Kanu, Mmerobasi U.; Yeo, Yee Hui; Samakar, Kamran
    Surgical Endoscopy, 2024, 38: 2522-2532
  • [22] Comparing GPT-3.5 and GPT-4 Accuracy and Drift in Radiology Diagnosis Please Cases
    Li, David; Gupta, Kartik; Bhaduri, Mousumi; Sathiadoss, Paul; Bhatnagar, Sahir; Chong, Jaron
    Radiology, 2024, 310(1)
  • [23] GPT-3.5 Turbo and GPT-4 Turbo in Title and Abstract Screening for Systematic Reviews
    Oami, Takehiko; Okada, Yohei; Nakada, Taka-aki
    JMIR Medical Informatics, 2025, 13
  • [24] A Comparison Between GPT-3.5, GPT-4, and GPT-4V: Can the Large Language Model (ChatGPT) Pass the Japanese Board of Orthopaedic Surgery Examination?
    Nakajima, Nozomu; Fujimori, Takahito; Furuya, Masayuki; Kanie, Yuya; Imai, Hirotatsu; Kita, Kosuke; Uemura, Keisuke; Okada, Seiji
    Cureus Journal of Medical Science, 2024, 16(3)
  • [25] Evaluating GPT as an Adjunct for Radiologic Decision Making: GPT-4 Versus GPT-3.5 in a Breast Imaging Pilot
    Rao, Arya; Kim, John; Kamineni, Meghana; Pang, Michael; Lie, Winston; Dreyer, Keith J.; Succi, Marc D.
    Journal of the American College of Radiology, 2023, 20(10): 990-997
  • [26] Comparing Vision-Capable Models, GPT-4 and Gemini, With GPT-3.5 on Taiwan's Pulmonologist Exam
    Chen, Chih-Hsiung; Hsieh, Kuang-Yu; Huang, Kuo-En; Lai, Hsien-Yun
    Cureus Journal of Medical Science, 2024, 16(8)
  • [27] Inconsistently Accurate: Repeatability of GPT-3.5 and GPT-4 in Answering Radiology Board-style Multiple Choice Questions
    Ballard, David H.
    Radiology, 2024, 311(2)
  • [28] ChatGPT as a Source of Information for Bariatric Surgery Patients: a Comparative Analysis of Accuracy and Comprehensiveness Between GPT-4 and GPT-3.5
    Samaan, Jamil S.; Rajeev, Nithya; Ng, Wee Han; Srinivasan, Nitin; Busam, Jonathan A.; Yeo, Yee Hui; Samakar, Kamran
    Obesity Surgery, 2024, 34: 1987-1989
  • [29] Performance of GPT-3.5 and GPT-4 on standardized urology knowledge assessment items in the United States: a descriptive study
    Yudovich, Max Samuel; Makarova, Elizaveta; Hague, Christian Michael; Raman, Jay Dilip
    Journal of Educational Evaluation for Health Professions, 2024, 21: 17
  • [30] Limitations of GPT-3.5 and GPT-4 in Applying Fleischner Society Guidelines to Incidental Lung Nodules
    Gamble, Joel; Ferguson, Duncan; Yuen, Joanna; Sheikh, Adnan
    Canadian Association of Radiologists Journal / Journal de l'Association Canadienne des Radiologistes, 2024, 75(2): 412-416