Assessing the performance of large language models (GPT-3.5 and GPT-4) in providing accurate clinical information for pediatric nephrology

Cited by: 0
Authors
Sav, Nadide Melike [1 ]
Affiliations
[1] Duzce Univ, Dept Pediat Nephrol, Duzce, Turkiye
Keywords
Artificial intelligence; ChatGPT; Clinical decision support systems; Cohen's d; Cronbach's alpha; Pediatric nephrology;
DOI
10.1007/s00467-025-06723-3
CLC Classification: R72 [Pediatrics]
Discipline Code: 100202
Abstract
Background: Artificial intelligence (AI) has emerged as a transformative tool in healthcare, offering significant advances in providing accurate clinical information. However, the performance and applicability of AI models in specialized fields such as pediatric nephrology remain underexplored. This study aimed to evaluate the ability of two AI-based language models, GPT-3.5 and GPT-4, to provide accurate and reliable clinical information in pediatric nephrology. The models were evaluated on four criteria: accuracy, scope, patient friendliness, and clinical applicability.

Methods: Forty pediatric nephrology specialists with at least 5 years of experience rated GPT-3.5 and GPT-4 responses to 10 clinical questions on a 1-5 scale via Google Forms. Ethical approval was obtained, and informed consent was secured from all participants.

Results: GPT-3.5 and GPT-4 performed comparably across all criteria, with no statistically significant differences (p > 0.05). GPT-4 achieved slightly higher mean scores on every parameter, but the differences were negligible (Cohen's d < 0.1 for all criteria). Reliability analysis revealed low internal consistency for both models (Cronbach's alpha between 0.019 and 0.162). Correlation analysis indicated no significant relationship between participants' years of professional experience and their evaluations of GPT-3.5 (correlation coefficients from -0.026 to 0.074).

Conclusions: While GPT-3.5 and GPT-4 provided a foundational level of clinical information support, neither model showed superior performance in addressing the unique challenges of pediatric nephrology. The findings highlight the need for domain-specific training and the integration of updated clinical guidelines to improve the applicability and reliability of AI models in specialized fields. This study underscores the potential of AI in pediatric nephrology while emphasizing the importance of human oversight and the need for further refinement of AI applications.
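The two statistics reported in the Results — Cohen's d for the GPT-3.5 vs. GPT-4 score difference and Cronbach's alpha for inter-item consistency of the ratings — can be sketched as below. The rating matrices here are invented for illustration only and do not reflect the study's actual data; the function names are assumptions, not the authors' code.

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled SD."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    na, nb = len(a), len(b)
    pooled_sd = np.sqrt(((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1))
                        / (na + nb - 2))
    return (a.mean() - b.mean()) / pooled_sd

def cronbach_alpha(ratings):
    """Cronbach's alpha for an (n_raters x n_items) score matrix."""
    ratings = np.asarray(ratings, float)
    k = ratings.shape[1]                       # number of items (criteria)
    item_vars = ratings.var(axis=0, ddof=1)    # variance of each item
    total_var = ratings.sum(axis=1).var(ddof=1)  # variance of rater totals
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Hypothetical 1-5 ratings from five raters on the four criteria
# (accuracy, scope, patient friendliness, clinical applicability)
gpt35 = np.array([[4, 3, 4, 3], [3, 3, 4, 4], [4, 4, 3, 3],
                  [5, 3, 4, 4], [3, 4, 4, 3]])
gpt4  = np.array([[4, 4, 4, 3], [3, 4, 4, 4], [4, 4, 4, 3],
                  [5, 4, 4, 4], [4, 4, 4, 3]])

print("Cohen's d (per-rater means):", cohens_d(gpt4.mean(axis=1), gpt35.mean(axis=1)))
print("Cronbach's alpha (GPT-3.5 ratings):", cronbach_alpha(gpt35))
```

With the study's reported alpha values below 0.2, the four criteria did not move together across raters; the sketch makes clear why that reads as low internal consistency rather than low accuracy.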
Pages: 7