Assessing the performance of large language models (GPT-3.5 and GPT-4) in providing accurate clinical information for pediatric nephrology

Cited: 0
Author(s)
Sav, Nadide Melike [1 ]
Institution(s)
[1] Duzce Univ, Dept Pediat Nephrol, Duzce, Turkiye
Keywords
Artificial intelligence; ChatGPT; Clinical decision support systems; Cohen's d; Cronbach's alpha; Pediatric nephrology;
DOI
10.1007/s00467-025-06723-3
Chinese Library Classification (CLC)
R72 [Pediatrics];
Subject Classification Code
100202;
Abstract
Background: Artificial intelligence (AI) has emerged as a transformative tool in healthcare, offering significant advances in providing accurate clinical information. However, the performance and applicability of AI models in specialized fields such as pediatric nephrology remain underexplored. This study aimed to evaluate the ability of two AI-based language models, GPT-3.5 and GPT-4, to provide accurate and reliable clinical information in pediatric nephrology. The models were evaluated on four criteria: accuracy, scope, patient friendliness, and clinical applicability.
Methods: Forty pediatric nephrology specialists with at least 5 years of experience rated GPT-3.5 and GPT-4 responses to 10 clinical questions on a 1-5 scale via Google Forms. Ethical approval was obtained, and informed consent was secured from all participants.
Results: GPT-3.5 and GPT-4 demonstrated comparable performance across all criteria, with no statistically significant differences observed (p > 0.05). GPT-4 showed slightly higher mean scores on all parameters, but the differences were negligible (Cohen's d < 0.1 for all criteria). Reliability analysis revealed low internal consistency for both models (Cronbach's alpha between 0.019 and 0.162). Correlation analysis indicated no significant relationship between participants' years of professional experience and their evaluations of GPT-3.5 (correlation coefficients from -0.026 to 0.074).
Conclusions: While GPT-3.5 and GPT-4 provided a foundational level of clinical information support, neither model showed superior performance in addressing the unique challenges of pediatric nephrology. The findings highlight the need for domain-specific training and integration of updated clinical guidelines to enhance the applicability and reliability of AI models in specialized fields. This study underscores the potential of AI in pediatric nephrology while emphasizing the importance of human oversight and the need for further refinement of AI applications.
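The two statistics the abstract leans on (Cohen's d for the GPT-3.5 vs. GPT-4 effect size, Cronbach's alpha for inter-item reliability) can be sketched as below. The rating matrices are hypothetical illustrative data, not the study's actual 40-rater responses; the formulas are the standard pooled-SD Cohen's d and classical Cronbach's alpha.

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

def cronbach_alpha(ratings):
    """Cronbach's alpha; rows = raters, columns = items (evaluation criteria)."""
    X = np.asarray(ratings, dtype=float)
    k = X.shape[1]                           # number of items
    item_vars = X.var(axis=0, ddof=1).sum()  # sum of per-item variances
    total_var = X.sum(axis=1).var(ddof=1)    # variance of each rater's total score
    return k / (k - 1) * (1 - item_vars / total_var)

# Hypothetical 1-5 ratings from four raters on the four criteria
# (accuracy, scope, patient friendliness, clinical applicability)
gpt35 = np.array([[4, 3, 4, 3], [3, 3, 5, 4], [5, 4, 3, 3], [3, 4, 4, 2]])
gpt4  = np.array([[4, 4, 4, 3], [3, 4, 4, 4], [4, 4, 3, 4], [4, 4, 4, 3]])

d = cohens_d(gpt4.ravel(), gpt35.ravel())
alpha = cronbach_alpha(gpt35)
print(f"Cohen's d = {d:.3f}, Cronbach's alpha = {alpha:.3f}")
```

With ratings this close, d stays small (consistent with the reported d < 0.1 pattern), and alpha can be near zero or even negative when raters disagree across items, which is how values like 0.019 arise.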
Pages: 7
Related Papers
50 records in total
  • [41] RE: Exploring new educational approaches in neuropathic pain: assessing accuracy and consistency of AI responses from GPT-3.5 and GPT-4
    Daungsupawong, Hinpetch
    Wiwanitkit, Viroj
    PAIN MEDICINE, 2024,
  • [42] Large language models for structured reporting in radiology: performance of GPT-4, ChatGPT-3.5, Perplexity and Bing
    Mallio, Carlo A.
    Sertorio, Andrea C.
    Bernetti, Caterina
    Beomonte Zobel, Bruno
    RADIOLOGIA MEDICA, 2023, 128 (07): 808 - 812
  • [43] RE: Exploring new educational approaches in neuropathic pain: assessing accuracy and consistency of AI responses from GPT-3.5 and GPT-4
    Garcia-Rudolph, Alejandro
    Sanchez-Pinsach, David
    Opisso, Eloy
    Soler, Maria Dolors
    PAIN MEDICINE, 2024,
  • [44] Advancements in AI for Gastroenterology Education: An Assessment of OpenAI's GPT-4 and GPT-3.5 in MKSAP Question Interpretation
    Patel, Akash
    Samreen, Isha
    Ahmed, Imran
    AMERICAN JOURNAL OF GASTROENTEROLOGY, 2024, 119 (10S): S1580 - S1580
  • [45] Evaluation of Reliability, Repeatability, Robustness, and Confidence of GPT-3.5 and GPT-4 on a Radiology Board-style Examination
    Krishna, Satheesh
    Bhambra, Nishaant
    Bleakney, Robert
    Bhayana, Rajesh
    RADIOLOGY, 2024, 311 (02)
  • [46] Toward Improved Radiologic Diagnostics: Investigating the Utility and Limitations of GPT-3.5 Turbo and GPT-4 with Quiz Cases
    Kikuchi, Tomohiro
    Nakao, Takahiro
    Nakamura, Yuta
    Hanaoka, Shouhei
    Mori, Harushi
    Yoshikawa, Takeharu
    AMERICAN JOURNAL OF NEURORADIOLOGY, 2024, 45 (10) : 1506 - 1511
  • [47] Comparative evaluation of artificial intelligence models GPT-4 and GPT-3.5 in clinical decision-making in sports surgery and physiotherapy: a cross-sectional study
    Saglam, Sönmez
    Uludag, Veysel
    Karaduman, Zekeriya Okan
    Arıcan, Mehmet
    Yücel, Mücahid Osman
    Dalaslan, Raşit Emin
    BMC Medical Informatics and Decision Making, 25 (1)
  • [48] Comment on: ‘Comparison of GPT-3.5, GPT-4, and human user performance on a practice ophthalmology written examination’ and ‘ChatGPT in ophthalmology: the dawn of a new era?’
    Ghadiri, Nima
    EYE, 2024, 38: 654 - 655
  • [49] ChatGPT, GPT-4, and Other Large Language Models: The Next Revolution for Clinical Microbiology?
    Egli, Adrian
    CLINICAL INFECTIOUS DISEASES, 2023, 77 (09) : 1322 - 1328
  • [50] Evaluating capabilities of large language models: Performance of GPT-4 on surgical knowledge assessments
    Beaulieu-Jones, Brendin R.
    Berrigan, Margaret T.
    Shah, Sahaj
    Marwaha, Jayson S.
    Lai, Shuo-Lun
    Brat, Gabriel A.
    SURGERY, 2024, 175 (04) : 936 - 942