The ability of artificial intelligence tools to formulate orthopaedic clinical decisions in comparison to human clinicians: An analysis of ChatGPT 3.5, ChatGPT 4, and Bard

Cited: 6
Authors
Agharia, Suzen [1 ]
Szatkowski, Jan [2 ]
Fraval, Andrew [1 ]
Stevens, Jarrad [1 ]
Zhou, Yushy [1 ,3 ]
Affiliations
[1] St Vincents Hosp, Dept Orthopaed Surg, Melbourne, Vic, Australia
[2] Indiana Univ Hlth Methodist Hosp, Dept Orthopaed Surg, Indianapolis, IN USA
[3] Level 2, Clin Sci Bldg, 29 Regent St, Fitzroy, Vic 3065, Australia
Keywords
AI; CHALLENGES; QUESTIONS
DOI
10.1016/j.jor.2023.11.063
Chinese Library Classification (CLC)
R826.8 [Plastic Surgery]; R782.2 [Oral and Maxillofacial Plastic Surgery]; R726.2 [Pediatric Plastic Surgery]; R62 [Plastic Surgery (Reconstructive Surgery)]
Abstract
Background: Recent advancements in artificial intelligence (AI) have sparked interest in its integration into clinical medicine and education. This study evaluates the performance of three AI tools compared to human clinicians in addressing complex orthopaedic decisions in real-world clinical cases.
Questions/purposes: To evaluate the ability of commonly used AI tools to formulate orthopaedic clinical decisions in comparison to human clinicians.
Patients and methods: The study used OrthoBullets Cases, a publicly available clinical case collaboration platform where surgeons from around the world choose treatment options in peer-reviewed, standardised treatment polls. The clinical cases cover various orthopaedic categories. Three AI tools (ChatGPT 3.5, ChatGPT 4, and Bard) were evaluated. Uniform prompts were used to input case information, including questions relating to each case, and the AI tools' responses were analysed for alignment with the most popular human response, as well as for responses within 10% and within 20% of the most popular human response.
Results: In total, 8 clinical categories comprising 97 questions were analysed. ChatGPT 4 demonstrated the highest proportion of most popular responses (ChatGPT 4 68.0%, ChatGPT 3.5 40.2%, Bard 45.4%; P < 0.001), outperforming the other AI tools. The AI tools performed more poorly on questions considered controversial (where human responses disagreed). Inter-tool agreement, evaluated using Cohen's kappa coefficient, ranged from 0.201 (ChatGPT 4 vs. Bard) to 0.634 (ChatGPT 3.5 vs. Bard). AI tool responses also varied widely, underscoring the need for consistency in real-world clinical applications.
Conclusions: While AI tools demonstrated potential for use in educational contexts, their integration into clinical decision-making warrants caution due to inconsistent responses and deviations from peer consensus. Future research should focus on developing specialised clinical AI tools to maximise their utility in clinical decision-making.
Level of evidence: IV.
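As a rough illustration of the analyses the abstract describes (alignment with the most popular human poll option, the within-10%/within-20% margins, and inter-tool agreement via Cohen's kappa), the following Python sketch uses placeholder data; the vote shares, tool choices, and variable names are illustrative assumptions, not the study's data or the authors' code.

from sklearn.metrics import cohen_kappa_score

# Hypothetical poll data: human vote shares per option for each question.
human_vote_shares = [
    {"A": 0.62, "B": 0.28, "C": 0.10},  # question 1: clear consensus
    {"A": 0.35, "B": 0.40, "C": 0.25},  # question 2: controversial
]
# Hypothetical option choices made by two AI tools on the same questions.
gpt4_choices = ["A", "A"]
bard_choices = ["A", "C"]

def within_margin_of_top(choice, shares, margin=0.0):
    # True if the chosen option's vote share is within `margin` of the
    # most popular option's share (margin=0 means it IS the top option).
    return shares[choice] >= max(shares.values()) - margin

# Proportion of answers matching the most popular human response,
# plus the within-10% and within-20% variants reported in the paper.
for margin in (0.0, 0.10, 0.20):
    hits = sum(within_margin_of_top(c, s, margin)
               for c, s in zip(gpt4_choices, human_vote_shares))
    print(f"within {margin:.0%} of top: {hits / len(gpt4_choices):.0%}")

# Inter-tool agreement (the paper reports kappas from 0.201 to 0.634).
print("Cohen's kappa:", cohen_kappa_score(gpt4_choices, bard_choices))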
Pages: 1-7
Page count: 7
Related Papers
50 records
  • [41] Comment on: ‘Comparison of GPT-3.5, GPT-4, and human user performance on a practice ophthalmology written examination’ and ‘ChatGPT in ophthalmology: the dawn of a new era?’
    Nima Ghadiri
    EYE, 2024, 38 : 654 - 655
  • [42] Can large language models pass official high-grade exams of the European Society of Neuroradiology courses? A direct comparison between OpenAI chatGPT 3.5, OpenAI GPT4 and Google Bard
    D'Anna, Gennaro
    Van Cauter, Sofie
    Thurnher, Majda
    Van Goethem, Johan
    Haller, Sven
    NEURORADIOLOGY, 2024, 66 (08) : 1245 - 1250
  • [43] Artificial Intelligence-Based Chatbots' Ability to Interpret Mammography Images: A Comparison of Chat-GPT 4o and Claude 3.5
    Karahan, Betul Nalan
    Emekli, Emre
    Altin, Mahmut Altug
    EUROPEAN JOURNAL OF THERAPEUTICS, 2025, 31 (01) : 28 - 34
  • [44] Self-Captured Images Recognition by Artificial Intelligence (AI) in Common Nephrology Medications: A Comparative Analysis of ChatGPT-4 and Claude 3 Opus
    Sheikh, M. Salman
    Dreesman, Benjamin
    Barreto, Erin F.
    Miao, Jing
    Thongprayoon, Charat
    Qureshi, Fawad
    Craici, Iasmina
    Kashani, Kianoush
    Cheungpasitporn, Wisit
    JOURNAL OF THE AMERICAN SOCIETY OF NEPHROLOGY, 2024, 35 (10)
  • [45] Advancing Artificial Intelligence for Clinical Knowledge Retrieval: A Case Study Using ChatGPT-4 and Link Retrieval Plug-In to Analyze Diabetic Ketoacidosis Guidelines
    Hamed, Ehab
    Sharif, Anna
    Eid, Ahmad
    Alfehaidi, Alanoud
    Alberry, Medhat
    CUREUS JOURNAL OF MEDICAL SCIENCE, 2023, 15 (07)
  • [46] Artificial Intelligence Tools and Bias in Journalism-related Content Generation: Comparison Between Chat GPT3.5, GPT-4 and Bing
    Castillo-Campos, Mar
    Varona-Aramburu, David
    Becerra-Alonso, David
    TRIPODOS, 2024, (55) : 99 - 115
  • [47] Artificial Intelligence in Ophthalmology: A Comparative Analysis of GPT-3.5, GPT-4, and Human Expertise in Answering StatPearls Questions
    Moshirfar, Majid
    Altaf, Amal W.
    Stoakes, Isabella M.
    Tuttle, Jared J.
    Hoopes, Phillip C.
    CUREUS JOURNAL OF MEDICAL SCIENCE, 2023, 15 (06)
  • [48] Quality of Information Provided by Artificial Intelligence Chatbots Surrounding the Reconstructive Surgery for Head and Neck Cancer: A Comparative Analysis Between ChatGPT4 and Claude2
    Boscolo-Rizzo, Paolo
    Marcuzzo, Alberto Vito
    Lazzarin, Chiara
    Giudici, Fabiola
    Polesel, Jerry
    Stellin, Marco
    Pettorelli, Andrea
    Spinato, Giacomo
    Ottaviano, Giancarlo
    Ferrari, Marco
    Borsetto, Daniele
    Zucchini, Simone
    Trabalzini, Franco
    Sia, Egidio
    Gardenal, Nicoletta
    Baruca, Roberto
    Fortunati, Alfonso
    Vaira, Luigi Angelo
    Tirelli, Giancarlo
    CLINICAL OTOLARYNGOLOGY, 2025, 50 (02) : 330 - 335
  • [49] Artificial Intelligence (ChatGPT-4o) in Adjuvant Treatment Decision-Making for Stage II Colon Cancer: A Comparative Analysis with Clinician Recommendations and NCCN/ESMO Guidelines
    Kus, Fatih
    Chalabiyev, Elvin
    Yildirim, Hasan Cagri
    Koc Kus, Ilgin
    Sirvan, Firat
    Dizdar, Omer
    Yalcin, Suayib
    UHOD-ULUSLARARASI HEMATOLOJI-ONKOLOJI DERGISI, 2025, 35 (01) : 68 - 74
  • [50] Evaluating Artificial Intelligence in Spinal Cord Injury Management: A Comparative Analysis of ChatGPT-4o and Google Gemini Against American College of Surgeons Best Practices Guidelines for Spine Injury
    Yu, Alexander
    Li, Albert
    Ahmed, Wasil
    Saturno, Michael
    Cho, Samuel K.
    GLOBAL SPINE JOURNAL, 2025,