The ability of artificial intelligence tools to formulate orthopaedic clinical decisions in comparison to human clinicians: An analysis of ChatGPT 3.5, ChatGPT 4, and Bard

Cited by: 6
Authors
Agharia, Suzen [1]
Szatkowski, Jan [2]
Fraval, Andrew [1]
Stevens, Jarrad [1]
Zhou, Yushy [1,3]
Affiliations
[1] St Vincent's Hosp, Dept Orthopaed Surg, Melbourne, Vic, Australia
[2] Indiana Univ Hlth Methodist Hosp, Dept Orthopaed Surg, Indianapolis, IN, USA
[3] Level 2, Clin Sci Bldg, 29 Regent St, Fitzroy, Vic 3065, Australia
Keywords
AI; CHALLENGES; QUESTIONS
DOI
10.1016/j.jor.2023.11.063
Chinese Library Classification (CLC)
R826.8 [Plastic Surgery]; R782.2 [Oral and Maxillofacial Plastic Surgery]; R726.2 [Pediatric Plastic Surgery]; R62 [Plastic Surgery (Reconstructive Surgery)]
Abstract
Background: Recent advancements in artificial intelligence (AI) have sparked interest in its integration into clinical medicine and education. This study evaluates the performance of three AI tools against human clinicians in addressing complex orthopaedic decisions in real-world clinical cases.

Questions/purposes: To evaluate the ability of commonly used AI tools to formulate orthopaedic clinical decisions in comparison to human clinicians.

Patients and methods: The study used OrthoBullets Cases, a publicly available clinical case collaboration platform on which surgeons from around the world choose treatment options through peer-reviewed, standardised treatment polls. The clinical cases cover various orthopaedic categories. Three AI tools (ChatGPT 3.5, ChatGPT 4, and Bard) were evaluated. Uniform prompts were used to input case information, including the questions relating to each case, and the AI tools' responses were analysed for alignment with the most popular human response, as well as for agreement within 10% and within 20% of the most popular human response.

Results: In total, 8 clinical categories comprising 97 questions were analysed. ChatGPT 4 achieved the highest proportion of most popular responses (ChatGPT 4: 68.0%, ChatGPT 3.5: 40.2%, Bard: 45.4%; P < 0.001), outperforming the other AI tools. The AI tools performed worse on questions considered controversial (those on which human responses disagreed). Inter-tool agreement, evaluated using Cohen's kappa coefficient, ranged from 0.201 (ChatGPT 4 vs. Bard) to 0.634 (ChatGPT 3.5 vs. Bard). AI tool responses varied widely, highlighting the need for greater consistency before real-world clinical application.

Conclusions: While the AI tools demonstrated potential for use in educational contexts, their integration into clinical decision-making requires caution owing to inconsistent responses and deviations from peer consensus. Future research should focus on developing specialised clinical AI tools to maximise their utility in clinical decision-making.

Level of evidence: IV.
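The comparison described in the abstract (agreement with the most popular human poll option, agreement within 10% and 20% of it, and inter-tool agreement via Cohen's kappa) can be illustrated with a minimal Python sketch. This is not the authors' analysis code: the poll data, answer labels, and the matches_most_popular helper are hypothetical, and Cohen's kappa is computed here with scikit-learn's cohen_kappa_score.

# Minimal sketch (hypothetical data, not the study's code): compare AI tool
# picks against the most popular human poll option and measure inter-tool
# agreement with Cohen's kappa.
from sklearn.metrics import cohen_kappa_score

# Hypothetical poll results: for each question, the share of human votes per option.
human_polls = [
    {"ORIF": 0.62, "Nonoperative": 0.28, "Arthroplasty": 0.10},
    {"ORIF": 0.45, "Nonoperative": 0.40, "Arthroplasty": 0.15},
]

# Hypothetical answers returned by two AI tools for the same questions.
gpt4_answers = ["ORIF", "Nonoperative"]
bard_answers = ["ORIF", "ORIF"]

def matches_most_popular(answers, polls, margin=0.0):
    """Proportion of answers whose human vote share is within `margin` of the top option."""
    hits = 0
    for answer, poll in zip(answers, polls):
        top_share = max(poll.values())
        if poll.get(answer, 0.0) >= top_share - margin:
            hits += 1
    return hits / len(answers)

# Proportion matching the single most popular option, and within 10% / 20% of it.
for margin in (0.0, 0.10, 0.20):
    print(f"GPT-4 within {margin:.0%}: {matches_most_popular(gpt4_answers, human_polls, margin):.1%}")

# Inter-tool agreement between the two tools' categorical picks.
print("Kappa (GPT-4 vs Bard):", cohen_kappa_score(gpt4_answers, bard_answers))

In the study itself, this kind of comparison was run over 97 questions across 8 categories; the sketch only shows the shape of the calculation.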
Pages: 1-7
Page count: 7