The ability of artificial intelligence tools to formulate orthopaedic clinical decisions in comparison to human clinicians: An analysis of ChatGPT 3.5, ChatGPT 4, and Bard

Cited by: 6
Authors
Agharia, Suzen [1 ]
Szatkowski, Jan [2 ]
Fraval, Andrew [1 ]
Stevens, Jarrad [1 ]
Zhou, Yushy [1 ,3 ]
Affiliations
[1] St Vincents Hosp, Dept Orthopaed Surg, Melbourne, Vic, Australia
[2] Indiana Univ Hlth Methodist Hosp, Dept Orthopaed Surg, Indianapolis, IN USA
[3] Level 2, Clin Sci Bldg, 29 Regent St, Fitzroy, Vic 3065, Australia
Keywords
AI; CHALLENGES; QUESTIONS
DOI
10.1016/j.jor.2023.11.063
Chinese Library Classification (CLC)
R826.8 [Plastic Surgery]; R782.2 [Oral and Maxillofacial Plastic Surgery]; R726.2 [Pediatric Plastic Surgery]; R62 [Plastic Surgery (Reconstructive Surgery)]
Abstract
Background: Recent advancements in artificial intelligence (AI) have sparked interest in its integration into clinical medicine and education. This study evaluates the performance of three AI tools compared to human clinicians in addressing complex orthopaedic decisions in real-world clinical cases.
Questions/purposes: To evaluate the ability of commonly used AI tools to formulate orthopaedic clinical decisions in comparison to human clinicians.
Patients and methods: The study used OrthoBullets Cases, a publicly available clinical case collaboration platform where surgeons from around the world choose treatment options in peer-reviewed, standardised treatment polls. The clinical cases cover various orthopaedic categories. Three AI tools (ChatGPT 3.5, ChatGPT 4, and Bard) were evaluated. Uniform prompts were used to input the case information, including questions relating to each case, and the AI tools' responses were analysed for alignment with the most popular human response, as well as with responses within 10% and within 20% of the most popular human response.
Results: In total, 8 clinical categories comprising 97 questions were analysed. ChatGPT 4 demonstrated the highest proportion of most popular responses (ChatGPT 4 68.0%, ChatGPT 3.5 40.2%, Bard 45.4%; P < 0.001), outperforming the other AI tools. AI tools performed worse on questions considered controversial (where disagreement occurred among human responses). Inter-tool agreement, evaluated using Cohen's kappa coefficient, ranged from 0.201 (ChatGPT 4 vs. Bard) to 0.634 (ChatGPT 3.5 vs. Bard). However, AI tool responses varied widely, reflecting a need for consistency in real-world clinical applications.
Conclusions: While AI tools demonstrated potential use in educational contexts, their integration into clinical decision-making requires caution due to inconsistent responses and deviations from peer consensus. Future research should focus on developing specialised clinical AI tools to maximise utility in clinical decision-making.
Level of evidence: IV.
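The two headline analyses described in the abstract, agreement with the most popular human poll response and pairwise inter-tool agreement via Cohen's kappa, can be illustrated with a minimal sketch. The poll data and tool choices below are invented placeholders, and the use of scikit-learn's cohen_kappa_score is an assumed convenience, not the authors' actual analysis pipeline.

```python
# Minimal sketch (hypothetical data, not the study's code): for each question,
# an AI tool picks one poll option; we compute (1) the proportion of questions
# where the tool matched the most popular human option and (2) Cohen's kappa
# between two tools' categorical choices.
from sklearn.metrics import cohen_kappa_score

# Hypothetical human poll results: option label -> percentage of surgeon votes
human_polls = [
    {"ORIF": 62, "Nonoperative": 28, "Arthroplasty": 10},
    {"ORIF": 45, "Nonoperative": 40, "Arthroplasty": 15},
    {"ORIF": 20, "Nonoperative": 70, "Arthroplasty": 10},
]
# Hypothetical tool responses, one chosen option per question
chatgpt4 = ["ORIF", "ORIF", "Nonoperative"]
bard = ["ORIF", "Nonoperative", "Arthroplasty"]

def most_popular_agreement(tool_choices, polls):
    """Proportion of questions where the tool chose the most popular human option."""
    hits = sum(
        choice == max(poll, key=poll.get)
        for choice, poll in zip(tool_choices, polls)
    )
    return hits / len(polls)

print("ChatGPT 4 vs. most popular response:", most_popular_agreement(chatgpt4, human_polls))
print("Bard vs. most popular response:", most_popular_agreement(bard, human_polls))

# Pairwise inter-tool agreement on the same categorical choices;
# cohen_kappa_score accepts string labels directly.
print("Cohen's kappa (ChatGPT 4 vs. Bard):", cohen_kappa_score(chatgpt4, bard))
```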
Pages: 1 - 7
Number of pages: 7
Related Papers
50 records in total
  • [21] Comparative analysis of artificial intelligence-driven assistance in diverse educational queries: ChatGPT vs. Google Bard
    Al Mashagbeh, Mohammad
    Dardas, Latefa
    Alzaben, Heba
    Alkhayat, Amjad
    FRONTIERS IN EDUCATION, 2024, 9
  • [22] Utilizing Artificial Intelligence-Based Tools for Addressing Clinical Queries: ChatGPT Versus Google Gemini
    Labrague, Leodoro J.
    JOURNAL OF NURSING EDUCATION, 2024, 63 (08) : 556 - 559
  • [23] Readability, quality and accuracy of generative artificial intelligence chatbots for commonly asked questions about labor epidurals: a comparison of ChatGPT and Bard
    Lee, D.
    Brown, M.
    Hammond, J.
    Zakowski, M.
    INTERNATIONAL JOURNAL OF OBSTETRIC ANESTHESIA, 2025, 61
  • [24] Exploring the use of generative artificial intelligence in systematic searching: A comparative case study of a human librarian, ChatGPT-4 and ChatGPT-4 Turbo
    Chen, Xiayu Summer
    Feng, Yali
    IFLA JOURNAL-INTERNATIONAL FEDERATION OF LIBRARY ASSOCIATIONS, 2024,
  • [25] Harnessing artificial intelligence in bariatric surgery: comparative analysis of ChatGPT-4, Bing, and Bard in generating clinician-level bariatric surgery recommendations
    Lee, Yung
    Shin, Thomas
    Tessier, Lea
    Javidan, Arshia
    Jung, James
    Hong, Dennis
    Strong, Andrew T.
    McKechnie, Tyler
    Malone, Sarah
    Jin, David
    Kroh, Matthew
    Dang, Jerry T.
    SURGERY FOR OBESITY AND RELATED DISEASES, 2024, 20 (07) : 603 - 608
  • [26] Evaluation of the Current Status of Artificial Intelligence for Endourology Patient Education: A Blind Comparison of ChatGPT and Google Bard Against Traditional Information Resources
    Connors, Christopher
    Gupta, Kavita
    Khusid, Johnathan A.
    Khargi, Raymond
    Yaghoubian, Alan J.
    Levy, Micah
    Gallante, Blair
    Atallah, William
    Gupta, Mantu
    JOURNAL OF ENDOUROLOGY, 2024, 38 (08) : 843 - 851
  • [27] Comparison of ChatGPT version 3.5 & 4 for utility in respiratory medicine education using clinical case scenarios
    Balasanjeevi, Gayathri
    Surapaneni, Krishna Mohan
    RESPIRATORY MEDICINE AND RESEARCH, 2024, 85
  • [28] Comparative Analysis of Artificial Intelligence Platforms: ChatGPT-3.5 and Google Bard in Identifying Red Flags of Low Back Pain
    Muluk, Selkin Yilmaz
    Olcucu, Nazli
    CUREUS JOURNAL OF MEDICAL SCIENCE, 2024, 16 (07)
  • [29] Distinguishing ChatGPT(-3.5, -4)-generated and human-written papers through Japanese stylometric analysis
    Zaitsu, Wataru
    Jin, Mingzhe
    arXiv, 2023,
  • [30] Performance of artificial intelligence in bariatric surgery: comparative analysis of ChatGPT-4, Bing, and Bard in the American Society for Metabolic and Bariatric Surgery textbook of bariatric surgery questions
    Lee, Yung
    Brar, Karanbir
    Malone, Sarah
    Jin, David
    McKechnie, Tyler
    Jung, James J.
    Kroh, Matthew
    Dang, Jerry T.
    SURGERY FOR OBESITY AND RELATED DISEASES, 2024, 20 (07) : 609 - 613