AI chatbots show promise but limitations on UK medical exam questions: a comparative performance study

Cited by: 2
Authors
Sadeq, Mohammed Ahmed [1 ,2 ,13 ]
Ghorab, Reem Mohamed Farouk [1 ,2 ,13 ]
Ashry, Mohamed Hady [2 ,3 ]
Abozaid, Ahmed Mohamed [2 ,4 ]
Banihani, Haneen A. [2 ,5 ]
Salem, Moustafa [2 ,6 ]
Aisheh, Mohammed Tawfiq Abu [2 ,7 ]
Abuzahra, Saad [2 ,7 ]
Mourid, Marina Ramzy [2 ,8 ]
Assker, Mohamad Monif [2 ,9 ]
Ayyad, Mohammed [2 ,10 ]
Moawad, Mostafa Hossam El Din [2 ,11 ,12 ]
Affiliations
[1] Misr Univ Sci & Technol, 6th Of October City, Egypt
[2] Med Res Platform MRP, Giza, Egypt
[3] New Giza Univ NGU, Sch Med, Giza, Egypt
[4] Tanta Univ, Fac Med, Tanta, Egypt
[5] Univ Jordan, Fac Med, Amman, Jordan
[6] Mansoura Univ, Fac Med, Mansoura, Egypt
[7] Annajah Natl Univ, Coll Med & Hlth Sci, Dept Med, Nablus 44839, Palestine
[8] Alexandria Univ, Fac Med, Alexandria, Egypt
[9] Sheikh Khalifa Med City, Abu Dhabi, U Arab Emirates
[10] Al Quds Univ, Fac Med, Jerusalem, Palestine
[11] Alexandria Univ, Fac Pharm, Dept Clin, Alexandria, Egypt
[12] Suez Canal Univ, Fac Med, Ismailia, Egypt
[13] Elsheikh Zayed Specialized Hosp, Emergency Med Dept, Elsheikh Zayed City, Egypt
Source
SCIENTIFIC REPORTS | 2024, Vol. 14, No. 1
Keywords
ARTIFICIAL-INTELLIGENCE;
DOI
10.1038/s41598-024-68996-2
Chinese Library Classification (CLC)
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biosciences]; N [General Natural Sciences]
Subject classification
07; 0710; 09
Abstract
Large language models (LLMs) like ChatGPT have potential applications in medical education, such as helping students prepare for licensing exams by discussing unclear questions with them. However, they require evaluation on these complex tasks. The purpose of this study was to evaluate how well publicly accessible LLMs performed on simulated UK medical board exam questions. 423 board-style questions from 9 UK exams (MRCS, MRCP, etc.) were answered by seven LLMs (ChatGPT-3.5, ChatGPT-4, Bard, Perplexity, Claude, Bing, Claude Instant). There were 406 multiple-choice, 13 true/false, and 4 "choose N" questions covering topics in surgery, pediatrics, and other disciplines. The accuracy of each output was graded, and statistical tests were used to analyze differences among the LLMs. Leaked questions (those potentially present in the models' training data) were excluded from the primary analysis. ChatGPT-4.0 scored highest (78.2%), followed by Bing (67.2%), Claude (64.4%), and Claude Instant (62.9%); Perplexity scored lowest (56.1%). Scores differed significantly between LLMs overall (p < 0.001) and in pairwise comparisons. All LLMs scored higher on multiple-choice questions than on true/false or "choose N" questions. The LLMs demonstrated limitations in answering certain questions, indicating that refinements are needed before they can serve as a primary resource in medical education. However, their expanding capabilities suggest the potential to improve training if thoughtfully implemented. Further research should explore specialty-specific LLMs and their optimal integration into medical curricula.
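A minimal sketch, in Python, of the kind of analysis the abstract describes: an overall test of score differences across models followed by pairwise comparisons. The abstract does not name the specific statistical tests, so the chi-square test, two-proportion z-tests, and Bonferroni correction below are assumptions, and the per-model correct-answer counts are reconstructed from the reported percentages purely for illustration.

from itertools import combinations

import numpy as np
from scipy.stats import chi2_contingency
from statsmodels.stats.proportion import proportions_ztest

N_QUESTIONS = 423  # board-style questions per model, as reported

# Accuracy per model from the abstract; counts are rounded reconstructions,
# not the study's raw data.
accuracy = {
    "ChatGPT-4.0": 0.782,
    "Bing": 0.672,
    "Claude": 0.644,
    "Claude Instant": 0.629,
    "Perplexity": 0.561,
}
correct = {m: round(p * N_QUESTIONS) for m, p in accuracy.items()}

# Overall test: k x 2 contingency table of correct vs. incorrect per model.
table = np.array([[c, N_QUESTIONS - c] for c in correct.values()])
chi2, p, dof, _ = chi2_contingency(table)
print(f"overall: chi2={chi2:.1f}, dof={dof}, p={p:.2e}")

# Pairwise two-proportion z-tests with a Bonferroni-adjusted alpha.
pairs = list(combinations(correct, 2))
alpha = 0.05 / len(pairs)
for a, b in pairs:
    _, p_pair = proportions_ztest([correct[a], correct[b]],
                                  [N_QUESTIONS, N_QUESTIONS])
    print(f"{a} vs {b}: p={p_pair:.4f} "
          f"({'significant' if p_pair < alpha else 'n.s.'})")

With n = 423 per model, gaps of this size (for example, 78.2% vs. 56.1%) drive the overall chi-square p-value far below 0.001, consistent with the significance the abstract reports.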
Pages: 11