AI chatbots show promise but limitations on UK medical exam questions: a comparative performance study

Cited by: 2
Authors
Sadeq, Mohammed Ahmed [1 ,2 ,13 ]
Ghorab, Reem Mohamed Farouk [1 ,2 ,13 ]
Ashry, Mohamed Hady [2 ,3 ]
Abozaid, Ahmed Mohamed [2 ,4 ]
Banihani, Haneen A. [2 ,5 ]
Salem, Moustafa [2 ,6 ]
Aisheh, Mohammed Tawfiq Abu [2 ,7 ]
Abuzahra, Saad [2 ,7 ]
Mourid, Marina Ramzy [2 ,8 ]
Assker, Mohamad Monif [2 ,9 ]
Ayyad, Mohammed [2 ,10 ]
Moawad, Mostafa Hossam El Din [2 ,11 ,12 ]
Affiliations
[1] Misr Univ Sci & Technol, 6th Of October City, Egypt
[2] Med Res Platform MRP, Giza, Egypt
[3] New Giza Univ NGU, Sch Med, Giza, Egypt
[4] Tanta Univ, Fac Med, Tanta, Egypt
[5] Univ Jordan, Fac Med, Amman, Jordan
[6] Mansoura Univ, Fac Med, Mansoura, Egypt
[7] Annajah Natl Univ, Coll Med & Hlth Sci, Dept Med, Nablus 44839, Palestine
[8] Alexandria Univ, Fac Med, Alexandria, Egypt
[9] Sheikh Khalifa Med City, Abu Dhabi, U Arab Emirates
[10] Al Quds Univ, Fac Med, Jerusalem, Palestine
[11] Alexandria Univ, Fac Pharm, Dept Clin, Alexandria, Egypt
[12] Suez Canal Univ, Fac Med, Ismailia, Egypt
[13] Elsheikh Zayed Specialized Hosp, Emergency Med Dept, Elsheikh Zayed City, Egypt
Source
SCIENTIFIC REPORTS | 2024 / Vol. 14 / Iss. 01
Keywords
ARTIFICIAL-INTELLIGENCE;
DOI
10.1038/s41598-024-68996-2
Chinese Library Classification
O [Mathematical Sciences and Chemistry]; P [Astronomy, Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences];
Subject Classification Codes
07; 0710; 09;
Abstract
Large language models (LLMs) such as ChatGPT have potential applications in medical education, for example helping students prepare for licensing exams by discussing unclear questions with them. However, their performance on such complex tasks requires evaluation. The purpose of this study was to evaluate how well publicly accessible LLMs performed on simulated UK medical board exam questions. Seven LLMs (ChatGPT-3.5, ChatGPT-4, Bard, Perplexity, Claude, Bing, Claude Instant) answered 423 board-style questions from nine UK exams (MRCS, MRCP, etc.): 406 multiple-choice, 13 true/false, and 4 "choose N" questions covering surgery, pediatrics, and other disciplines. The accuracy of each output was graded, and statistical tests were used to analyze differences among the LLMs; leaked questions were excluded from the primary analysis. ChatGPT-4 scored highest (78.2%), followed by Bing (67.2%), Claude (64.4%), and Claude Instant (62.9%); Perplexity scored lowest (56.1%). Scores differed significantly between LLMs overall (p < 0.001) and in pairwise comparisons. All LLMs scored higher on multiple-choice than on true/false or "choose N" questions. The LLMs' limitations in answering certain questions indicate that refinements are needed before they can be relied on as a primary resource in medical education; however, their expanding capabilities suggest a potential to improve training if thoughtfully implemented. Further research should explore specialty-specific LLMs and their optimal integration into medical curricula.
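The abstract reports significant overall and pairwise score differences (p < 0.001) without naming the statistical test used. As a hedged illustration only (the authors' actual method may differ), a standard two-proportion z-test on the reported top and bottom accuracies, with correct-answer counts reconstructed by assumption from the 423-question total, shows how such a pairwise comparison yields p < 0.001:

```python
from math import sqrt, erf

def two_proportion_z(correct_a, n_a, correct_b, n_b):
    """Two-sided z-test for the difference between two accuracy proportions."""
    p_a, p_b = correct_a / n_a, correct_b / n_b
    # Pooled proportion under the null hypothesis of equal accuracy
    p_pool = (correct_a + correct_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Assumed counts: 78.2% (ChatGPT-4) vs 56.1% (Perplexity) of 423 questions each
z, p = two_proportion_z(round(0.782 * 423), 423, round(0.561 * 423), 423)
```

With these assumed counts the gap of roughly 22 percentage points gives z ≈ 6.9, far beyond the p < 0.001 threshold, consistent with the significant pairwise differences the abstract reports.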
Pages: 11
Related Papers
50 records in total
  • [21] Correction to: Introducing multiple-choice questions to promote learning for medical students: effect on exam performance in obstetrics and gynecology
    Sebastian M. Jud
    Susanne Cupisti
    Wolfgang Frobenius
    Andrea Winkler
    Franziska Schultheis
    Sophia Antoniadis
    Matthias W. Beckmann
    Felix Heindl
    Archives of Gynecology and Obstetrics, 2021, 304 : 1627 - 1627
  • [22] Utilization of, Perceptions on, and Intention to Use AI Chatbots Among Medical Students in China: National Cross-Sectional Study
    Tao, Wenjuan
    Yang, Jinming
    Qu, Xing
    JMIR MEDICAL EDUCATION, 2024, 10
  • [23] Improving performance on the annual resident in service exam: a comparative study of independent study, lectures, and examinations
    Tseng, Michael
    Heo, Moonseong
    Barmettler, Anne
    INVESTIGATIVE OPHTHALMOLOGY & VISUAL SCIENCE, 2020, 61 (07)
  • [24] A preliminary study of emotional intelligence, empathy and exam performance in first year medical students
    Austin, EJ
    Evans, P
    Goldwater, R
    Potter, V
    PERSONALITY AND INDIVIDUAL DIFFERENCES, 2005, 39 (08) : 1395 - 1405
  • [25] Impact of familiarity with the format of the exam on performance in the OSCE of undergraduate medical students - an interventional study
    Neuwirt, Hannes
    Eder, Iris E.
    Gauckler, Philipp
    Horvath, Lena
    Koeck, Stefan
    Noflatscher, Maria
    Schaefer, Benedikt
    Simeon, Anja
    Petzer, Verena
    Prodinger, Wolfgang M.
    Berendonk, Christoph
    BMC MEDICAL EDUCATION, 2024, 24 (01)
  • [27] Study Finds People Prefer AI Over Clinician Responses to Questions in the Electronic Medical Record
    Perlis, Roy
    Collins, Nora
    JAMA-JOURNAL OF THE AMERICAN MEDICAL ASSOCIATION, 2025, 333 (09): : 738 - 740
  • [28] Evaluating ChatGPT-4 in medical education: an assessment of subject exam performance reveals limitations in clinical curriculum support for students
    Mackey B.P.
    Garabet R.
    Maule L.
    Tadesse A.
    Cross J.
    Weingarten M.
    Discover Artificial Intelligence, 2024, 4 (01):
  • [29] Utility and Comparative Performance of Current Artificial Intelligence Large Language Models as Postoperative Medical Support Chatbots in Aesthetic Surgery
    Abi-Rafeh, Jad
    Henry, Nader
    Xu, Hong Hao
    Bassiri-Tehrani, Brian
    Arezki, Adel
    Kazan, Roy
    Gilardino, Mirko S.
    Nahai, Foad
    AESTHETIC SURGERY JOURNAL, 2024, 44 (08) : 889 - 896
  • [30] Comparative performance analysis of ChatGPT 3.5, ChatGPT 4.0 and Bard in answering common patient questions on melanoma
    Deliyannis, Eduardo Panaiotis
    Paul, Navreet
    Patel, Priya U.
    Papanikolaou, Marieta
    CLINICAL AND EXPERIMENTAL DERMATOLOGY, 2024, 49 (07) : 743 - 746