AI chatbots show promise but limitations on UK medical exam questions: a comparative performance study

Cited by: 2
Authors
Sadeq, Mohammed Ahmed [1,2,13]
Ghorab, Reem Mohamed Farouk [1,2,13]
Ashry, Mohamed Hady [2,3]
Abozaid, Ahmed Mohamed [2,4]
Banihani, Haneen A. [2,5]
Salem, Moustafa [2,6]
Aisheh, Mohammed Tawfiq Abu [2,7]
Abuzahra, Saad [2,7]
Mourid, Marina Ramzy [2,8]
Assker, Mohamad Monif [2,9]
Ayyad, Mohammed [2,10]
Moawad, Mostafa Hossam El Din [2,11,12]
Affiliations
[1] Misr Univ Sci & Technol, 6th Of October City, Egypt
[2] Med Res Platform MRP, Giza, Egypt
[3] New Giza Univ NGU, Sch Med, Giza, Egypt
[4] Tanta Univ, Fac Med, Tanta, Egypt
[5] Univ Jordan, Fac Med, Amman, Jordan
[6] Mansoura Univ, Fac Med, Mansoura, Egypt
[7] Annajah Natl Univ, Coll Med & Hlth Sci, Dept Med, Nablus 44839, Palestine
[8] Alexandria Univ, Fac Med, Alexandria, Egypt
[9] Sheikh Khalifa Med City, Abu Dhabi, U Arab Emirates
[10] Al Quds Univ, Fac Med, Jerusalem, Palestine
[11] Alexandria Univ, Fac Pharm, Dept Clin, Alexandria, Egypt
[12] Suez Canal Univ, Fac Med, Ismailia, Egypt
[13] Elsheikh Zayed Specialized Hosp, Emergency Med Dept, Elsheikh Zayed City, Egypt
Source
SCIENTIFIC REPORTS | 2024 / Vol. 14 / Issue 1
Keywords
ARTIFICIAL INTELLIGENCE
DOI
10.1038/s41598-024-68996-2
Chinese Library Classification (CLC)
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences]
Discipline classification codes
07; 0710; 09
Abstract
Large language models (LLMs) such as ChatGPT have potential applications in medical education, for example helping students prepare for their licensing exams by discussing unclear questions with them. However, they require evaluation on these complex tasks. The purpose of this study was to evaluate how well publicly accessible LLMs performed on simulated UK medical board exam questions. Seven LLMs (ChatGPT-3.5, ChatGPT-4, Bard, Perplexity, Claude, Bing, Claude Instant) answered 423 board-style questions drawn from nine UK exams (MRCS, MRCP, etc.). There were 406 multiple-choice, 13 true/false, and 4 "choose N" questions covering surgery, pediatrics, and other disciplines. The accuracy of each model's output was graded, and statistical tests were used to analyze differences among the LLMs; leaked questions were excluded from the primary analysis. ChatGPT-4 scored highest (78.2%), followed by Bing (67.2%), Claude (64.4%), and Claude Instant (62.9%); Perplexity scored lowest (56.1%). Scores differed significantly between LLMs overall (p < 0.001) and in pairwise comparisons. All LLMs scored higher on multiple-choice questions than on true/false or "choose N" questions. The LLMs demonstrated limitations in answering certain questions, indicating that refinements are needed before they can be relied on as a primary resource in medical education. However, their expanding capabilities suggest a potential to improve training if thoughtfully implemented. Further research should explore specialty-specific LLMs and their optimal integration into medical curricula.
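The abstract says only that "statistics were used" to compare the models. The sketch below illustrates one common approach to this kind of comparison, a chi-square test of independence on per-model correct/incorrect counts with pairwise follow-up tests; it is a minimal sketch, not the paper's actual analysis. The test choice and the pairwise procedure are assumptions, and the counts are illustrative values back-calculated from the reported percentages over 423 questions rather than the study's data.

# Minimal sketch (assumed analysis): comparing per-model accuracy with a
# chi-square test of independence. The abstract does not name the exact
# tests used; counts here are hypothetical, back-calculated from the
# reported percentages, not the study's data.
from itertools import combinations
from scipy.stats import chi2_contingency

# (correct, incorrect) counts per model out of 423 questions -- illustrative
results = {
    "ChatGPT-4":      (331, 92),   # ~78.2%
    "Bing":           (284, 139),  # ~67.2%
    "Claude":         (272, 151),  # ~64.4%
    "Claude Instant": (266, 157),  # ~62.9%
    "Perplexity":     (237, 186),  # ~56.1%
}

# Overall test: is accuracy independent of which model answered?
table = [list(counts) for counts in results.values()]
chi2, p, dof, _ = chi2_contingency(table)
print(f"overall: chi2={chi2:.2f}, dof={dof}, p={p:.3g}")

# Pairwise 2x2 comparisons; a multiple-comparison correction
# (e.g., Bonferroni) would normally be applied to these p-values.
for (name_a, a), (name_b, b) in combinations(results.items(), 2):
    chi2, p, _, _ = chi2_contingency([list(a), list(b)])
    print(f"{name_a} vs {name_b}: p={p:.3g}")

On these illustrative counts the overall test yields a p-value well below 0.001, consistent with the significance level the abstract reports.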
Pages: 11