ChatGPT (GPT-4) versus doctors on complex cases of the Swedish family medicine specialist examination: an observational comparative study

Cited by: 1
Authors
Arvidsson, Rasmus [1 ,2 ]
Gunnarsson, Ronny [1 ,3 ]
Entezarjou, Artin [1 ]
Sundemo, David [1 ,4 ]
Wikberg, Carl [1 ,5 ]
Affiliations
[1] Univ Gothenburg, Sahlgrenska Acad, Sch Publ Hlth & Community Med, Gen Practice, Family Med, Inst Med, Gothenburg, Sweden
[2] Praktikertjanst AB, Halsocentralen Sankt Hans, Lund, Sweden
[3] Reg Vastra Gotaland, Narhalsan, Vardcentralen Hemlosa, Gothenburg, Sweden
[4] Lerum Primary Healthcare Ctr, Narhalsan, Lerum, Sweden
[5] Reg Vastra Gotaland, Primary Hlth Care, Res Educ Dev & Innovat, Gothenburg, Sweden
Source
BMJ OPEN | 2024, Vol. 14, Issue 12
Keywords
Artificial Intelligence; Primary Health Care; Health Informatics
DOI
10.1136/bmjopen-2024-086148
Chinese Library Classification
R5 [Internal Medicine]
Subject classification codes
1002; 100201
Abstract
Background Recent breakthroughs in artificial intelligence research include the development of generative pretrained transformers (GPT). ChatGPT has been shown to perform well when answering several sets of medical multiple-choice questions. However, it has not been tested for writing free-text assessments of complex cases in primary care.
Objectives To compare the performance of ChatGPT, version GPT-4, with that of real doctors.
Design and setting A blinded observational comparative study conducted in the Swedish primary care setting. Responses from GPT-4 and real doctors to cases from the Swedish family medicine specialist examination were scored by blinded reviewers, and the scores were compared.
Participants Anonymous responses from the Swedish family medicine specialist examination 2017-2022 were used.
Outcome measures Primary: the mean difference in scores between GPT-4's responses and randomly selected responses by human doctors, as well as between GPT-4's responses and top-tier responses by human doctors. Secondary: the correlation between differences in response length and response score; the intraclass correlation coefficient between reviewers; and the percentage of maximum score achieved by each group in different subject categories.
Results The mean scores were 6.0, 7.2 and 4.5 for randomly selected doctor responses, top-tier doctor responses and GPT-4 responses, respectively, on a 10-point scale. The scores for the random doctor responses were, on average, 1.6 points higher than those of GPT-4 (p<0.001, 95% CI 0.9 to 2.2), and the top-tier doctor scores were, on average, 2.7 points higher than those of GPT-4 (p<0.001, 95% CI 2.2 to 3.3). Following the release of GPT-4o, the experiment was repeated, although this time with only a single reviewer scoring the answers. In this follow-up, random doctor responses were scored 0.7 points higher than those of GPT-4o (p=0.044).
Conclusion In complex primary care cases, GPT-4 performs worse than human doctors taking the family medicine specialist examination. Future GPT-based chatbots may perform better, but comprehensive evaluations are needed before implementing chatbots for medical decision support in primary care.
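The primary and secondary outcome measures described above are standard paired statistics. The following is a minimal illustrative sketch, not taken from the paper, of how a per-case mean score difference with a bootstrap 95% CI and a response length versus score correlation could be computed in Python. The toy data, the paired t-test, the percentile bootstrap and the Spearman correlation are all assumptions for illustration; the authors' actual statistical methods may differ.

    # Illustrative sketch (not the authors' code): per-case score differences
    # between doctor and GPT-4 responses, a paired t-test, a bootstrap 95% CI,
    # and a length-score correlation. All data below are hypothetical.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    # Hypothetical per-case scores on the 10-point scale (one entry per exam case).
    doctor_scores = np.array([6.5, 5.0, 7.0, 6.0, 5.5, 6.5, 7.5, 5.0])
    gpt4_scores = np.array([4.0, 4.5, 5.5, 4.0, 3.5, 5.0, 6.0, 3.5])

    # Paired comparison: each GPT-4 response is matched to doctor responses
    # for the same examination case.
    diff = doctor_scores - gpt4_scores
    t_stat, p_value = stats.ttest_rel(doctor_scores, gpt4_scores)

    # Percentile bootstrap CI for the mean difference.
    boot_means = np.array([
        rng.choice(diff, size=diff.size, replace=True).mean()
        for _ in range(10_000)
    ])
    ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])

    print(f"mean difference: {diff.mean():.1f} points")
    print(f"paired t-test p-value: {p_value:.3f}")
    print(f"bootstrap 95% CI: {ci_low:.1f} to {ci_high:.1f}")

    # Secondary outcome sketch: correlation between response length and score.
    lengths = np.array([320, 280, 410, 300, 260, 350, 430, 250])  # word counts, hypothetical
    rho, p_len = stats.spearmanr(lengths, doctor_scores)
    print(f"Spearman rho (length vs score): {rho:.2f} (p={p_len:.3f})")

A paired analysis is chosen here because both groups answer the same set of examination cases; an unpaired comparison would ignore case difficulty, which varies across the examination.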
Pages: 6