Performance of ChatGPT-4 in answering questions from the Brazilian National Examination for Medical Degree Revalidation

Cited: 16
Authors
Gobira, Mauro [1 ]
Nakayama, Luis Filipe [2 ,3 ]
Moreira, Rodrigo [1 ]
Andrade, Eric [2 ]
Regatieri, Caio Vinicius Saito [2 ]
Belfort Jr, Rubens [2 ]
Affiliations
[1] Vis Inst, Inst Paulista Estudos & Pesquisas Oftalmol, Sao Paulo, SP, Brazil
[2] Univ Fed Sao Paulo, Dept Ophthalmol, Sao Paulo, SP, Brazil
[3] MIT, Inst Med Engn & Sci, Cambridge, MA 02142 USA
Keywords
Artificial intelligence; Education; Natural language processing
DOI
10.1590/1806-9282.20230848
Chinese Library Classification
R5 [Internal Medicine]
Subject Classification Code
1002; 100201
Abstract
OBJECTIVE: The aim of this study was to evaluate the performance of ChatGPT-4.0 in answering the 2022 Brazilian National Examination for Medical Degree Revalidation (Revalida) and to use it as a tool for feedback on the quality of the examination.
METHODS: Two independent physicians entered all examination questions into ChatGPT-4.0. After comparing the outputs with the official test solutions, they classified each large language model answer as adequate, inadequate, or indeterminate; disagreements were adjudicated until a consensus was reached on ChatGPT's accuracy. Performance across medical themes and on nullified questions was compared using chi-square analysis.
RESULTS: ChatGPT-4.0 answered 71 (87.7%) of the Revalida questions correctly and 10 (12.3%) incorrectly. The proportion of correct answers did not differ significantly across medical themes (p=0.4886). The model's accuracy was lower (71.4%) on nullified questions, with no statistically significant difference between nullified and non-nullified groups (p=0.241).
CONCLUSION: ChatGPT-4.0 showed satisfactory performance on the 2022 Brazilian National Examination for Medical Degree Revalidation. The large language model performed worse on subjective questions and on public healthcare themes. These results suggest that the overall quality of the Revalida examination questions is satisfactory and corroborate the decision to nullify the annulled questions.
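As a rough illustration of the kind of chi-square comparison the abstract describes, the sketch below contrasts accuracy on nullified versus non-nullified questions. The cell counts are assumptions inferred from the reported percentages (71/81 correct overall; 71.4% on nullified items, consistent with 5 of 7), and scipy.stats.chi2_contingency is used as a stand-in; the authors' exact procedure and software are not stated in the abstract, so the resulting p-value may differ from the reported 0.241.

    # Hedged sketch (not the authors' code): chi-square test of ChatGPT-4.0
    # accuracy on nullified vs. non-nullified Revalida questions.
    # Counts are inferred from the abstract's percentages and are assumptions.
    from scipy.stats import chi2_contingency

    #                correct  incorrect
    non_nullified = [66, 8]   # 74 questions (inferred)
    nullified     = [5, 2]    # 7 questions, 71.4% accuracy (reported)

    table = [non_nullified, nullified]
    chi2, p, dof, expected = chi2_contingency(table)
    print(f"chi2={chi2:.3f}, p={p:.3f}, dof={dof}")
    # With expected counts this small, Fisher's exact test may be preferable;
    # a different test choice could explain the abstract's p=0.241.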
Pages: 5