Benchmarking open-source large language models on Portuguese Revalida multiple-choice questions

Authors
Severino, Joao Victor Bruneti [1 ,2 ]
de Paula, Pedro Angelo Basei
Berger, Matheus Nespolo [1 ]
Loures, Filipe Silveira [3 ]
Todeschini, Solano Amadori [3 ]
Roeder, Eduardo Augusto [1 ,3 ]
Veiga, Maria Han [4 ]
Guedes, Murilo [2 ]
Marques, Gustavo Lenci [1 ,2 ,3 ]
Affiliations
[1] Univ Fed Parana, Curitiba, Brazil
[2] Pontificia Univ Catolica Parana, Curitiba, Brazil
[3] Voa Hlth, Belo Horizonte, Brazil
[4] Ohio State Univ, Math, Columbus, OH USA
Keywords
Artificial intelligence; Health Equity; Machine Learning; Medical Informatics Applications; Universal Health Care
DOI
10.1136/bmjhci-2024-101195
Chinese Library Classification
R19 [Health care organizations and services (health services administration)]
Abstract
Objective: The study aimed to evaluate leading large language models (LLMs) on validated medical knowledge tests in Portuguese.
Methods: This study compared 31 LLMs on the Brazilian national medical examination (Revalida). It compared the performance of 23 open-source and 8 proprietary models across 399 multiple-choice questions.
Results: Among the smaller models, Llama 3 8B had the highest success rate, 53.9%, while the medium-sized model Mixtral 8x7B reached 63.7%. The larger Llama 3 70B reached 77.5%. Among the proprietary models, GPT-4o and Claude Opus were the most accurate, scoring 86.8% and 83.8%, respectively.
Conclusions: Ten of the 31 LLMs exceeded human-level performance on the Revalida benchmark, while 9 failed to provide coherent answers to the task. Larger models performed better overall. However, certain medium-sized LLMs outperformed some of the larger LLMs.
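The abstract reports success rates over multiple-choice questions and notes that some models failed to give coherent answers. The Python sketch below shows one way such a rate could be computed. It is not the authors' scoring procedure, which the abstract does not describe. The option labels A to E, the reply formats accepted, and the helper names `extract_choice` and `success_rate` are all assumptions made for illustration.

```python
import re
from typing import Optional, Sequence


def extract_choice(response: str) -> Optional[str]:
    """Return the option letter if the reply is a bare choice such as
    "B", "(C)" or "D.", otherwise None.

    Assumption: options are labelled A-E and the model was asked to
    reply with the letter only. Any other reply counts as incoherent.
    """
    match = re.fullmatch(r"\(?([A-E])\)?[.:]?", response.strip())
    return match.group(1) if match else None


def success_rate(responses: Sequence[str], answer_key: Sequence[str]) -> float:
    """Percentage of questions answered correctly.

    Incoherent replies (no recognizable letter) are scored as wrong.
    """
    if len(responses) != len(answer_key):
        raise ValueError("responses and answer_key must have the same length")
    if not answer_key:
        raise ValueError("answer_key must not be empty")
    correct = sum(
        extract_choice(reply) == key
        for reply, key in zip(responses, answer_key)
    )
    return 100.0 * correct / len(answer_key)


# Toy data, not taken from the study.
toy_responses = ["B", "(C)", "I am not sure", "A"]
toy_key = ["B", "C", "D", "E"]
toy_rate = success_rate(toy_responses, toy_key)
```

With 399 questions, each correct answer moves the success rate by 100/399, roughly 0.25 percentage points, so small gaps between models correspond to only a few questions.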
Pages: 4