Evaluating ChatGPT-4 Vision on Brazil's National Undergraduate Computer Science Exam

Cited by: 4
Authors
Mendonca, Nabor C. [1]
Affiliations
[1] Univ Fortaleza, Postgrad Program Appl Informat, Av Washington Soares, Fortaleza, Ceara, Brazil
Source
Keywords
Multimodal generative AI; ChatGPT-4; vision; educational assessment; computer science education
DOI
10.1145/3674149
Chinese Library Classification
G40 [Education]
Subject Classification Codes
040101; 120403
Abstract
The recent integration of visual capabilities into Large Language Models (LLMs) has the potential to play a pivotal role in science and technology education, where visual elements such as diagrams, charts, and tables are commonly used to improve the learning experience. This study investigates the performance of ChatGPT-4 Vision, OpenAI's most advanced visual model at the time the study was conducted, on the Bachelor in Computer Science section of Brazil's 2021 National Undergraduate Exam (ENADE). By presenting the model with the exam's open and multiple-choice questions in their original image format and allowing for reassessment in response to differing answer keys, we were able to evaluate the model's reasoning and self-reflecting capabilities in a large-scale academic assessment involving textual and visual content. ChatGPT-4 Vision significantly outperformed the average exam participant, placing itself within the top 10 percent of scores. While it excelled in questions that incorporated visual elements, it also encountered challenges with question interpretation, logical reasoning, and visual acuity. A positive correlation between the model's performance in multiple-choice questions and the performance distribution of the human participants suggests that multimodal LLMs can serve as a useful tool for question testing and refinement. However, the involvement of an independent expert panel to review cases of disagreement between the model and the answer key revealed some poorly constructed questions containing vague or ambiguous statements, calling attention to the critical need for improved question design in future exams. Our findings suggest that while ChatGPT-4 Vision shows promise in multimodal academic evaluations, human oversight remains crucial for verifying the model's accuracy and ensuring the fairness of high-stakes educational exams. The paper's research materials are publicly available at https://github.com/nabormendonca/gpt-4v-enade-cs-2021.
Pages: 56
Related Papers
50 items total
  • [31] Performance of ChatGPT-4 in answering questions from the Brazilian National Examination for Medical Degree Revalidation
    Wiwanitkit, Somsri
    Wiwanitkit, Viroj
    REVISTA DA ASSOCIACAO MEDICA BRASILEIRA, 2024, 70 (03):
  • [32] Evaluating Artificial Intelligence Efficacy: A Comparative Study between ChatGPT-4's Treatment Recommendations and Orthopaedic Clinical Practice Guidelines
    Dagher, Tanios
    Dwyer, Emma
    Baker, Hayden P.
    Kalidoss, Senthooran
    Strelzow, Jason
    JOURNAL OF THE AMERICAN COLLEGE OF SURGEONS, 2024, 239 (05) : S325 - S326
  • [33] Performance of ChatGPT-4 in answering questions from the Brazilian National Examination for Medical Degree Revalidation
    Gobira, Mauro
    Nakayama, Luis Filipe
    Moreira, Rodrigo
    Andrade, Eric
    Regatieri, Caio Vinicius Saito
    Belfort Jr, Rubens
    REVISTA DA ASSOCIACAO MEDICA BRASILEIRA, 2023, 69 (10):
  • [34] ChatGPT in the Classroom: An Analysis of Its Strengths and Weaknesses for Solving Undergraduate Computer Science Questions
    Joshi, Ishika
    Budhiraja, Ritvik
    Dev, Harshal
    Kadia, Jahnvi
    Ataullah, Mohammad Osama
    Mitra, Sayan
    Akolekar, Harshal D.
    Kumar, Dhruv
    PROCEEDINGS OF THE 55TH ACM TECHNICAL SYMPOSIUM ON COMPUTER SCIENCE EDUCATION, SIGCSE 2024, VOL. 1, 2024, : 625 - 631
  • [35] EDUCATIONAL EVALUATION WITH LARGE LANGUAGE MODELS (LLMS): CHATGPT-4 IN RECALLING AND EVALUATING STUDENTS' WRITTEN RESPONSES
    Jauhiainen, Jussi S.
    Garagorry Guerra, Agustin Bernardo
    JOURNAL OF INFORMATION TECHNOLOGY EDUCATION-INNOVATIONS IN PRACTICE, 2025, 24
  • [36] Letter to the editor on: "AI versus MD: Evaluating the surgical decision-making accuracy of ChatGPT-4"
    Daungsupawong, Hinpetch
    Wiwanitkit, Viroj
    SURGERY, 2024, 176 (06) : 1782 - 1782
  • [37] Performance of ChatGPT and GPT-4 on Polish National Specialty Exam (NSE) in Ophthalmology
    Ciekalski, Marcin
    Laskowski, Maciej
    Koperczak, Agnieszka
    Smierciak, Maria
    Sirek, Sebastian
    POSTEPY HIGIENY I MEDYCYNY DOSWIADCZALNEJ, 2024, 78 (01): : 111 - 116
  • [38] Evaluating AI Capabilities in Bariatric Surgery: A Study on ChatGPT-4 and DALL·E 3's Recognition and Illustration Accuracy
    Mahjoubi, Mohammad
    Shahabi, Shahab
    Sheikhbahaei, Saba
    Jazi, Amir Hossein Davarpanah
    OBESITY SURGERY, 2025, 35 (02) : 638 - 641
  • [39] Impact of Attached File Formats on the Performance of ChatGPT-4 on the Japanese National Nursing Examination: Evaluation Study
    Taira, Kazuya
    Itaya, Takahiro
    Yada, Shuntaro
    Hiyama, Kirara
    Hanada, Ayame
    JMIR NURSING, 2025, 8
  • [40] Turing's Vision: The Birth of Computer Science
    Nichols, Tiffany
    BRITISH JOURNAL FOR THE HISTORY OF SCIENCE, 2017, 50 (02): : 366 - 368