Evaluating ChatGPT-4 Vision on Brazil's National Undergraduate Computer Science Exam

Cited by: 4
Authors
Mendonca, Nabor C. [1]
Affiliations
[1] Univ Fortaleza, Postgrad Program Appl Informat, Av Washington Soares, Fortaleza, Ceara, Brazil
Source
Keywords
Multimodal generative AI; ChatGPT-4; vision; educational assessment; computer science education
DOI
10.1145/3674149
Chinese Library Classification
G40 [Education]
Subject Classification Codes
040101; 120403
Abstract
The recent integration of visual capabilities into Large Language Models (LLMs) has the potential to play a pivotal role in science and technology education, where visual elements such as diagrams, charts, and tables are commonly used to improve the learning experience. This study investigates the performance of ChatGPT-4 Vision, OpenAI's most advanced visual model at the time the study was conducted, on the Bachelor in Computer Science section of Brazil's 2021 National Undergraduate Exam (ENADE). By presenting the model with the exam's open and multiple-choice questions in their original image format and allowing for reassessment in response to differing answer keys, we were able to evaluate the model's reasoning and self-reflecting capabilities in a large-scale academic assessment involving textual and visual content. ChatGPT-4 Vision significantly outperformed the average exam participant, placing itself within the top 10 percent of scores. While it excelled in questions that incorporated visual elements, it also encountered challenges with question interpretation, logical reasoning, and visual acuity. A positive correlation between the model's performance in multiple-choice questions and the performance distribution of the human participants suggests that multimodal LLMs can serve as a useful tool for question testing and refinement. However, the involvement of an independent expert panel to review cases of disagreement between the model and the answer key revealed some poorly constructed questions containing vague or ambiguous statements, calling attention to the critical need for improved question design in future exams. Our findings suggest that while ChatGPT-4 Vision shows promise in multimodal academic evaluations, human oversight remains crucial for verifying the model's accuracy and ensuring the fairness of high-stakes educational exams. The paper's research materials are publicly available at https://github.com/nabormendonca/gpt-4v-enade-cs-2021.
Pages: 56
Related Papers
50 items total
  • [31] Performance of ChatGPT-4 in answering questions from the Brazilian National Examination for Medical Degree Revalidation
    Wiwanitkit, Somsri
    Wiwanitkit, Viroj
    REVISTA DA ASSOCIACAO MEDICA BRASILEIRA, 2024, 70 (03):
  • [32] Evaluating Artificial Intelligence Efficacy: A Comparative Study between ChatGPT-4's Treatment Recommendations and Orthopaedic Clinical Practice Guidelines
    Dagher, Tanios
    Dwyer, Emma
    Baker, Hayden P.
    Kalidoss, Senthooran
    Strelzow, Jason
    JOURNAL OF THE AMERICAN COLLEGE OF SURGEONS, 2024, 239 (05) : S325 - S326
  • [33] Performance of ChatGPT-4 in answering questions from the Brazilian National Examination for Medical Degree Revalidation
    Gobira, Mauro
    Nakayama, Luis Filipe
    Moreira, Rodrigo
    Andrade, Eric
    Regatieri, Caio Vinicius Saito
    Belfort Jr, Rubens
    REVISTA DA ASSOCIACAO MEDICA BRASILEIRA, 2023, 69 (10):
  • [34] ChatGPT in the Classroom: An Analysis of Its Strengths and Weaknesses for Solving Undergraduate Computer Science Questions
    Joshi, Ishika
    Budhiraja, Ritvik
    Dev, Harshal
    Kadia, Jahnvi
    Ataullah, Mohammad Osama
    Mitra, Sayan
    Akolekar, Harshal D.
    Kumar, Dhruv
    PROCEEDINGS OF THE 55TH ACM TECHNICAL SYMPOSIUM ON COMPUTER SCIENCE EDUCATION, SIGCSE 2024, VOL. 1, 2024, : 625 - 631
  • [35] EDUCATIONAL EVALUATION WITH LARGE LANGUAGE MODELS (LLMS): CHATGPT-4 IN RECALLING AND EVALUATING STUDENTS' WRITTEN RESPONSES
    Jauhiainen, Jussi S.
    Garagorry Guerra, Agustin Bernardo
    JOURNAL OF INFORMATION TECHNOLOGY EDUCATION-INNOVATIONS IN PRACTICE, 2025, 24
  • [36] Letter to the editor on: "AI versus MD: Evaluating the surgical decision-making accuracy of ChatGPT-4"
    Daungsupawong, Hinpetch
    Wiwanitkit, Viroj
    SURGERY, 2024, 176 (06) : 1782 - 1782
  • [37] Performance of ChatGPT and GPT-4 on Polish National Specialty Exam (NSE) in Ophthalmology
    Ciekalski, Marcin
    Laskowski, Maciej
    Koperczak, Agnieszka
    Smierciak, Maria
    Sirek, Sebastian
    POSTEPY HIGIENY I MEDYCYNY DOSWIADCZALNEJ, 2024, 78 (01): : 111 - 116
  • [38] Evaluating AI Capabilities in Bariatric Surgery: A Study on ChatGPT-4 and DALL·E 3's Recognition and Illustration Accuracy
    Mahjoubi, Mohammad
    Shahabi, Shahab
    Sheikhbahaei, Saba
    Jazi, Amir Hossein Davarpanah
    OBESITY SURGERY, 2025, 35 (02) : 638 - 641
  • [39] Impact of Attached File Formats on the Performance of ChatGPT-4 on the Japanese National Nursing Examination: Evaluation Study
    Taira, Kazuya
    Itaya, Takahiro
    Yada, Shuntaro
    Hiyama, Kirara
    Hanada, Ayame
    JMIR NURSING, 2025, 8
  • [40] Turing's Vision: The Birth of Computer Science
    Nichols, Tiffany
    BRITISH JOURNAL FOR THE HISTORY OF SCIENCE, 2017, 50 (02): : 366 - 368