Evaluating ChatGPT-4 Vision on Brazil's National Undergraduate Computer Science Exam

Cited by: 4
Authors
Mendonca, Nabor C. [1]
Institutions
[1] Univ Fortaleza, Postgrad Program Appl Informat, Av Washington Soares, Fortaleza, Ceara, Brazil
Keywords
Multimodal generative AI; ChatGPT-4 Vision; educational assessment; computer science education
DOI
10.1145/3674149
Chinese Library Classification
G40 [Education]
Subject Classification Codes
040101; 120403
Abstract
The recent integration of visual capabilities into Large Language Models (LLMs) has the potential to play a pivotal role in science and technology education, where visual elements such as diagrams, charts, and tables are commonly used to improve the learning experience. This study investigates the performance of ChatGPT-4 Vision, OpenAI's most advanced visual model at the time the study was conducted, on the Bachelor in Computer Science section of Brazil's 2021 National Undergraduate Exam (ENADE). By presenting the model with the exam's open and multiple-choice questions in their original image format and allowing for reassessment in response to differing answer keys, we were able to evaluate the model's reasoning and self-reflecting capabilities in a large-scale academic assessment involving textual and visual content. ChatGPT-4 Vision significantly outperformed the average exam participant, positioning itself within the top 10 best score percentile. While it excelled in questions that incorporated visual elements, it also encountered challenges with question interpretation, logical reasoning, and visual acuity. A positive correlation between the model's performance on multiple-choice questions and the performance distribution of the human participants suggests that multimodal LLMs can provide a useful tool for question testing and refinement. However, the involvement of an independent expert panel to review cases of disagreement between the model and the answer key revealed some poorly constructed questions containing vague or ambiguous statements, calling attention to the critical need for improved question design in future exams. Our findings suggest that while ChatGPT-4 Vision shows promise in multimodal academic evaluations, human oversight remains crucial for verifying the model's accuracy and ensuring the fairness of high-stakes educational exams. The paper's research materials are publicly available at https://github.com/nabormendonca/gpt-4v-enade-cs-2021.
Pages: 56
Related Papers
(50 in total; first 10 listed)
  • [1] Evaluating the impact of ChatGPT-4 on medical abstracts
    Gravel, Jocelyn
    Dion, Chloe
    Kermani, Mandana Fadaei
    Mousseau, Sarah
    Osmanlliu, Esli
    PAEDIATRICS & CHILD HEALTH, 2024, 29 : e45 - e46
  • [2] Evaluating ChatGPT-4's Diagnostic Accuracy: Impact of Visual Data Integration
    Hirosawa, Takanobu
    Harada, Yukinori
    Tokumasu, Kazuki
    Ito, Takahiro
    Suzuki, Tomoharu
    Shimizu, Taro
    JMIR MEDICAL INFORMATICS, 2024, 12
  • [3] Evaluating the accuracy of ChatGPT-4 in predicting ASA scores: A prospective multicentric study ChatGPT-4 in ASA score prediction
    Turan, Engin Ihsan
    Baydemir, Abdurrahman Engin
    Ozcan, Funda Gumus
    Sahin, Ayca Sultan
    JOURNAL OF CLINICAL ANESTHESIA, 2024, 96
  • [4] Evaluating ChatGPT-4's performance as a digital health advisor for otosclerosis surgery
    Sahin, Samil
    Erkmen, Burak
    Duymaz, Yasar Kemal
    Bayram, Furkan
    Tekin, Ahmet Mahmut
    Topsakal, Vedat
    FRONTIERS IN SURGERY, 2024, 11
  • [5] A Comparative Analysis of ChatGPT, ChatGPT-4, and Google Bard Performances at the Advanced Burn Life Support Exam
    Alessandri-Bonetti, Mario
    Liu, Hilary Y.
    Donovan, James M.
    Ziembicki, Jenny A.
    Egro, Francesco M.
    JOURNAL OF BURN CARE & RESEARCH, 2024, 45 (04): : 945 - 948
  • [6] Evaluating ChatGPT-4 in medical education: an assessment of subject exam performance reveals limitations in clinical curriculum support for students
    Mackey B.P.
    Garabet R.
    Maule L.
    Tadesse A.
    Cross J.
    Weingarten M.
DISCOVER ARTIFICIAL INTELLIGENCE, 2024, 4 (01)
  • [7] Revolutionizing Diagnostics: Evaluating ChatGPT-4's Performance in Ulcerative Colitis Endoscopic Assessment
    Levartovsky, A.
    Albshesh, A.
    Grinman, A.
    Shachar, E.
    Lahat, A.
    Eliakim, R.
    Kopylov, U.
    JOURNAL OF CROHNS & COLITIS, 2025, 19 : I748 - I748
  • [8] Evaluating ChatGPT-4's historical accuracy: a case study on the origins of SWOT analysis
    Puyt, Richard W.
Madsen, Dag Øivind
    FRONTIERS IN ARTIFICIAL INTELLIGENCE, 2024, 7
  • [9] Ensuring Consistency and Accuracy in Evaluating ChatGPT-4 for Clinical Recommendations
    Zhu, Lingxuan
    Mou, Weiming
    Luo, Peng
    CLINICAL GASTROENTEROLOGY AND HEPATOLOGY, 2025, 23 (01) : 189 - 190
  • [10] Letter to the editor, "Evaluating the accuracy of ChatGPT-4 in predicting ASA scores: A prospective multicentric study ChatGPT-4 in ASA score prediction"
    Zhang, Chenghong
    Chen, Xinzhong
    JOURNAL OF CLINICAL ANESTHESIA, 2024, 98