Realizing Visual Question Answering for Education: GPT-4V as a Multimodal AI

Cited by: 0
Authors
Lee, Gyeonggeon [1 ,2 ]
Zhai, Xiaoming [2 ,3 ,4 ]
Affiliations
[1] Natl Inst Educ, Nat Sci & Sci Educ Dept, Nat Sci & Sci Educ, 1 Nanyang Walk, Singapore 637616, Singapore
[2] Univ Georgia, AI4STEM Educ Ctr, 110 Carlton St, Athens, GA 30602 USA
[3] Univ Georgia, Natl GENIUS Ctr, 110 Carlton St, Athens, GA 30602 USA
[4] Univ Georgia, Dept Math Sci & Social Studies Educ, 110 Carlton St, Athens, GA 30602 USA
Funding
U.S. National Science Foundation;
Keywords
Artificial intelligence (AI); GPT-4V(ision); Visual question answering; Vision language model; Multimodality;
DOI
10.1007/s11528-024-01035-z
Chinese Library Classification
G40 [Education];
Discipline Codes
040101; 120403;
Abstract
Educators and researchers have analyzed various image data acquired from teaching and learning, such as images of learning materials, classroom dynamics, and students' drawings. However, this approach is labour-intensive and time-consuming, limiting its scalability and efficiency. Recent developments in Visual Question Answering (VQA) have streamlined this process by allowing users to pose questions about images and receive accurate, automatic answers, both in natural language, thereby enhancing efficiency and reducing the time required for analysis. State-of-the-art Vision Language Models (VLMs) such as GPT-4V(ision) have extended the applications of VQA to a wide range of educational purposes. This report employs GPT-4V as an example to demonstrate the potential of VLMs in enabling and advancing VQA for education. Specifically, we demonstrate that GPT-4V enables VQA for educational scholars without requiring technical expertise, thereby reducing accessibility barriers for general users. In addition, we contend that GPT-4V spotlights the transformative potential of VQA for educational research, representing a milestone accomplishment for visual data analysis in education.
Pages: 271-287 (17 pages)