Vision-BioLLM: Large vision language model for visual dialogue in biomedical imagery

Cited by: 0
Authors
Alshibli, Ahmad [1 ]
Bazi, Yakoub [2 ]
Rahhal, Mohamad Mahmoud Al [3 ]
Zuair, Mansour [2 ]
Affiliations
[1] King Saud Univ, Coll Comp & Informat Sci, Comp Sci Dept, Riyadh 11543, Saudi Arabia
[2] King Saud Univ, Coll Comp & Informat Sci, Comp Engn Dept, Riyadh 11543, Saudi Arabia
[3] King Saud Univ, Coll Appl Comp Sci, Appl Comp Sci Dept, Riyadh 11543, Saudi Arabia
Keywords
Large vision language model; Biomedical images; Transformers; Visual question answering; Captioning
DOI
10.1016/j.bspc.2024.107437
Chinese Library Classification
R318 (Biomedical Engineering)
Subject Classification Code
0831
Abstract
In this paper, we present a vision-language model tailored for visual dialogue in the biomedical domain, using a LanguageBind transformer as the vision encoder and Llama3-OpenBioLLM as the language decoder. Our training approach involves three stages: alignment, instruction-tuning, and task-specific fine-tuning. The alignment phase synchronizes outputs from the vision encoder with inputs to the decoder using a multi-layer perceptron (MLP). In the instruction-tuning phase, we enhance language comprehension through low-rank adaptation (LoRA) with a mixed dataset of general and biomedical images. We also enrich three biomedical datasets by converting their visual question answering data into dialogue contexts and adding concise summaries of the dialogues. Experimental results demonstrate the model's effectiveness against state-of-the-art methods, showcasing its potential to enhance biomedical visual dialogue. Code and models are available at: http://github.com/BigData-KSU/Vision-BioLLM-KSU.
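The abstract outlines a LLaVA-style recipe: a LanguageBind vision encoder whose token features are mapped into the decoder's embedding space by an MLP (trained in the alignment stage), and LoRA adapters applied to the language decoder during instruction-tuning and fine-tuning. The sketch below illustrates these two trainable pieces in PyTorch with the Hugging Face peft library; the feature widths (1024 for the vision encoder, 4096 for the decoder), the projector depth, and the LoRA hyperparameters are illustrative assumptions, not values reported in the paper.

```python
# Minimal sketch of the two trainable components described in the abstract.
# Assumes PyTorch and the Hugging Face `peft` package; all dimensions and
# hyperparameters below are placeholders, not the paper's configuration.
import torch
import torch.nn as nn
from peft import LoraConfig  # adapters for the instruction-tuning stage

VISION_DIM = 1024   # assumed LanguageBind feature width
LLM_DIM = 4096      # assumed Llama3-OpenBioLLM hidden size


class VisionProjector(nn.Module):
    """MLP that maps vision-encoder token features into the LLM embedding
    space; in the described pipeline, the alignment stage trains this module."""

    def __init__(self, vision_dim: int = VISION_DIM, llm_dim: int = LLM_DIM):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
        # vision_tokens: (batch, num_patches, vision_dim)
        return self.mlp(vision_tokens)  # -> (batch, num_patches, llm_dim)


def build_multimodal_inputs(projected: torch.Tensor,
                            text_embeds: torch.Tensor) -> torch.Tensor:
    """Prepend projected visual tokens to the text token embeddings so the
    decoder attends over both (a common LLaVA-style choice, assumed here)."""
    return torch.cat([projected, text_embeds], dim=1)


# LoRA configuration for adapting the decoder; rank, alpha, and target modules
# are illustrative.  It would be applied with peft.get_peft_model(llm, lora_cfg).
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

if __name__ == "__main__":
    projector = VisionProjector()
    fake_vision = torch.randn(2, 256, VISION_DIM)   # e.g. 256 image patches
    fake_text = torch.randn(2, 32, LLM_DIM)         # 32 text token embeddings
    inputs = build_multimodal_inputs(projector(fake_vision), fake_text)
    print(inputs.shape)  # torch.Size([2, 288, 4096])
```

In recipes of this kind, only the projector and the LoRA weights are updated while the vision encoder and base decoder stay frozen, which keeps training lightweight; whether the paper freezes the encoder in every stage is not stated in the abstract.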
Pages: 13