Vision-BioLLM: Large vision language model for visual dialogue in biomedical imagery

Cited by: 0

Authors:

Alshibli, Ahmad [1]

Bazi, Yakoub [2]

Rahhal, Mohamad Mahmoud Al [3]

Zuair, Mansour [2]

Affiliations:

[1] King Saud Univ, Coll Comp & Informat Sci, Comp Sci Dept, Riyadh 11543, Saudi Arabia

[2] King Saud Univ, Coll Comp & Informat Sci, Comp Engn Dept, Riyadh 11543, Saudi Arabia

[3] King Saud Univ, Coll Appl Comp Sci, Appl Comp Sci Dept, Riyadh 11543, Saudi Arabia

Source:

BIOMEDICAL SIGNAL PROCESSING AND CONTROL | 2025 / Volume 103

Keywords:

Large vision language model; Biomedical images; Transformers; Visual question answering; Captioning;

DOI:

10.1016/j.bspc.2024.107437

Chinese Library Classification (CLC):

R318 [Biomedical Engineering];

Discipline classification code:

0831;

Abstract:

In this paper, we present a vision-language model tailored for visual dialogue in the biomedical domain, utilizing a LanguageBind transformer as the vision encoder and Llama3-OpenBioLLM as the language decoder. Our training approach involves three stages: alignment, instruction-tuning, and task-specific fine-tuning. The alignment phase synchronizes outputs from the vision encoder with inputs to the decoder using a multi-layer perceptron (MLP). In the instruction-tuning phase, we enhance language comprehension through low-rank adaptation (LoRA) with a mixed dataset of general and biomedical images. We also improve three biomedical datasets by transforming visual question datasets into dialogue contexts and adding concise summaries of dialogues. Experimental results demonstrate the model's effectiveness against state-of-the-art methods, showcasing its potential to enhance biomedical visual dialogue. Code and models are available at: http://github.com/BigData-KSU/Vision-BioLLM-KSU.


Pages: 13
