CooKie: commonsense knowledge-guided mixture-of-experts framework for fine-grained visual question answering

Cited: 0
Authors
Wang, Chao [1 ,2 ]
Yang, Jianming [1 ,2 ]
Zhou, Yang [3 ,4 ]
Yue, Xiaodong [1 ,2 ]
Affiliations
[1] Shanghai Univ, Sch Future Technol, Shanghai 200444, Peoples R China
[2] Shanghai Univ, Inst Artificial Intelligence, Shanghai 200444, Peoples R China
[3] Shanghai Univ, Sch Comp Engn & Sci, Shanghai 200444, Peoples R China
[4] Shanghai Artificial Intelligence Lab, Shanghai 201114, Peoples R China
Funding
Shanghai Natural Science Foundation;
Keywords
Visual question answering; Multimodal large language models; Visual search; Commonsense knowledge; Object hallucinations;
DOI
10.1016/j.ins.2024.121742
CLC Number
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
In the fine-grained visual question answering (FG-VQA) task, VQA models need to answer questions about the detailed visual content of images using open-world knowledge. Recently, multimodal large language models (MLLMs) built on large language models have shown powerful multimodal performance across diverse tasks, including FG-VQA. Nonetheless, the generalization performance of these end-to-end MLLMs remains limited due to visual information loss caused by a lack of flexibility (i.e., one-off, task-agnostic visual perception), especially in high-resolution scenarios. As an alternative, another line of methods uses an LLM-based tool-using system, adopting flexible strategies and calling expert models to handle various situations. However, these solutions still achieve unsatisfactory improvement due to the following limitations: 1) simplistic task decomposition that ignores the potential failure of sub-tasks, and 2) over-reliance on the expert models' limited abilities. To this end, we extend the use of LLM-based systems and propose a simple but effective framework called CooKie (Commonsense Knowledge-guided Mixture-of-Experts), which leverages the inherent knowledge of an MLLM as guidance for conducting visual searches in a manner similar to human behavior. In addition to this region-aware search strategy, we introduce a robust post-hoc selection module with a Mixture-of-Experts strategy, integrating another expert model to provide a supplementary reference. In the experiments, CooKie achieves a new state-of-the-art (SOTA) accuracy of 79.06% on V* Bench. Meanwhile, it achieves about 3% improvement on POPE and GQA with comparable performance on MME-Bench. This indicates that our method specializes in high-resolution FG-VQA tasks as well as object hallucination reduction while maintaining general multimodal capability. Furthermore, we manually construct additional instances as an expansion of V* Bench to better evaluate the generalization performance of our method, demonstrating CooKie's effectiveness.
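The abstract outlines two mechanisms: a knowledge-guided, region-aware visual search (try candidate regions rather than one-off whole-image perception) and a post-hoc Mixture-of-Experts selection over competing expert answers. The sketch below illustrates those two ideas only; it is not the authors' implementation, and every name in it (`guided_visual_search`, `moe_select`, the stub region proposer and expert, the 0.5 confidence threshold) is a hypothetical placeholder.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

@dataclass
class Region:
    """A candidate image crop (pixel coordinates)."""
    x: int
    y: int
    w: int
    h: int

def guided_visual_search(
    question: str,
    image_size: Tuple[int, int],
    propose_regions: Callable[[str], Iterable[Region]],
    answer_on_region: Callable[[str, Region], Tuple[str, float]],
    threshold: float = 0.5,
) -> Tuple[str, float]:
    """Query knowledge-proposed regions until one yields a confident answer;
    fall back to the full image if no sub-region suffices."""
    for region in propose_regions(question):
        answer, score = answer_on_region(question, region)
        if score >= threshold:
            return answer, score
    return answer_on_region(question, Region(0, 0, *image_size))

def moe_select(candidates: List[Tuple[str, float]]) -> Tuple[str, float]:
    """Post-hoc selection: keep the most confident expert answer."""
    return max(candidates, key=lambda c: c[1])

# Toy demo with stub "experts".
def propose(_question: str) -> List[Region]:
    return [Region(100, 40, 64, 64), Region(10, 10, 32, 32)]

def vqa_expert(_question: str, region: Region) -> Tuple[str, float]:
    # Pretend only the small crop yields a confident fine-grained answer.
    return ("red", 0.9) if region.w == 32 else ("unknown", 0.2)

primary = guided_visual_search("What color is the cup?", (640, 480), propose, vqa_expert)
final = moe_select([primary, ("blue", 0.4)])  # second expert as supplementary reference
print(final)  # -> ('red', 0.9)
```

Here the search rejects the first low-confidence crop, accepts the second, and the MoE step keeps that answer over the weaker supplementary expert.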
Pages: 20