Multi-scale dual-stream visual feature extraction and graph reasoning for visual question answering

Times Cited: 0
Authors
Yusuf, Abdulganiyu Abdu [1 ,4 ]
Feng, Chong [1 ,2 ]
Mao, Xianling [1 ,3 ]
Li, Xinyan [5 ]
Haruna, Yunusa [6 ]
Duma, Ramadhani Ally [1 ]
Affiliations
[1] Beijing Inst Technol, Sch Comp Sci & Technol, Beijing 10008, Peoples R China
[2] Beijing Inst Technol, South East Informat Technol Inst, Beijing 10008, Peoples R China
[3] Beijing Engn Res Ctr High Volume Language Informat, Beijing, Peoples R China
[4] Natl Biotechnol Dev Agcy, Abuja, Nigeria
[5] China North Vehicle Res Inst, Informat & Control Dept, Beijing 100072, Peoples R China
[6] Beihang Univ, Sch Automat Sci & Elect Engn, Beijing, Peoples R China
Keywords
Visual question answering; Dual-stream features; Visual graph reasoning; Visual semantics; Attention mechanisms; FUSION;
DOI
10.1007/s10489-025-06325-4
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Recent advancements in deep learning algorithms have significantly expanded the capabilities of systems to handle vision-to-language (V2L) tasks. Visual question answering (VQA) presents challenges that require a deep understanding of visual and language content to perform complex reasoning tasks. Existing VQA models often rely on grid-based or region-based visual features, which capture global context and object-specific details, respectively. However, balancing the complementary strengths of each feature type while minimizing fusion noise remains a significant challenge. This study proposes a multi-scale dual-stream visual feature extraction method that combines grid and region features to enhance both global and local visual feature representations. In addition, a visual graph relational reasoning (VGRR) approach is proposed to further improve reasoning by constructing a graph that models spatial and semantic relationships between visual objects, using Graph Attention Networks (GATs) for relational reasoning. To enhance the interaction between visual and textual modalities, we further propose a cross-modal self-attention fusion strategy, which enables the model to focus selectively on the most relevant parts of both the image and the question. The proposed model is evaluated on the VQA 2.0 and GQA benchmark datasets, demonstrating competitive performance with significant accuracy improvements compared to state-of-the-art methods. Ablation studies confirm the effectiveness of each module in enhancing visual-textual understanding and answer prediction.
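The two core mechanisms named in the abstract, GAT-style relational reasoning over visual object nodes and cross-modal attention between question tokens and visual features, can be sketched in NumPy as below. This is a minimal illustrative sketch only: the hidden size, single-head formulation, fully connected visual graph, and all weight shapes are assumptions for demonstration, not the paper's actual architecture or hyperparameters.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d = 32          # hidden size (assumed for illustration)
n_regions = 6   # region-level visual features, one graph node each
n_tokens = 8    # question token embeddings

V = rng.normal(size=(n_regions, d))   # visual node features
Q = rng.normal(size=(n_tokens, d))    # question token features

# --- GAT-style relational reasoning over visual nodes (single head) ---
W = rng.normal(size=(d, d)) / np.sqrt(d)
a_src = rng.normal(size=(d,))         # attention vector, source half
a_dst = rng.normal(size=(d,))         # attention vector, destination half

H = V @ W                                            # project node features
e = (H @ a_src)[:, None] + (H @ a_dst)[None, :]      # pairwise scores e_ij
e = np.where(e > 0, e, 0.2 * e)                      # LeakyReLU (slope 0.2)
alpha = softmax(e, axis=1)                           # normalize over neighbours
V_graph = alpha @ H                                  # relation-aware node features

# --- Cross-modal attention: question tokens attend to visual nodes ---
Wq = rng.normal(size=(d, d)) / np.sqrt(d)
Wk = rng.normal(size=(d, d)) / np.sqrt(d)
Wv = rng.normal(size=(d, d)) / np.sqrt(d)
att = softmax((Q @ Wq) @ (V_graph @ Wk).T / np.sqrt(d), axis=-1)
fused = att @ (V_graph @ Wv)          # (n_tokens, d) question-grounded visual features
```

In a full model the graph would typically be restricted by spatial or semantic adjacency rather than fully connected, and multiple attention heads would be concatenated; the sketch keeps one dense head for clarity.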
Pages: 18
Related Papers
50 records in total
  • [1] Modular dual-stream visual fusion network for visual question answering
    Xue, Lixia
    Wang, Wenhao
    Wang, Ronggui
    Yang, Juan
    VISUAL COMPUTER, 2025, 41 (01): : 549 - 562
  • [2] DSAMR: Dual-Stream Attention Multi-hop Reasoning for knowledge-based visual question answering
    Sun, Yanhan
    Zhu, Zhenfang
    Zuo, Zicheng
    Li, Kefeng
    Gong, Shuai
    Qi, Jiangtao
    EXPERT SYSTEMS WITH APPLICATIONS, 2024, 245
  • [3] Multi-scale Relational Reasoning with Regional Attention for Visual Question Answering
    Ma, Yuntao
    Lu, Tong
    Wu, Yirui
    2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2021, : 5642 - 5649
  • [4] Multi-scale relation reasoning for multi-modal Visual Question Answering
    Wu, Yirui
    Ma, Yuntao
    Wan, Shaohua
    SIGNAL PROCESSING-IMAGE COMMUNICATION, 2021, 96
  • [5] Designing and Evaluating a Dual-Stream Transformer-Based Architecture for Visual Question Answering
    Shehzad, Faheem
    Minutolo, Aniello
    Esposito, Massimo
    IEEE ACCESS, 2024, 12 : 195561 - 195574
  • [6] DualVGR: A Dual-Visual Graph Reasoning Unit for Video Question Answering
    Wang, Jianyu
    Bao, Bing-Kun
    Xu, Changsheng
    IEEE TRANSACTIONS ON MULTIMEDIA, 2021, 24 : 3369 - 3380
  • [7] Enhancing Remote Sensing Visual Question Answering: A Mask-Based Dual-Stream Feature Mutual Attention Network
    Li, Yangyang
    Ma, Yunfei
    Liu, Guangyuan
    Wei, Qiang
    Chen, Yanqiao
    Shang, Ronghua
    Jiao, Licheng
    IEEE GEOSCIENCE AND REMOTE SENSING LETTERS, 2024, 21 : 1 - 5
  • [8] Dual-stream autoencoder for channel-level multi-scale feature extraction in hyperspectral unmixing
    Gan, Yuquan
    Wang, Yong
    Li, Qiuyu
    Luo, Yiming
    Wang, Yihong
    Pan, Yushan
    KNOWLEDGE-BASED SYSTEMS, 2025, 317
  • [9] Graphhopper: Multi-hop Scene Graph Reasoning for Visual Question Answering
    Koner, Rajat
    Li, Hang
    Hildebrandt, Marcel
    Das, Deepan
    Tresp, Volker
    Guennemann, Stephan
    SEMANTIC WEB - ISWC 2021, 2021, 12922 : 111 - 127
  • [10] A question-guided multi-hop reasoning graph network for visual question answering
    Xu, Zhaoyang
    Gu, Jinguang
    Liu, Maofu
    Zhou, Guangyou
    Fu, Haidong
    Qiu, Chen
    INFORMATION PROCESSING & MANAGEMENT, 2023, 60 (02)