Incorporating 3D Information into Visual Question Answering

Cited by: 4
Authors
Qiu, Yue [1,2]
Satoh, Yutaka [1,2]
Suzuki, Ryota [1]
Kataoka, Hirokatsu [1]
Affiliations
[1] Natl Inst Adv Ind Sci & Technol, Tsukuba, Ibaraki, Japan
[2] Univ Tsukuba, Tsukuba, Ibaraki, Japan
Keywords
DOI
10.1109/3DV.2019.00088
CLC Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
We propose an approach for advancing the Visual Question Answering (VQA) task by incorporating 3D information obtained from multi-view images. Conventional VQA approaches, which answer a natural-language question about a single RGB image in words, have limited ability to recognize geometric information and therefore tend to fail at counting objects or inferring positional relationships. Moreover, they cannot reason about occluded space, which makes it impractical to deploy VQA on robots operating in highly occluded real-world environments. To address this, we introduce a new multi-view VQA dataset along with an approach that incorporates 3D scene information captured directly from multi-view images into VQA, without using depth images or SLAM. Our proposed approach achieves strong performance, with an overall accuracy of 95.4% on the challenging multi-view VQA dataset setting, which contains relatively severe occlusion. This work also demonstrates the promise of bridging the gap between 3D vision and language.
Pages: 756-765
Number of pages: 10
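
The abstract describes answering questions by aggregating information from multiple RGB views, without depth images or SLAM. As a rough illustration only (not the authors' published architecture), the sketch below shows one common way such a multi-view VQA model could be assembled: encode each view, pool the features across views, encode the question, and classify over the fused representation. All module names, dimensions, and the mean-pooling fusion (e.g., MultiViewVQA, hidden_dim) are illustrative assumptions.

# Hedged sketch of a multi-view VQA model; every design choice here is an assumption.
import torch
import torch.nn as nn

class MultiViewVQA(nn.Module):
    def __init__(self, vocab_size, num_answers, embed_dim=300, hidden_dim=512):
        super().__init__()
        # Per-view image encoder (a small CNN stands in for any backbone).
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, hidden_dim),
        )
        # Question encoder: word embedding followed by an LSTM.
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Answer classifier over the concatenated scene and question features.
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim * 2, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_answers),
        )

    def forward(self, views, question):
        # views: (batch, num_views, 3, H, W); question: (batch, seq_len) word ids
        b, v, c, h, w = views.shape
        feats = self.image_encoder(views.view(b * v, c, h, w)).view(b, v, -1)
        scene = feats.mean(dim=1)                  # pool features across views
        _, (q, _) = self.lstm(self.embedding(question))
        fused = torch.cat([scene, q[-1]], dim=-1)  # fuse scene and question
        return self.classifier(fused)              # answer logits

# Example usage with random inputs: 2 scenes, 4 views each, 8-word questions.
model = MultiViewVQA(vocab_size=1000, num_answers=28)
logits = model(torch.randn(2, 4, 3, 64, 64), torch.randint(0, 1000, (2, 8)))
print(logits.shape)  # torch.Size([2, 28])

Mean pooling across views is only one option; attention over views or geometry-aware fusion would be natural alternatives, but the paper itself should be consulted for the actual method.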