Comprehensive Visual Question Answering on Point Clouds through Compositional Scene Manipulation

Cited by: 0
Authors
Yan, Xu [1 ]
Yuan, Zhihao [1 ]
Du, Yuhao [1 ]
Liao, Yinghong [1 ]
Guo, Yao [2 ]
Cui, Shuguang [1 ]
Li, Zhen [1 ]
Affiliations
[1] Chinese Univ Hong Kong, Future Network Intelligence Inst, Sch Sci & Engn, Shenzhen 518172, Peoples R China
[2] Shanghai Jiao Tong Univ, Inst Med Robot, Shanghai 200240, Peoples R China
Keywords
Three-dimensional displays; Image color analysis; Visualization; Task analysis; Engines; Solid modeling; Semantics; Deep learning; vision and language; 3D visual question answering; point cloud processing;
DOI
10.1109/TVCG.2023.3340679
CLC number (Chinese Library Classification)
TP31 [Computer software]
Discipline codes
081202; 0835
Abstract
Visual Question Answering on 3D Point Cloud (VQA-3D) is an emerging yet challenging field that aims at answering various types of textual questions given an entire point cloud scene. To tackle this problem, we propose CLEVR3D, a large-scale VQA-3D dataset consisting of 171K questions from 8,771 3D scenes. Specifically, we develop a question engine leveraging 3D scene graph structures to generate diverse reasoning questions, covering questions about objects' attributes (i.e., size, color, and material) and their spatial relationships. In this manner, we initially generate 44K questions from 1,333 real-world scenes. Moreover, a more challenging setup is proposed to remove confounding bias and adjust the context away from the common-sense layout. Such a setup requires the network to achieve comprehensive visual understanding when the 3D scene differs from the general co-occurrence context (e.g., chairs usually co-occurring with tables). To this end, we further introduce a compositional scene manipulation strategy and generate 127K questions from 7,438 augmented 3D scenes, which improves the real-world comprehension of VQA-3D models. Built upon the proposed dataset, we benchmark several VQA-3D models, and the experimental results verify that CLEVR3D can significantly boost other 3D scene understanding tasks.
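For illustration, the following is a minimal, hypothetical Python sketch of how template-based questions could be generated from a 3D scene graph, in the spirit of the question engine described in the abstract. The node/relation structures, templates, and attribute names here are assumptions made for exposition and do not reproduce the authors' actual CLEVR3D engine.

```python
# Hypothetical sketch: template-based question generation from a 3D scene graph.
# Data structures and templates are illustrative assumptions, not the CLEVR3D engine.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class SceneNode:
    label: str                  # object category, e.g. "chair"
    attributes: Dict[str, str]  # e.g. {"color": "brown", "material": "wood"}


@dataclass
class SceneGraph:
    nodes: List[SceneNode]
    # directed spatial relations as (subject_index, relation, object_index)
    relations: List[Tuple[int, str, int]] = field(default_factory=list)


def attribute_questions(graph: SceneGraph) -> List[Tuple[str, str]]:
    """Generate (question, answer) pairs about object attributes."""
    qa = []
    for node in graph.nodes:
        for attr, value in node.attributes.items():
            qa.append((f"What is the {attr} of the {node.label}?", value))
    return qa


def relation_questions(graph: SceneGraph) -> List[Tuple[str, str]]:
    """Generate (question, answer) pairs about pairwise spatial relations."""
    qa = []
    for subj, rel, obj in graph.relations:
        subject, target = graph.nodes[subj], graph.nodes[obj]
        qa.append((f"What is {rel} the {target.label}?", subject.label))
    return qa


if __name__ == "__main__":
    scene = SceneGraph(
        nodes=[
            SceneNode("chair", {"color": "brown", "material": "wood"}),
            SceneNode("table", {"color": "white", "material": "metal"}),
        ],
        relations=[(0, "next to", 1)],
    )
    for question, answer in attribute_questions(scene) + relation_questions(scene):
        print(f"Q: {question}  A: {answer}")
```

The compositional scene manipulation described in the abstract could, under the same assumptions, be viewed as editing such a graph (e.g., replacing or removing nodes) before regenerating questions, so that object co-occurrence no longer follows the common-sense layout.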
Pages: 7473-7485
Page count: 13
Related papers
48 entries in total
  • [21] Handling language prior and compositional reasoning issues in Visual Question Answering system
    Chowdhury, Souvik
    Soni, Badal
    NEUROCOMPUTING, 2025, 635
  • [22] Transductive Cross-Lingual Scene-Text Visual Question Answering
    Li, Lin
    Zhang, Haohan
    Fang, Zeqin
    Xie, Zhongwei
    Liu, Jianquan
    NEURAL INFORMATION PROCESSING, ICONIP 2023, PT VI, 2024, 14452 : 452 - 467
  • [23] Graphhopper: Multi-hop Scene Graph Reasoning for Visual Question Answering
    Koner, Rajat
    Li, Hang
    Hildebrandt, Marcel
    Das, Deepan
    Tresp, Volker
    Guennemann, Stephan
    SEMANTIC WEB - ISWC 2021, 2021, 12922 : 111 - 127
  • [24] Knowledge enhancement and scene understanding for knowledge-based visual question answering
    Su, Zhenqiang
    Gou, Gang
    KNOWLEDGE AND INFORMATION SYSTEMS, 2024, 66 (03) : 2193 - 2208
  • [26] Multimodal grid features and cell pointers for scene text visual question answering
    Gomez, Lluis
    Biten, Ali Furkan
    Tito, Ruben
    Mafla, Andres
    Rusinol, Marcal
    Valveny, Ernest
    Karatzas, Dimosthenis
    PATTERN RECOGNITION LETTERS, 2021, 150 : 242 - 249
  • [27] Question-aware dynamic scene graph of local semantic representation learning for visual question answering
    Wu, Jinmeng
    Ge, Fulin
    Hong, Hanyu
    Shi, Yu
    Hao, Yanbin
    Ma, Lei
    PATTERN RECOGNITION LETTERS, 2023, 170 : 93 - 99
  • [28] Visual explainable artificial intelligence for graph-based visual question answering and scene graph curation
    Sebastian Künzel
    Tanja Munz-Körner
    Pascal Tilli
    Noel Schäfer
    Sandeep Vidyapu
    Ngoc Thang Vu
    Daniel Weiskopf
    Visual Computing for Industry, Biomedicine, and Art, 8 (1)
  • [29] Visual question answering based on local-scene-aware referring expression generation
    Kim, Jung-Jun
    Lee, Dong-Gyu
    Wu, Jialin
    Jung, Hong-Gyu
    Lee, Seong-Whan
    NEURAL NETWORKS, 2021, 139 : 158 - 167
  • [30] DSGEM: Dual scene graph enhancement module-based visual question answering
    Wang, Boyue
    Ma, Yujian
    Li, Xiaoyan
    Liu, Heng
    Hu, Yongli
    Yin, Baocai
    IET COMPUTER VISION, 2023, 17 (06) : 638 - 651