Comprehensive Visual Question Answering on Point Clouds through Compositional Scene Manipulation

被引:0
|
作者
Yan, Xu [1 ]
Yuan, Zhihao [1 ]
Du, Yuhao [1 ]
Liao, Yinghong [1 ]
Guo, Yao [2 ]
Cui, Shuguang [1 ]
Li, Zhen [1 ]
机构
[1] Chinese Univ Hong Kong, Future Network Intelligence Inst, Sch Sci & Engn, Shenzhen 518172, Peoples R China
[2] Shanghai Jiao Tong Univ, Inst Med Robot, Shanghai 200240, Peoples R China
关键词
Three-dimensional displays; Image color analysis; Visualization; Task analysis; Engines; Solid modeling; Semantics; Deep learning; vision and language; 3D visual question answering; point cloud processing;
D O I
10.1109/TVCG.2023.3340679
中图分类号
TP31 [计算机软件];
学科分类号
081202 ; 0835 ;
摘要
Visual Question Answering on 3D Point Cloud (VQA-3D) is an emerging yet challenging field that aims at answering various types of textual questions given an entire point cloud scene. To tackle this problem, we propose the CLEVR3D, a large-scale VQA-3D dataset consisting of 171K questions from 8,771 3D scenes. Specifically, we develop a question engine leveraging 3D scene graph structures to generate diverse reasoning questions, covering the questions of objects' attributes (i.e., size, color, and material) and their spatial relationships. Through such a manner, we initially generated 44K questions from 1,333 real-world scenes. Moreover, a more challenging setup is proposed to remove the confounding bias and adjust the context from a common-sense layout. Such a setup requires the network to achieve comprehensive visual understanding when the 3D scene is different from the general co-occurrence context (e.g., chairs always exist with tables). To this end, we further introduce the compositional scene manipulation strategy and generate 127K questions from 7,438 augmented 3D scenes, which can improve VQA-3D models for real-world comprehension. Built upon the proposed dataset, we baseline several VQA-3D models, where experimental results verify that the CLEVR3D can significantly boost other 3D scene understanding tasks.
引用
收藏
页码:7473 / 7485
页数:13
相关论文
共 48 条
  • [1] Scene Text Visual Question Answering
    Biten, Ali Furkan
    Tito, Ruben
    Mafla, Andres
    Gomez, Lluis
    Rusinol, Marcal
    Valveny, Ernest
    Jawahar, C. V.
    Karatzas, Dimosthenis
    2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, : 4290 - 4300
  • [2] Compositional Substitutivity of Visual Reasoning for Visual Question Answering
    Li, Chuanhao
    Li, Zhen
    Jing, Chenchen
    Wu, Yuwei
    Zhai, Mingliang
    Jia, Yunde
    COMPUTER VISION - ECCV 2024, PT XLVIII, 2025, 15106 : 143 - 160
  • [3] Maintaining Reasoning Consistency in Compositional Visual Question Answering
    Jing, Chenchen
    Jia, Yunde
    Wu, Yuwei
    Liu, Xinyu
    Wu, Qi
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 5089 - 5098
  • [4] A Multilingual Approach to Scene Text Visual Question Answering
    Brugues i Pujolras, Josep
    Gomez i Bigorda, Llufs
    Karatzas, Dimosthenis
    DOCUMENT ANALYSIS SYSTEMS, DAS 2022, 2022, 13237 : 65 - 79
  • [5] Lightweight Visual Question Answering using Scene Graphs
    Nuthalapati, Sai Vidyaranya
    Chandradevan, Ramraj
    Giunchiglia, Eleonora
    Li, Bowen
    Kayser, Maxime
    Lukasiewicz, Thomas
    Yang, Carl
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT, CIKM 2021, 2021, : 3353 - 3357
  • [6] Scene Graph Refinement Network for Visual Question Answering
    Qian, Tianwen
    Chen, Jingjing
    Chen, Shaoxiang
    Wu, Bo
    Jiang, Yu-Gang
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 3950 - 3961
  • [7] Visual Causal Scene Refinement for Video Question Answering
    Wei, Yushen
    Liu, Yang
    Yan, Hong
    Li, Guanbin
    Lin, Liang
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 377 - 386
  • [8] Multimodal Graph Networks for Compositional Generalization in Visual Question Answering
    Saqur, Raeid
    Narasimhan, Karthik
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 33, NEURIPS 2020, 2020, 33
  • [9] Scene text visual question answering by using YOLO and STN
    Nourali K.
    Dolkhani E.
    International Journal of Speech Technology, 2024, 27 (01) : 69 - 76
  • [10] Towards Reasoning Ability in Scene Text Visual Question Answering
    Wang, Qingqing
    Xiao, Liqiang
    Lu, Yue
    Jin, Yaohui
    He, Hao
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 2281 - 2289