Comprehensive Visual Question Answering on Point Clouds through Compositional Scene Manipulation

Cited by: 0
Authors
Yan, Xu [1]
Yuan, Zhihao [1]
Du, Yuhao [1]
Liao, Yinghong [1]
Guo, Yao [2]
Cui, Shuguang [1]
Li, Zhen [1]
Affiliations
[1] Chinese Univ Hong Kong, Future Network Intelligence Inst, Sch Sci & Engn, Shenzhen 518172, Peoples R China
[2] Shanghai Jiao Tong Univ, Inst Med Robot, Shanghai 200240, Peoples R China
Keywords
Three-dimensional displays; Image color analysis; Visualization; Task analysis; Engines; Solid modeling; Semantics; Deep learning; vision and language; 3D visual question answering; point cloud processing
DOI
10.1109/TVCG.2023.3340679
Chinese Library Classification (CLC) number
TP31 [Computer Software]
Discipline classification code
081202; 0835
Abstract
Visual Question Answering on 3D Point Clouds (VQA-3D) is an emerging yet challenging field that aims to answer various types of textual questions given an entire point cloud scene. To tackle this problem, we propose CLEVR3D, a large-scale VQA-3D dataset consisting of 171K questions from 8,771 3D scenes. Specifically, we develop a question engine that leverages 3D scene graph structures to generate diverse reasoning questions covering objects' attributes (i.e., size, color, and material) and their spatial relationships. In this way, we initially generate 44K questions from 1,333 real-world scenes. Moreover, we propose a more challenging setup that removes the confounding bias and adjusts the context away from a common-sense layout. Such a setup requires the network to achieve comprehensive visual understanding when the 3D scene differs from the general co-occurrence context (e.g., chairs always appear with tables). To this end, we further introduce a compositional scene manipulation strategy and generate 127K questions from 7,438 augmented 3D scenes, which can improve VQA-3D models for real-world comprehension. Building on the proposed dataset, we benchmark several VQA-3D models, and experimental results verify that CLEVR3D can significantly boost other 3D scene understanding tasks.
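As a rough illustration of the idea in the abstract, the sketch below generates template questions from a toy scene graph and then swaps one object to break a common co-occurrence. It is not the authors' question engine: the class names, the question templates, the `replace_object` helper, and the toy scene are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneObject:
    # Hypothetical attribute set mirroring the abstract: size, color, material.
    name: str
    size: str
    color: str
    material: str


@dataclass
class SceneGraph:
    objects: dict    # object id -> SceneObject
    relations: list  # (subject id, predicate, object id) triples


def attribute_questions(graph):
    """Yield (question, answer) pairs about attributes of uniquely named objects."""
    names = [obj.name for obj in graph.objects.values()]
    for obj in graph.objects.values():
        if names.count(obj.name) != 1:
            continue  # skip objects whose name would be an ambiguous reference
        for attr in ("size", "color", "material"):
            yield f"What is the {attr} of the {obj.name}?", getattr(obj, attr)


def relation_questions(graph):
    """Yield (question, answer) pairs about spatial relations with a unique answer."""
    for subj_id, predicate, obj_id in graph.relations:
        matches = [s for s, p, o in graph.relations if p == predicate and o == obj_id]
        if len(matches) == 1:
            anchor = graph.objects[obj_id]
            yield f"What is {predicate} the {anchor.name}?", graph.objects[subj_id].name


def replace_object(graph, obj_id, new_obj):
    """Return a copy of the graph with one object swapped (a toy scene manipulation)."""
    objects = dict(graph.objects)
    objects[obj_id] = new_obj
    return SceneGraph(objects=objects, relations=list(graph.relations))


# Toy scene: a chair next to a table (a common co-occurrence).
scene = SceneGraph(
    objects={
        0: SceneObject("table", "large", "brown", "wood"),
        1: SceneObject("chair", "small", "black", "plastic"),
    },
    relations=[(1, "next to", 0)],
)
original_qa = list(attribute_questions(scene)) + list(relation_questions(scene))

# Manipulated scene: the chair is replaced by an object that rarely appears with a table.
manipulated = replace_object(scene, 1, SceneObject("bathtub", "large", "white", "ceramic"))
manipulated_qa = list(relation_questions(manipulated))
```

In this toy setup the same relation question receives a different answer after the swap, so a model cannot answer it from co-occurrence priors alone and has to inspect the scene.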
Pages: 7473-7485
Page count: 13