Comprehensive Visual Question Answering on Point Clouds through Compositional Scene Manipulation

Cited by: 0
Authors
Yan, Xu [1]
Yuan, Zhihao [1]
Du, Yuhao [1]
Liao, Yinghong [1]
Guo, Yao [2]
Cui, Shuguang [1]
Li, Zhen [1]
Affiliations
[1] Chinese Univ Hong Kong, Future Network Intelligence Inst, Sch Sci & Engn, Shenzhen 518172, Peoples R China
[2] Shanghai Jiao Tong Univ, Inst Med Robot, Shanghai 200240, Peoples R China
Keywords
Three-dimensional displays; Image color analysis; Visualization; Task analysis; Engines; Solid modeling; Semantics; Deep learning; vision and language; 3D visual question answering; point cloud processing;
DOI
10.1109/TVCG.2023.3340679
Chinese Library Classification (CLC) number
TP31 [Computer software]
Discipline classification code
081202; 0835
Abstract
Visual Question Answering on 3D Point Clouds (VQA-3D) is an emerging yet challenging field that aims to answer various types of textual questions given an entire point cloud scene. To tackle this problem, we propose CLEVR3D, a large-scale VQA-3D dataset consisting of 171K questions from 8,771 3D scenes. Specifically, we develop a question engine that leverages 3D scene graph structures to generate diverse reasoning questions covering objects' attributes (i.e., size, color, and material) and their spatial relationships. In this manner, we initially generate 44K questions from 1,333 real-world scenes. Moreover, we propose a more challenging setup that removes confounding bias and adjusts the context away from a common-sense layout. Such a setup requires the network to achieve comprehensive visual understanding when the 3D scene differs from the general co-occurrence context (e.g., chairs always appear with tables). To this end, we further introduce a compositional scene manipulation strategy and generate 127K questions from 7,438 augmented 3D scenes, which can improve VQA-3D models for real-world comprehension. Building on the proposed dataset, we baseline several VQA-3D models, and experimental results verify that CLEVR3D can significantly boost other 3D scene understanding tasks.
Pages: 7473-7485
Number of pages: 13