From Representation to Reasoning: Towards both Evidence and Commonsense Reasoning for Video Question-Answering

Cited by: 17
Authors
Li, Jiangtong [1]
Niu, Li [1]
Zhang, Liqing [1]
Affiliations
[1] Shanghai Jiao Tong Univ, Dept Comp Sci & Engn, MoE Key Lab Artificial Intelligence, Shanghai, Peoples R China
Funding
National Key Research and Development Program of China; US National Science Foundation
DOI
10.1109/CVPR52688.2022.02059
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Video understanding has achieved great success in representation learning tasks such as video captioning, video object grounding, and descriptive video question answering. However, current methods still struggle with video reasoning, including evidence reasoning and commonsense reasoning. To push video understanding toward video reasoning, we present the task of Causal-VidQA, which includes four types of questions ranging from scene description (description) to evidence reasoning (explanation) and commonsense reasoning (prediction and counterfactual). For commonsense reasoning, we set up a two-step solution: answering the question and providing a proper reason. Through extensive experiments on existing VideoQA methods, we find that state-of-the-art methods are strong at description but weak at reasoning. We hope that Causal-VidQA can guide research on video understanding from representation learning toward deeper reasoning. The dataset and related resources are available at https://github.com/bcmi/Causal-VidQA.git.
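The four question types and the two-step setup for commonsense questions can be illustrated with a small scoring sketch. This is not the authors' evaluation code. It assumes multiple-choice items identified by integer option indices, and it assumes that a prediction or counterfactual item counts as correct only when both the answer and the reason are right; the abstract does not state this rule. The names `QuestionType`, `Item`, `is_correct`, and `accuracy_by_type` are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable, Optional


class QuestionType(Enum):
    DESCRIPTION = "description"        # scene description
    EXPLANATION = "explanation"        # evidence reasoning
    PREDICTION = "prediction"          # commonsense reasoning
    COUNTERFACTUAL = "counterfactual"  # commonsense reasoning


# Question types that use the two-step (answer + reason) setup.
COMMONSENSE = {QuestionType.PREDICTION, QuestionType.COUNTERFACTUAL}


@dataclass
class Item:
    qtype: QuestionType
    gold_answer: int                   # index of the correct answer option (assumed format)
    pred_answer: int                   # index chosen by the model
    gold_reason: Optional[int] = None  # only used for commonsense questions
    pred_reason: Optional[int] = None


def is_correct(item: Item) -> bool:
    """Assumed rule: commonsense items need both answer and reason right."""
    answer_ok = item.pred_answer == item.gold_answer
    if item.qtype in COMMONSENSE:
        return answer_ok and item.pred_reason == item.gold_reason
    return answer_ok


def accuracy_by_type(items: Iterable[Item]) -> Dict[QuestionType, float]:
    """Accuracy per question type over the items that were supplied."""
    totals: Dict[QuestionType, int] = {}
    hits: Dict[QuestionType, int] = {}
    for it in items:
        totals[it.qtype] = totals.get(it.qtype, 0) + 1
        hits[it.qtype] = hits.get(it.qtype, 0) + int(is_correct(it))
    return {t: hits[t] / totals[t] for t in totals}
```

Under this assumed rule, a model that picks the right answer for a prediction question but the wrong reason receives no credit for that item, which separates reasoning ability from answer guessing.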
Pages: 21241-21250
Page count: 10
Related papers
50 in total
  • [41] Chain of Reasoning for Visual Question Answering
    Wu, Chenfei
    Liu, Jinlai
    Wang, Xiaojie
    Dong, Xuan
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 31 (NIPS 2018), 2018, 31
  • [42] Question-Answering with Logic Specific to Video Games
    Dumont, Corentin
    Tian, Ran
    Inui, Kentaro
    LREC 2016 - TENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2016, : 4637 - 4643
  • [43] GeoQA: A Geometric Question Answering Benchmark Towards Multimodal Numerical Reasoning
    Chen, Jiaqi
    Tang, Jianheng
    Qin, Jinghui
    Liang, Xiaodan
    Liu, Lingbo
    Xing, Eric P.
    Lin, Liang
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL-IJCNLP 2021, 2021, : 513 - 523
  • [44] Multi-Modal Correlated Network with Emotional Reasoning Knowledge for Social Intelligence Question-Answering
    Xie, Baijun
    Park, Chung Hyuk
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS, ICCVW, 2023, : 3067 - 3073
  • [45] Mineral question-answering system in Chinese based on multi-hop reasoning in knowledge graphs
    Ji, Xiaohui
    Dong, Yuhang
    Yang, Zhongji
    Yang, Mei
    He, Mingyue
    Wang, Yuzhu
    Earth Science Frontiers, 2024, 31 (04) : 37 - 46
  • [46] Learning from Inside: Self-driven Siamese Sampling and Reasoning for Video Question Answering
    Yu, Weijiang
    Zheng, Haoteng
    Li, Mengfei
    Ji, Lei
    Wu, Lijun
    Xiao, Nong
    Duan, Nan
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [47] DualVGR: A Dual-Visual Graph Reasoning Unit for Video Question Answering
    Wang, Jianyu
    Bao, Bing-Kun
    Xu, Changsheng
    IEEE TRANSACTIONS ON MULTIMEDIA, 2021, 24 : 3369 - 3380
  • [48] LiVLR: A Lightweight Visual-Linguistic Reasoning Framework for Video Question Answering
    Jiang, Jingjing
    Liu, Ziyi
    Zheng, Nanning
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 5002 - 5013
  • [49] HAIR: Hierarchical Visual-Semantic Relational Reasoning for Video Question Answering
    Liu, Fei
    Liu, Jing
    Wang, Weining
    Lu, Hanqing
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 1678 - 1687
  • [50] ReGR: Relation-aware graph reasoning framework for video question answering
    Wang, Zheng
    Li, Fangtao
    Ota, Kaoru
    Dong, Mianxiong
    Wu, Bin
    INFORMATION PROCESSING & MANAGEMENT, 2023, 60 (04)