TVQA: Localized, Compositional Video Question Answering

Cited by: 0
Authors:
Lei, Jie [1 ]
Yu, Licheng [1 ]
Bansal, Mohit [1 ]
Berg, Tamara L. [1 ]
Affiliations:
[1] Univ N Carolina, Dept Comp Sci, Chapel Hill, NC 27515 USA
Funding:
U.S. National Science Foundation
Keywords:
DOI:
Not available
CLC number:
TP18 [Artificial Intelligence Theory]
Discipline codes:
081104; 0812; 0835; 1405
Abstract
Recent years have witnessed an increasing interest in image-based question-answering (QA) tasks. However, due to data limitations, there has been much less work on video-based QA. In this paper, we present TVQA, a large-scale video QA dataset based on 6 popular TV shows. TVQA consists of 152,545 QA pairs from 21,793 clips, spanning over 460 hours of video. Questions are designed to be compositional in nature, requiring systems to jointly localize relevant moments within a clip, comprehend subtitle-based dialogue, and recognize relevant visual concepts. We provide analyses of this new dataset as well as several baselines and a multi-stream end-to-end trainable neural network framework for the TVQA task. The dataset is publicly available at http://tvqa.cs.unc.edu.
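
The abstract mentions a multi-stream, end-to-end trainable neural network baseline but does not detail it. The sketch below only illustrates the general idea behind such multi-stream video QA models: encode the video frames and the subtitle/question/answer text in separate streams, fuse them, and score each candidate answer. It is not the authors' architecture; the class name, feature dimensions, mean pooling, and late fusion are all illustrative assumptions (PyTorch).

    # Illustrative sketch only; not the TVQA authors' model. Assumes each question
    # comes with pre-extracted frame features and one encoded subtitle+question+answer
    # token sequence per candidate answer; all names and dimensions are hypothetical.
    import torch
    import torch.nn as nn

    class TwoStreamVideoQA(nn.Module):
        """Scores candidate answers by fusing a video stream with a text stream."""

        def __init__(self, video_dim=2048, vocab_size=10000, embed_dim=300, hidden=256):
            super().__init__()
            self.video_proj = nn.Linear(video_dim, hidden)          # video stream
            self.text_embed = nn.Embedding(vocab_size, embed_dim)   # subtitle+QA stream
            self.text_rnn = nn.LSTM(embed_dim, hidden, batch_first=True)
            self.scorer = nn.Linear(2 * hidden, 1)                  # late fusion -> answer score

        def forward(self, video_feats, text_ids):
            # video_feats: (num_answers, num_frames, video_dim); text_ids: (num_answers, num_tokens)
            v = self.video_proj(video_feats).mean(dim=1)            # mean-pool over frames
            t, _ = self.text_rnn(self.text_embed(text_ids))
            t = t.mean(dim=1)                                       # mean-pool over tokens
            return self.scorer(torch.cat([v, t], dim=-1)).squeeze(-1)

    if __name__ == "__main__":
        model = TwoStreamVideoQA()
        video = torch.randn(5, 20, 2048)           # 5 candidate answers share the clip's frame features
        text = torch.randint(0, 10000, (5, 60))    # 5 encoded subtitle+question+answer sequences
        scores = model(video, text)                # shape (5,); the argmax is the predicted answer
        print(scores.argmax().item())

In practice such a scorer would be trained with cross-entropy over the candidate-answer scores; richer models add attention between streams and explicit moment localization, in line with the task requirements described in the abstract.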
Pages: 1369-1379
Number of pages: 11
Related Papers
(50 in total)
  • [1] Measuring Compositional Consistency for Video Question Answering
    Gandhi, Mona
    Gul, Mustafa Omer
    Prakash, Eva
    Grunde-McLaughlin, Madeleine
    Krishna, Ranjay
    Agrawala, Maneesh
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 5036 - 5045
  • [2] Compositional Attention Networks With Two-Stream Fusion for Video Question Answering
    Yu, Ting
    Yu, Jun
    Yu, Zhou
    Tao, Dacheng
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2020, 29 : 1204 - 1218
  • [3] Event Graph Guided Compositional Spatial-Temporal Reasoning for Video Question Answering
    Bai, Ziyi
    Wang, Ruiping
    Gao, Difei
    Chen, Xilin
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2024, 33 : 1109 - 1121
  • [4] Affective question answering on video
    Ruwa, Nelson
    Mao, Qirong
    Wang, Liangjun
    Gou, Jianping
    NEUROCOMPUTING, 2019, 363 : 125 - 139
  • [5] Video Graph Transformer for Video Question Answering
    Xiao, Junbin
    Zhou, Pan
    Chua, Tat-Seng
    Yan, Shuicheng
    COMPUTER VISION, ECCV 2022, PT XXXVI, 2022, 13696 : 39 - 58
  • [6] Video Reference: A Video Question Answering Engine
    Gao, Lei
    Li, Guangda
    Zheng, Yan-Tao
    Hong, Richang
    Chua, Tat-Seng
    ADVANCES IN MULTIMEDIA MODELING, PROCEEDINGS, 2010, 5916 : 799 - +
  • [7] Compositional question answering: A divide and conquer approach
    Oh, Hyo-Jung
    Sung, Ki-Youn
    Tang, Myung-Gil
    Myaeng, Sung Hyon
    INFORMATION PROCESSING & MANAGEMENT, 2011, 47 (06) : 808 - 824
  • [8] Neural Compositional Denotational Semantics for Question Answering
    Gupta, Nitish
    Lewis, Mike
    2018 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2018), 2018, : 2152 - 2161
  • [9] Locate Before Answering: Answer Guided Question Localization for Video Question Answering
    Qian, Tianwen
    Cui, Ran
    Chen, Jingjing
    Peng, Pai
    Guo, Xiaowei
    Jiang, Yu-Gang
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 4554 - 4563