TVQA: Localized, Compositional Video Question Answering

Authors
Lei, Jie [1]
Yu, Licheng [1]
Bansal, Mohit [1]
Berg, Tamara L. [1]
Affiliations
[1] Univ N Carolina, Dept Comp Sci, Chapel Hill, NC 27515 USA
Funding
U.S. National Science Foundation
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Recent years have witnessed an increasing interest in image-based question-answering (QA) tasks. However, due to data limitations, there has been much less work on video-based QA. In this paper, we present TVQA, a large-scale video QA dataset based on 6 popular TV shows. TVQA consists of 152,545 QA pairs from 21,793 clips, spanning over 460 hours of video. Questions are designed to be compositional in nature, requiring systems to jointly localize relevant moments within a clip, comprehend subtitle-based dialogue, and recognize relevant visual concepts. We provide analyses of this new dataset as well as several baselines and a multi-stream end-to-end trainable neural network framework for the TVQA task. The dataset is publicly available at http://tvqa.cs.unc.edu.
Pages: 1369-1379
Number of pages: 11