TVQA: Localized, Compositional Video Question Answering

Authors
Lei, Jie [1]
Yu, Licheng [1]
Bansal, Mohit [1]
Berg, Tamara L. [1]
Affiliations
[1] Univ N Carolina, Dept Comp Sci, Chapel Hill, NC 27515 USA
Funding
National Science Foundation (USA)
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Recent years have witnessed an increasing interest in image-based question-answering (QA) tasks. However, due to data limitations, there has been much less work on video-based QA. In this paper, we present TVQA, a large-scale video QA dataset based on 6 popular TV shows. TVQA consists of 152,545 QA pairs from 21,793 clips, spanning over 460 hours of video. Questions are designed to be compositional in nature, requiring systems to jointly localize relevant moments within a clip, comprehend subtitle-based dialogue, and recognize relevant visual concepts. We provide analyses of this new dataset as well as several baselines and a multi-stream end-to-end trainable neural network framework for the TVQA task. The dataset is publicly available at http://tvqa.cs.unc.edu.
Pages: 1369-1379
Number of pages: 11