TVQA: Localized, Compositional Video Question Answering

Cited by: 0

Authors:

Lei, Jie [1]

Yu, Licheng [1]

Bansal, Mohit [1]

Berg, Tamara L. [1]

Affiliations:

[1] Univ N Carolina, Dept Comp Sci, Chapel Hill, NC 27515 USA

Source:

2018 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2018) | 2018

Funding:

National Science Foundation (USA);

Keywords:

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Recent years have witnessed an increasing interest in image-based question-answering (QA) tasks. However, due to data limitations, there has been much less work on video-based QA. In this paper, we present TVQA, a large-scale video QA dataset based on 6 popular TV shows. TVQA consists of 152,545 QA pairs from 21,793 clips, spanning over 460 hours of video. Questions are designed to be compositional in nature, requiring systems to jointly localize relevant moments within a clip, comprehend subtitle-based dialogue, and recognize relevant visual concepts. We provide analyses of this new dataset as well as several baselines and a multi-stream end-to-end trainable neural network framework for the TVQA task. The dataset is publicly available at http://tvqa.cs.unc.edu.

Citation

Pages: 1369-1379

Page count: 11
