Question action relevance and editing for visual question answering

Citations: 11
Authors
Toor, Andeep S. [1 ]
Wechsler, Harry [1 ]
Nappi, Michele [2 ]
Affiliations
[1] George Mason Univ, Dept Comp Sci, Fairfax, VA 22030 USA
[2] Univ Salerno, Dipartimento Informat, Fisciano, Italy
Keywords
Computer vision; Visual question answering; Deep learning; Action recognition; Image understanding; Question relevance;
DOI
10.1007/s11042-018-6097-z
Chinese Library Classification
TP [Automation technology; computer technology]
Discipline code
0812
Abstract
Visual Question Answering (VQA) expands on the Turing Test, as it involves the ability to answer questions about visual content. Current efforts in VQA, however, still do not fully consider whether a question about visual content is relevant and, if it is not, how best to edit it to make it answerable. Question relevance has so far been considered only at the level of a whole question, using binary classification and without the capability to edit a question to make it grounded and intelligible. The only exception is our prior research effort into question part relevance, which allows for relevance determination and editing based on object nouns. This paper extends that previous work on object relevance to determine the relevance of a question action and leverages this capability to edit an irrelevant question to make it relevant. Practical applications of such a capability include answering biometric-related queries across a set of images, including people and their actions (behavioral biometrics). The feasibility of our approach is shown using Context-Collaborative VQA (C2VQA) Action/Relevance/Edit (ARE). Our results show that our proposed approach outperforms all other models by a significant margin on the novel tasks of question action relevance (QAR) and question action editing (QAE). The ultimate goal for future research is to address full-fledged W5+ types of inquiries (What, Where, When, Why, Who, and How) that are grounded to and reference video using both nouns and verbs in a collaborative, context-aware fashion.
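The abstract frames question relevance as a binary decision over whether a question's terms (object nouns and action verbs) are grounded in the image, with editing replacing ungrounded terms. The following is a minimal toy sketch of that idea only; the function names, the label-set representation, and the substitution table are all illustrative assumptions, not the C2VQA-ARE architecture described in the paper:

```python
# Toy illustration of question relevance (binary classification over
# grounding) and question editing (replacing ungrounded terms).
# Assumed inputs: question terms as strings, image content as a set of
# detected labels, and a hypothetical substitution table mapping an
# ungrounded term to a grounded alternative.

def question_relevance(question_terms, image_labels):
    """Return (is_relevant, ungrounded_terms).

    A question is treated as relevant only if every term is
    grounded in the image's detected labels.
    """
    ungrounded = [t for t in question_terms if t not in image_labels]
    return (len(ungrounded) == 0, ungrounded)


def edit_question(question_terms, image_labels, substitutions):
    """Replace ungrounded terms with grounded substitutes when possible.

    Returns the edited term list, or None if no grounded edit exists.
    """
    edited = []
    for t in question_terms:
        if t in image_labels:
            edited.append(t)
        elif t in substitutions and substitutions[t] in image_labels:
            edited.append(substitutions[t])
        else:
            return None  # term cannot be grounded; question stays irrelevant
    return edited


# Example: asking "is the man running?" about an image of a man walking.
labels = {"man", "walking", "park"}
relevant, missing = question_relevance(["man", "running"], labels)
# → relevant is False, missing == ["running"]
edited = edit_question(["man", "running"], labels, {"running": "walking"})
# → ["man", "walking"]
```

In the paper's setting the grounding and substitution steps are learned (deep action recognition over the image rather than a lookup table); the sketch only shows the relevance-then-edit control flow the abstract describes.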
Pages: 2921-2935
Page count: 15
Related Papers
50 items
  • [31] An Analysis of Visual Question Answering Algorithms
    Kafle, Kushal
    Kanan, Christopher
    2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2017, : 1983 - 1991
  • [32] Multimodal Attention for Visual Question Answering
    Kodra, Lorena
    Mece, Elinda Kajo
    INTELLIGENT COMPUTING, VOL 1, 2019, 858 : 783 - 792
  • [33] Affective Visual Question Answering Network
    Ruwa, Nelson
    Mao, Qirong
    Wang, Liangjun
    Dong, Ming
    IEEE 1ST CONFERENCE ON MULTIMEDIA INFORMATION PROCESSING AND RETRIEVAL (MIPR 2018), 2018, : 170 - 173
  • [34] Visual Question Answering on 360° Images
    Chou, Shih-Han
    Chao, Wei-Lun
    Lai, Wei-Sheng
    Sun, Min
    Yang, Ming-Hsuan
    2020 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2020, : 1596 - 1605
  • [35] Medical visual question answering: A survey
    Lin, Zhihong
    Zhang, Donghao
    Tao, Qingyi
    Shi, Danli
    Haffari, Gholamreza
    Wu, Qi
    He, Mingguang
    Ge, Zongyuan
    ARTIFICIAL INTELLIGENCE IN MEDICINE, 2023, 143
  • [36] Chain of Reasoning for Visual Question Answering
    Wu, Chenfei
    Liu, Jinlai
    Wang, Xiaojie
    Dong, Xuan
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 31 (NIPS 2018), 2018, 31
  • [37] Visual Question Answering as Reading Comprehension
    Li, Hui
    Wang, Peng
    Shen, Chunhua
    van den Hengel, Anton
    2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 6312 - 6321
  • [38] Revisiting Visual Question Answering Baselines
    Jabri, Allan
    Joulin, Armand
    van der Maaten, Laurens
    COMPUTER VISION - ECCV 2016, PT VIII, 2016, 9912 : 727 - 739
  • [39] Answer Distillation for Visual Question Answering
    Fang, Zhiwei
    Liu, Jing
    Tang, Qu
    Li, Yong
    Lu, Hanqing
    COMPUTER VISION - ACCV 2018, PT I, 2019, 11361 : 72 - 87
  • [40] iVQA: Inverse Visual Question Answering
    Liu, Feng
    Xiang, Tao
    Hospedales, Timothy M.
    Yang, Wankou
    Sun, Changyin
    2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, : 8611 - 8619