ENHANCING AUDIO-VISUAL QUESTION ANSWERING WITH MISSING MODALITY VIA TRANS-MODAL ASSOCIATIVE LEARNING

Cited by: 0
Authors
Park, Kyu Ri [1 ]
Oh, Youngmin [1 ]
Kim, Jung Uk [2 ]
Affiliations
[1] Kyung Hee Univ, Dept Artificial Intelligence, Seoul, South Korea
[2] Kyung Hee Univ, Dept Comp Sci & Engn, Seoul, South Korea
Funding
National Research Foundation, Singapore
Keywords
Missing modality; trans-modal association; audio-visual question answering; memory network;
DOI
10.1109/ICASSP48485.2024.10446292
Abstract
We present a novel method for Audio-Visual Question Answering (AVQA) in real-world scenarios where one modality (audio or visual) can be missing. Inspired by human cognitive processes, we introduce a Trans-Modal Associative (TMA) memory that recalls missing modal information (i.e., a pseudo modal feature) by establishing associations between the available modal features and textual cues. During the training phase, we employ a Trans-Modal Recalling (TMR) loss to guide the TMA memory toward generating a pseudo modal feature that closely matches the real modal feature. This allows our method to answer questions robustly even when one modality is missing during inference. We believe that our approach, which effectively copes with missing modalities, can be broadly applied to a variety of multimodal applications.
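The mechanism the abstract describes (recalling a pseudo feature for the missing modality from the available modality and textual cues, trained so the recalled feature matches the real one) can be sketched as a toy key-value memory. This is a minimal illustration, not the paper's architecture: the additive query fusion, the slot count, and the MSE form of the recalling loss are all simplifying assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax for soft memory addressing.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class TransModalAssociativeMemory:
    """Toy key-value memory: recalls a pseudo feature for a missing
    modality from the available modal feature and a textual cue.
    (Illustrative stand-in for the paper's TMA memory.)"""

    def __init__(self, num_slots, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.keys = rng.standard_normal((num_slots, dim))    # addressed by available cues
        self.values = rng.standard_normal((num_slots, dim))  # missing-modality prototypes

    def recall(self, avail_feat, text_feat):
        query = avail_feat + text_feat       # fuse available cues (simplified)
        attn = softmax(query @ self.keys.T)  # soft attention over memory slots
        return attn @ self.values            # pseudo modal feature

def tmr_loss(pseudo, real):
    """Recalling loss pulling the pseudo feature toward the real modal
    feature; MSE is assumed here for simplicity."""
    return float(np.mean((pseudo - real) ** 2))
```

At training time the real modal feature is available as a target for `tmr_loss`; at inference the memory's recalled output substitutes for the missing modality's encoder.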
Pages: 5755-5759 (5 pages)