ENHANCING AUDIO-VISUAL QUESTION ANSWERING WITH MISSING MODALITY VIA TRANS-MODAL ASSOCIATIVE LEARNING

Cited by: 0

Authors:

Park, Kyu Ri [1]

Oh, Youngmin [1]

Kim, Jung Uk [2]

Affiliations:

[1] Kyung Hee Univ, Dept Artificial Intelligence, Seoul, South Korea

[2] Kyung Hee Univ, Dept Comp Sci & Engn, Seoul, South Korea

Source:

2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, ICASSP 2024 | 2024

Funding:

National Research Foundation of Singapore;

Keywords:

Missing modality; trans-modal association; audio-visual question answering; memory network;

DOI:

10.1109/ICASSP48485.2024.10446292

Abstract:

We present a novel method for Audio-Visual Question Answering (AVQA) in real-world scenarios where one modality (audio or visual) can be missing. Inspired by human cognitive processes, we introduce a Trans-Modal Associative (TMA) memory that recalls missing modal information (i.e., a pseudo modal feature) by establishing associations between available modal features and textual cues. During training, we employ a Trans-Modal Recalling (TMR) loss to guide the TMA memory in generating a pseudo modal feature that closely matches the real modal feature. This allows our method to robustly answer the question even when one modality is missing during inference. We believe that our approach, which effectively copes with missing modalities, can be broadly applied to a variety of multimodal applications.
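
The recall idea in the abstract can be illustrated with a small sketch. The abstract does not specify the memory layout or the form of the loss, so everything below is an assumption made for illustration: a key-value memory addressed by softmax attention, a query built by concatenating the available modal feature with the text feature, and mean squared error standing in for the TMR loss. The names `AssociativeMemory`, `recall`, and `recall_loss` are hypothetical and are not taken from the paper. Only the forward pass is shown; no parameter update is performed.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class AssociativeMemory:
    """Hypothetical key-value memory (an assumption, not the paper's design)."""

    def __init__(self, num_slots, query_dim, value_dim, seed=0):
        rng = random.Random(seed)
        # Keys are compared with the query; values are mixed into the output.
        self.keys = [[rng.gauss(0, 0.1) for _ in range(query_dim)]
                     for _ in range(num_slots)]
        self.values = [[rng.gauss(0, 0.1) for _ in range(value_dim)]
                       for _ in range(num_slots)]

    def recall(self, available_feat, text_feat):
        # Assumed query: available modal feature concatenated with the text feature.
        query = available_feat + text_feat
        weights = softmax([dot(query, key) for key in self.keys])
        value_dim = len(self.values[0])
        # Pseudo feature = attention-weighted sum of the memory values.
        pseudo = [sum(w * v[d] for w, v in zip(weights, self.values))
                  for d in range(value_dim)]
        return pseudo, weights


def recall_loss(pseudo_feat, real_feat):
    """Mean squared error, used here as a stand-in for a recalling loss."""
    return sum((p - r) ** 2 for p, r in zip(pseudo_feat, real_feat)) / len(real_feat)


# Toy example: audio is "missing", so it is recalled from visual + text features.
memory = AssociativeMemory(num_slots=4, query_dim=5, value_dim=3)
visual_feat = [0.2, -0.1, 0.4]       # available modality (made-up values)
text_feat = [0.3, 0.1]               # question feature (made-up values)
real_audio_feat = [0.0, 0.5, -0.2]   # only available during training
pseudo_audio_feat, weights = memory.recall(visual_feat, text_feat)
loss = recall_loss(pseudo_audio_feat, real_audio_feat)
```

In a trainable version of this sketch, the keys and values would be learned by minimizing `loss` while both modalities are present, so that `recall` alone can supply the missing feature at inference time.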

Pages: 5755-5759

Page count: 5
