ENHANCING AUDIO-VISUAL QUESTION ANSWERING WITH MISSING MODALITY VIA TRANS-MODAL ASSOCIATIVE LEARNING

Cited by: 0
Authors
Park, Kyu Ri [1 ]
Oh, Youngmin [1 ]
Kim, Jung Uk [2 ]
Affiliations
[1] Kyung Hee Univ, Dept Artificial Intelligence, Seoul, South Korea
[2] Kyung Hee Univ, Dept Comp Sci & Engn, Seoul, South Korea
Funding
National Research Foundation, Singapore
Keywords
Missing modality; trans-modal association; audio-visual question answering; memory network;
DOI
10.1109/ICASSP48485.2024.10446292
Abstract
We present a novel method for Audio-Visual Question Answering (AVQA) in real-world scenarios where one modality (audio or visual) can be missing. Inspired by human cognitive processes, we introduce a Trans-Modal Associative (TMA) memory that recalls missing modal information (i.e., a pseudo modal feature) by establishing associations between the available modal features and textual cues. During the training phase, we employ a Trans-Modal Recalling (TMR) loss to guide the TMA memory toward generating a pseudo modal feature that closely matches the real modal feature. This allows our method to answer questions robustly even when one modality is missing during inference. We believe that our approach, which effectively copes with missing modalities, can be broadly applied to a variety of multimodal applications.
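The mechanism the abstract describes (recalling a pseudo feature for the missing modality from the available modality and textual cues, trained so the recalled feature matches the real one) can be sketched as a toy key-value memory. This is a minimal illustration, not the paper's architecture: the additive query fusion, the slot count, and the MSE form of the recalling loss are all simplifying assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax for soft memory addressing.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class TransModalAssociativeMemory:
    """Toy key-value memory: recalls a pseudo feature for a missing
    modality from the available modal feature and a textual cue.
    (Illustrative stand-in for the paper's TMA memory.)"""

    def __init__(self, num_slots, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.keys = rng.standard_normal((num_slots, dim))    # addressed by available cues
        self.values = rng.standard_normal((num_slots, dim))  # missing-modality prototypes

    def recall(self, avail_feat, text_feat):
        query = avail_feat + text_feat       # fuse available cues (simplified)
        attn = softmax(query @ self.keys.T)  # soft attention over memory slots
        return attn @ self.values            # pseudo modal feature

def tmr_loss(pseudo, real):
    """Recalling loss pulling the pseudo feature toward the real modal
    feature; MSE is assumed here for simplicity."""
    return float(np.mean((pseudo - real) ** 2))
```

At training time the real modal feature is available as a target for `tmr_loss`; at inference the memory's recalled output substitutes for the missing modality's encoder.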
Pages: 5755-5759 (5 pages)