ENHANCING AUDIO-VISUAL QUESTION ANSWERING WITH MISSING MODALITY VIA TRANS-MODAL ASSOCIATIVE LEARNING

Cited by: 0
Authors
Park, Kyu Ri [1 ]
Oh, Youngmin [1 ]
Kim, Jung Uk [2 ]
Affiliations
[1] Kyung Hee Univ, Dept Artificial Intelligence, Seoul, South Korea
[2] Kyung Hee Univ, Dept Comp Sci & Engn, Seoul, South Korea
Funding
National Research Foundation of Singapore;
Keywords
Missing modality; trans-modal association; audio-visual question answering; memory network;
DOI
10.1109/ICASSP48485.2024.10446292
Abstract
We present a novel method for Audio-Visual Question Answering (AVQA) in real-world scenarios where one modality (audio or visual) can be missing. Inspired by human cognitive processes, we introduce a Trans-Modal Associative (TMA) memory that recalls missing modal information (i.e., a pseudo modal feature) by establishing associations between the available modal features and textual cues. During the training phase, we employ a Trans-Modal Recalling (TMR) loss to guide the TMA memory in generating a pseudo modal feature that closely matches the real modal feature. This allows our method to answer questions robustly even when one modality is missing during inference. We believe that our approach, which effectively copes with missing modalities, can be broadly applied to a variety of multimodal applications.
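The abstract's recall-then-match idea can be sketched as a key-value memory read followed by a feature-matching loss. The sketch below is our own minimal illustration, not the authors' implementation: the class name, the additive query construction, and the MSE surrogate for the TMR loss are all assumptions made for clarity.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

class TransModalAssociativeMemory:
    """Hypothetical key-value memory: given features of the available
    modality plus textual cues, it reads out a pseudo feature standing
    in for the missing modality."""
    def __init__(self, num_slots, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.keys = rng.standard_normal((num_slots, dim))
        self.values = rng.standard_normal((num_slots, dim))

    def recall(self, available_feat, text_feat):
        # Form a query from the available modal feature and the textual
        # cue (simple sum here; the paper's fusion may differ), then
        # attend over memory slots to produce the pseudo modal feature.
        query = available_feat + text_feat
        attn = softmax(self.keys @ query)   # (num_slots,)
        return attn @ self.values           # (dim,)

def tmr_loss(pseudo_feat, real_feat):
    # Trans-Modal Recalling loss surrogate: pull the recalled pseudo
    # feature toward the real modal feature (MSE is our assumption).
    return float(np.mean((pseudo_feat - real_feat) ** 2))
```

At training time both modalities are present, so the loss can supervise the memory; at inference the `recall` step alone substitutes for the missing modality.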
Pages: 5755 - 5759
Page count: 5
Related Papers
50 records in total
  • [21] ENHANCING CONTRASTIVE LEARNING WITH TEMPORAL COGNIZANCE FOR AUDIO-VISUAL REPRESENTATION GENERATION
    Lavania, Chandrashekhar
    Sundaram, Shiva
    Srinivasan, Sundararajan
    Kirchhoff, Katrin
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 4728 - 4732
  • [22] Enhancing Visual Question Answering via Deconstructing Questions and Explicating Answers
    Chen, Feilong
    Han, Minglun
    Shi, Jing
    Xu, Shuang
    Xu, Bo
    INTERSPEECH 2023, 2023, : 3447 - 3451
  • [23] Tackling Missing Modalities in Audio-Visual Representation Learning Using Masked Autoencoders
    Chochlakis, Georgios
    Lavania, Chandrashekhar
    Mathur, Prashant
    Han, Kyu J.
    INTERSPEECH 2024, 2024, : 4678 - 4682
  • [24] Adversarial-Metric Learning for Audio-Visual Cross-Modal Matching
    Zheng, Aihua
    Hu, Menglan
    Jiang, Bo
    Huang, Yan
    Yan, Yan
    Luo, Bin
    IEEE TRANSACTIONS ON MULTIMEDIA, 2022, 24 : 338 - 351
  • [25] SCLAV: Supervised Cross-modal Contrastive Learning for Audio-Visual Coding
    Sun, Chao
    Chen, Min
    Cheng, Jialiang
    Liang, Han
    Zhu, Chuanbo
    Chen, Jincai
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 261 - 270
  • [26] Cross-Modal Mutual Learning for Audio-Visual Speech Recognition and Manipulation
    Yang, Chih-Chun
    Fan, Wan-Cyuan
    Yang, Cheng-Fu
    Wang, Yu-Chiang Frank
THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / THE TWELFTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 3036 - 3044
  • [27] A NOVEL DISTANCE LEARNING FOR ELASTIC CROSS-MODAL AUDIO-VISUAL MATCHING
Wang, Rui
    Huang, Huaibo
    Zhang, Xufeng
    Ma, Jixin
    Zheng, Aihua
    2019 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA & EXPO WORKSHOPS (ICMEW), 2019, : 300 - 305
  • [28] LEARNING AUDIO-VISUAL CORRELATIONS FROM VARIATIONAL CROSS-MODAL GENERATION
    Zhu, Ye
    Wu, Yu
    Latapie, Hugo
    Yang, Yi
    Yan, Yan
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 4300 - 4304
  • [29] Learning Modality-Invariant Features by Cross-Modality Adversarial Network for Visual Question Answering
    Fu, Ze
    Zheng, Changmeng
    Cai, Yi
    Li, Qing
    Wang, Tao
    WEB AND BIG DATA, APWEB-WAIM 2021, PT I, 2021, 12858 : 316 - 331
  • [30] Enhancing Visual Question Answering with Prompt-based Learning: A Cross-modal Approach for Deep Semantic Understanding
    Zhu, Shuaiyu
    Peng, Shuo
    Chen, Shengbo
    PROCEEDINGS OF INTERNATIONAL CONFERENCE ON ALGORITHMS, SOFTWARE ENGINEERING, AND NETWORK SECURITY, ASENS 2024, 2024, : 713 - 717