ENHANCING AUDIO-VISUAL QUESTION ANSWERING WITH MISSING MODALITY VIA TRANS-MODAL ASSOCIATIVE LEARNING

Cited by: 0
Authors
Park, Kyu Ri [1 ]
Oh, Youngmin [1 ]
Kim, Jung Uk [2 ]
Affiliations
[1] Kyung Hee Univ, Dept Artificial Intelligence, Seoul, South Korea
[2] Kyung Hee Univ, Dept Comp Sci & Engn, Seoul, South Korea
Funding
National Research Foundation, Singapore
Keywords
Missing modality; trans-modal association; audio-visual question answering; memory network
DOI
10.1109/ICASSP48485.2024.10446292
Abstract
We present a novel method for Audio-Visual Question Answering (AVQA) in real-world scenarios where one modality (audio or visual) may be missing. Inspired by human cognitive processes, we introduce a Trans-Modal Associative (TMA) memory that recalls the missing modal information (i.e., a pseudo modal feature) by establishing associations between the available modal features and textual cues. During the training phase, we employ a Trans-Modal Recalling (TMR) loss to guide the TMA memory in generating a pseudo modal feature that closely matches the real modal feature. This allows our method to answer questions robustly even when one modality is missing during inference. We believe that our approach, which effectively copes with missing modalities, can be broadly applied to a variety of multimodal applications.
Pages: 5755 - 5759
Number of pages: 5
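The abstract describes two components: a Trans-Modal Associative (TMA) memory that recalls a pseudo feature for the missing modality from the available modality and the textual cue, and a Trans-Modal Recalling (TMR) loss that pulls the recalled feature toward the real one during training. The paper provides no code, so the following is only a minimal sketch of that idea, assuming a learnable slot memory addressed by attention and a cosine-distance recalling loss; the names (TransModalAssociativeMemory, trans_modal_recalling_loss), dimensions, and the specific loss form are hypothetical illustrations, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TransModalAssociativeMemory(nn.Module):
    """Hypothetical sketch: recall a pseudo feature for a missing modality by
    attending over a learnable memory, using the available modality and the
    question (text) feature as cues."""

    def __init__(self, feat_dim=512, num_slots=64):
        super().__init__()
        # Learnable memory slots shared across modalities (assumption).
        self.memory = nn.Parameter(torch.randn(num_slots, feat_dim))
        self.query_proj = nn.Linear(2 * feat_dim, feat_dim)

    def forward(self, available_feat, text_feat):
        # Build a query from the available modal feature and the textual cue.
        query = self.query_proj(torch.cat([available_feat, text_feat], dim=-1))
        # Address the memory with scaled dot-product attention.
        attn = torch.softmax(query @ self.memory.t() / query.size(-1) ** 0.5, dim=-1)
        # Read out a pseudo feature standing in for the missing modality.
        pseudo_feat = attn @ self.memory
        return pseudo_feat


def trans_modal_recalling_loss(pseudo_feat, real_feat):
    """Hypothetical TMR loss: pull the recalled pseudo feature toward the real
    modal feature. The cosine form is an assumption; an L2 or contrastive
    objective would fit the same description."""
    return (1.0 - F.cosine_similarity(pseudo_feat, real_feat, dim=-1)).mean()


# Usage sketch: during training both modalities are available, so the recalled
# (pseudo) audio feature can be supervised by the real audio feature.
if __name__ == "__main__":
    B, D = 4, 512
    visual_feat = torch.randn(B, D)       # available modality
    text_feat = torch.randn(B, D)         # question embedding
    real_audio_feat = torch.randn(B, D)   # ground-truth modal feature

    tma = TransModalAssociativeMemory(feat_dim=D)
    pseudo_audio_feat = tma(visual_feat, text_feat)
    loss = trans_modal_recalling_loss(pseudo_audio_feat, real_audio_feat)
    loss.backward()
    print(f"TMR loss: {loss.item():.4f}")

In this sketch, training supervises the recalled feature with the real modal feature; at inference, the same read-out path would substitute for the missing modality's encoder output before answer prediction.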
Related Papers
50 records in total
  • [41] Yang, Xun; Zeng, Jianming; Guo, Dan; Wang, Shanshan; Dong, Jianfeng; Wang, Meng. Robust video question answering via contrastive cross-modality representation learning. SCIENCE CHINA-INFORMATION SCIENCES, 2024, 67 (10).
  • [42] Yang, Xun; Zeng, Jianming; Guo, Dan; Wang, Shanshan; Dong, Jianfeng; Wang, Meng. Robust video question answering via contrastive cross-modality representation learning. Science China (Information Sciences), 2024, 67 (10): 211 - 226.
  • [43] Li, Wenrui; Ma, Zhengyu; Deng, Liang-Jian; Man, Hengyu; Fan, Xiaopeng. Modality-Fusion Spiking Transformer Network for Audio-Visual Zero-Shot Learning. 2023 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, ICME, 2023: 426 - 431.
  • [44] Sun, Weixuan; Zhang, Jiayi; Wang, Jianyuan; Liu, Zheyuan; Zhong, Yiran; Feng, Tianpeng; Guo, Yandong; Zhang, Yanhao; Barnes, Nick. Learning Audio-Visual Source Localization via False Negative Aware Contrastive Learning. 2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR, 2023: 6420 - 6429.
  • [45] Sun, Weixuan; Zhang, Jiayi; Wang, Jianyuan; Liu, Zheyuan; Zhong, Yiran; Feng, Tianpeng; Guo, Yandong; Zhang, Yanhao; Barnes, Nick. Learning Audio-Visual Source Localization via False Negative Aware Contrastive Learning. arXiv, 2023.
  • [46] Wang, Yan; Li, Peize; Si, Qingyi; Zhang, Hanwen; Zang, Wenyu; Lin, Zheng; Fu, Peng. Cross-modality Multiple Relations Learning for Knowledge-based Visual Question Answering. ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2024, 20 (03).
  • [47] Zhang, Jiwei; Yu, Yi; Tang, Suhua; Qi, Guojun; Wu, Haiyuan; Hachiya, Hirotaka. Enhancing semantic audio-visual representation learning with supervised multi-scale attention. PATTERN ANALYSIS AND APPLICATIONS, 2025, 28 (02).
  • [48] Jiang, Yuanyuan; Yin, Jianqin. CLIP-Powered TASS: Target-Aware Single-Stream Network for Audio-Visual Question Answering. INTERNATIONAL JOURNAL OF COMPUTER VISION, 2024: 2581 - 2598.
  • [49] Mercea, Otniel-Bogdan; Riesch, Lukas; Koepke, A. Sophia; Akata, Zeynep. Audio-visual Generalised Zero-shot Learning with Cross-modal Attention and Language. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2022: 10543 - 10553.
  • [50] Zhu, Hao; Huang, Huaibo; Li, Yi; Zheng, Aihua; He, Ran. Arbitrary Talking Face Generation via Attentional Audio-Visual Coherence Learning. PROCEEDINGS OF THE TWENTY-NINTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2020: 2362 - 2368.