ENHANCING AUDIO-VISUAL QUESTION ANSWERING WITH MISSING MODALITY VIA TRANS-MODAL ASSOCIATIVE LEARNING

Citations: 0
Authors
Park, Kyu Ri [1]
Oh, Youngmin [1]
Kim, Jung Uk [2]
Affiliations
[1] Kyung Hee Univ, Dept Artificial Intelligence, Seoul, South Korea
[2] Kyung Hee Univ, Dept Comp Sci & Engn, Seoul, South Korea
Funding
National Research Foundation, Singapore
Keywords
Missing modality; trans-modal association; audio-visual question answering; memory network
DOI
10.1109/ICASSP48485.2024.10446292
Abstract
We present a novel method for Audio-Visual Question Answering (AVQA) in real-world scenarios where one modality (audio or visual) can be missing. Inspired by human cognitive processes, we introduce a Trans-Modal Associative (TMA) memory that recalls missing modal information (i.e., a pseudo modal feature) by establishing associations between available modal features and textual cues. During the training phase, we employ a Trans-Modal Recalling (TMR) loss to guide the TMA memory in generating a pseudo modal feature that closely matches the real modal feature. This allows our method to answer the question robustly, even when one modality is missing during inference. We believe that our approach, which effectively copes with missing modalities, can be broadly applied to a variety of multimodal applications.
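The recall-and-match idea in the abstract can be illustrated with a toy sketch. The abstract does not give the actual TMA memory architecture or the exact form of the TMR loss, so everything here is an assumption made for illustration: a key-value memory addressed by softmax attention, a query formed by concatenating the available modal feature with the text feature, and a mean-squared-error recalling loss. The function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def recall_pseudo_feature(available_feat, text_feat, keys, values):
    """Recall a pseudo feature for the missing modality (assumed design).

    The query concatenates the available modal feature with the text
    feature, memory slots are addressed by softmax attention over the
    keys, and the result is the attention-weighted sum of the values.
    """
    query = available_feat + text_feat  # list concatenation
    weights = softmax([dot(query, k) for k in keys])
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def recalling_loss(pseudo_feat, real_feat):
    """Mean squared error between pseudo and real features.

    This stands in for the TMR loss; the paper's exact loss may differ.
    """
    return sum((p - r) ** 2 for p, r in zip(pseudo_feat, real_feat)) / len(real_feat)


# Toy example: visual and text features are available, audio is missing.
visual = [1.0, 0.0]
text = [0.0, 1.0]
keys = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]  # one key per memory slot
values = [[1.0, 0.0], [0.0, 1.0]]                    # stored audio-like features
real_audio = [1.0, 0.0]                              # available only in training

pseudo_audio = recall_pseudo_feature(visual, text, keys, values)
loss = recalling_loss(pseudo_audio, real_audio)
print(pseudo_audio, loss)
```

In training, a loss of this kind would be minimized with respect to the memory contents so that the recalled feature approaches the real one. At inference, only `recall_pseudo_feature` would be needed when a modality is absent.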
Pages: 5755-5759
Page count: 5