Modality-Fusion Spiking Transformer Network for Audio-Visual Zero-Shot Learning

Cited by: 5
Authors
Li, Wenrui [1]
Ma, Zhengyu [2]
Deng, Liang-Jian [3]
Man, Hengyu [1]
Fan, Xiaopeng [1]
Affiliations
[1] Harbin Inst Technol, Harbin, Peoples R China
[2] Peng Cheng Lab, Shenzhen, Peoples R China
[3] Univ Elect Sci & Technol China, Hefei, Anhui, Peoples R China
Funding
National Key Research and Development Program of China; National Natural Science Foundation of China
Keywords
audio-visual learning; spiking neural network; transformer
DOI
10.1109/ICME55011.2023.00080
Chinese Library Classification number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Audio-visual zero-shot learning (ZSL), which learns to classify video data from classes not observed during training, is challenging. In audio-visual ZSL, the semantic and temporal information from the different modalities is mutually relevant. However, effectively extracting and fusing information from the audio and visual modalities remains an open challenge. In this work, we propose an Audio-Visual Modality-fusion Spiking Transformer network (AVMST) for audio-visual ZSL. Specifically, AVMST provides a spiking neural network (SNN) module for extracting conspicuous temporal information from each modality, a cross-attention block to effectively fuse the temporal and semantic information, and a transformer reasoning module to further explore the interrelationships of the fused features. To provide robust temporal features, the spiking threshold of the SNN module is adjusted dynamically based on the semantic cues of the different modalities. The generated feature map is in accordance with the zero-shot learning property thanks to the proposed spiking transformer's ability to combine the robustness of SNN feature extraction with the precision of transformer feature inference. Extensive experiments on three benchmark audio-visual datasets (i.e., VGGSound, UCF and ActivityNet) validate that the proposed AVMST outperforms existing state-of-the-art methods by a significant margin. The code and pre-trained models are available at https://github.com/liwr-hit/ICME23_AVMST.
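The abstract's idea of a spiking threshold that shifts with semantic cues can be illustrated with a minimal leaky integrate-and-fire sketch. This is not the AVMST implementation: the function name `lif_spikes`, the scalar `cue`, the `gain` factor, and the linear threshold rule are all assumptions made here for illustration only; the authors' actual rule is in their repository.

```python
def lif_spikes(inputs, base_threshold=1.0, decay=0.5, cue=0.0, gain=0.5):
    """Leaky integrate-and-fire neuron over a 1-D input sequence.

    The firing threshold is base_threshold * (1 + gain * cue). `cue` is a
    hypothetical scalar standing in for a semantic signal; this linear rule
    is an assumption, not the rule used in the paper.
    """
    threshold = base_threshold * (1.0 + gain * cue)
    membrane = 0.0
    spikes = []
    for x in inputs:
        # Leak the previous membrane potential, then integrate the new input.
        membrane = decay * membrane + x
        if membrane >= threshold:
            spikes.append(1)
            membrane = 0.0  # hard reset after a spike
        else:
            spikes.append(0)
    return spikes


if __name__ == "__main__":
    signal = [0.6, 0.6, 0.6, 0.6]
    for cue in (-1.0, 0.0, 1.0):
        print(cue, lif_spikes(signal, cue=cue))
```

In this toy setting, a positive `cue` raises the threshold and suppresses spikes, while a negative `cue` lowers it and makes the neuron fire more readily, which is one simple way a semantic signal could gate temporal features.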
Pages: 426-431
Number of pages: 6
Related papers
50 in total
  • [1] Spiking Tucker Fusion Transformer for Audio-Visual Zero-Shot Learning
    Li, Wenrui
    Wang, Penghong
    Xiong, Ruiqin
    Fan, Xiaopeng
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2024, 33 : 4840 - 4852
  • [2] Motion-Decoupled Spiking Transformer for Audio-Visual Zero-Shot Learning
    Li, Wenrui
    Zhao, Xi-Le
    Ma, Zhengyu
    Wang, Xingtao
    Fan, Xiaopeng
    Tian, Yonghong
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 3994 - 4002
  • [3] Multi-modal spiking tensor regression network for audio-visual zero-shot learning
    Yang, Zhe
    Li, Wenrui
    Hou, Jinxiu
    Cheng, Guanghui
    NEUROCOMPUTING, 2025, 629
  • [4] Hyperbolic Audio-visual Zero-shot Learning
    Hong, Jie
    Hayder, Zeeshan
    Han, Junlin
    Fang, Pengfei
    Harandi, Mehrtash
    Petersson, Lars
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 7839 - 7849
  • [5] Learning semantic consistency for audio-visual zero-shot learning
    Xiaoyong Li
    Jing Yang
    Yuling Chen
    Wei Zhang
    Xiaoli Ruan
    Chengjiang Li
    Zhidong Su
    Artificial Intelligence Review, 58 (7)
  • [6] Temporal-Semantic Aligning and Reasoning Transformer for Audio-Visual Zero-Shot Learning
    Zhang, Kaiwen
    Zhao, Kunchen
    Tian, Yunong
    MATHEMATICS, 2024, 12 (14)
  • [7] Audio-Visual Generalized Zero-Shot Learning the Easy Way
    Mo, Shentong
    Morgado, Pedro
    COMPUTER VISION - ECCV 2024, PT LXXI, 2025, 15129 : 377 - 395
  • [8] Audio-Visual Generalized Zero-Shot Learning Based on Variational Information Bottleneck
    Li, Yapeng
    Luo, Yong
    Du, Bo
    2023 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, ICME, 2023, : 450 - 455
  • [9] Temporal and Cross-modal Attention for Audio-Visual Zero-Shot Learning
    Mercea, Otniel-Bogdan
    Hummel, Thomas
    Koepke, A. Sophia
    Akata, Zeynep
    COMPUTER VISION, ECCV 2022, PT XX, 2022, 13680 : 488 - 505
  • [10] Object-Aware Image Augmentation for Audio-Visual Zero-Shot Learning
    Dong, Yujie
    Chen, Shiming
    Duan, Bowen
    Ding, Weiping
    Wang, Yisong
    You, Xinge
    IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTATIONAL INTELLIGENCE, 2024,