Modality-Fusion Spiking Transformer Network for Audio-Visual Zero-Shot Learning

Cited by: 5
Authors
Li, Wenrui [1 ]
Ma, Zhengyu [2 ]
Deng, Liang-Jian [3 ]
Man, Hengyu [1 ]
Fan, Xiaopeng [1 ]
Affiliations
[1] Harbin Inst Technol, Harbin, Peoples R China
[2] Peng Cheng Lab, Shenzhen, Peoples R China
[3] Univ Elect Sci & Technol China, Chengdu, Sichuan, Peoples R China
Source
2023 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2023
Funding
National Key R&D Program of China; National Natural Science Foundation of China
Keywords
audio-visual learning; spiking neural network; transformer;
DOI
10.1109/ICME55011.2023.00080
CLC Number
TP18 [Theory of Artificial Intelligence]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Audio-visual zero-shot learning (ZSL), which learns to classify video data from classes not observed during training, is challenging. In audio-visual ZSL, the semantic and temporal information carried by the different modalities is mutually relevant. However, effectively extracting and fusing information from the audio and visual modalities remains an open challenge. In this work, we propose an Audio-Visual Modality-fusion Spiking Transformer network (AVMST) for audio-visual ZSL. More specifically, AVMST comprises a spiking neural network (SNN) module that extracts conspicuous temporal information from each modality, a cross-attention block that effectively fuses the temporal and semantic information, and a transformer reasoning module that further explores the interrelationships of the fused features. To provide robust temporal features, the spiking threshold of the SNN module is adjusted dynamically based on the semantic cues of the different modalities. Because the proposed spiking transformer combines the robustness of SNN feature extraction with the precision of transformer feature inference, the generated feature map accords with the zero-shot learning property. Extensive experiments on three benchmark audio-visual datasets (i.e., VGGSound, UCF and ActivityNet) validate that the proposed AVMST outperforms existing state-of-the-art methods by a significant margin. The code and pre-trained models are available at https://github.com/liwr-hit/ICME23_AVMST.
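For readers who want a concrete picture of the pipeline the abstract describes, below is a minimal PyTorch sketch of an AVMST-style forward pass: a leaky integrate-and-fire (LIF) layer whose firing threshold is shifted by a semantic cue, a cross-attention fusion step, and a small transformer reasoning stage. All module names, dimensions, and the exact threshold rule here are assumptions made for illustration, not the authors' implementation; the real code is at the GitHub link above.

# Minimal sketch of an AVMST-style forward pass. Module names, dimensions,
# and the threshold rule are hypothetical; see the repository above for the
# authors' actual implementation.
import torch
import torch.nn as nn

class DynamicThresholdLIF(nn.Module):
    # Leaky integrate-and-fire layer whose firing threshold is modulated by a
    # semantic cue vector, mirroring the abstract's "dynamically adjusted
    # spiking threshold" (exact rule assumed). Training would need a surrogate
    # gradient for the hard step function; this inference-only sketch omits it.
    def __init__(self, dim, tau=2.0):
        super().__init__()
        self.tau = tau
        self.to_shift = nn.Linear(dim, 1)  # semantic cue -> threshold shift

    def forward(self, x, cue):
        # x: (batch, time, dim) per-frame features; cue: (batch, dim) semantics.
        threshold = 1.0 + torch.sigmoid(self.to_shift(cue))  # (batch, 1)
        mem = torch.zeros_like(x[:, 0])
        spikes = []
        for t in range(x.size(1)):
            mem = mem + (x[:, t] - mem) / self.tau   # leaky integration
            spike = (mem >= threshold).float()       # fire above threshold
            mem = mem * (1.0 - spike)                # hard reset after firing
            spikes.append(spike)
        return torch.stack(spikes, dim=1)            # binary spike trains

class AVMSTSketch(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.snn_audio = DynamicThresholdLIF(dim)
        self.snn_video = DynamicThresholdLIF(dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.reasoner = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True),
            num_layers=2)

    def forward(self, audio, video, cue_audio, cue_video):
        sa = self.snn_audio(audio, cue_audio)   # spiking temporal audio feats
        sv = self.snn_video(video, cue_video)   # spiking temporal visual feats
        fused, _ = self.cross_attn(sa, sv, sv)  # audio queries attend to video
        return self.reasoner(fused)             # reasoning over fused features

audio = torch.randn(2, 8, 256)   # (batch, frames, dim) toy inputs
video = torch.randn(2, 8, 256)
cue = torch.randn(2, 256)        # e.g. class-text embeddings as semantic cues
out = AVMSTSketch()(audio, video, cue, cue)
print(out.shape)                 # torch.Size([2, 8, 256])

One caveat on the fusion step: whether cross-attention runs audio-to-video, video-to-audio, or both directions is a detail of the paper this sketch does not settle; the single direction above is just the simplest instance.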
Pages: 426-431
Number of pages: 6