Modality-Fusion Spiking Transformer Network for Audio-Visual Zero-Shot Learning

Citations: 5
Authors
Li, Wenrui [1 ]
Ma, Zhengyu [2 ]
Deng, Liang-Jian [3 ]
Man, Hengyu [1 ]
Fan, Xiaopeng [1 ]
Affiliations
[1] Harbin Inst Technol, Harbin, Peoples R China
[2] Peng Cheng Lab, Shenzhen, Peoples R China
[3] Univ Elect Sci & Technol China, Chengdu, Peoples R China
Source
2023 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2023
Funding
National Key R&D Program of China; National Natural Science Foundation of China
Keywords
audio-visual learning; spiking neural network; transformer;
DOI
10.1109/ICME55011.2023.00080
Chinese Library Classification (CLC)
TP18 [Theory of artificial intelligence]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Audio-visual zero-shot learning (ZSL), which learns to classify video data from classes not observed during training, is challenging. In audio-visual ZSL, the semantic and temporal information of the different modalities is mutually relevant; however, effectively extracting and fusing information from the audio and visual streams remains an open challenge. In this work, we propose an Audio-Visual Modality-fusion Spiking Transformer network (AVMST) for audio-visual ZSL. Specifically, AVMST comprises a spiking neural network (SNN) module that extracts salient temporal information from each modality, a cross-attention block that fuses the temporal and semantic information, and a transformer reasoning module that further explores the interrelationships of the fused features. To provide robust temporal features, the spiking threshold of the SNN module is adjusted dynamically based on the semantic cues of the different modalities. Because the proposed spiking transformer combines the robustness of SNN feature extraction with the precision of transformer feature inference, the generated feature maps are well suited to the zero-shot setting. Extensive experiments on three benchmark audio-visual datasets (i.e., VGGSound, UCF, and ActivityNet) validate that the proposed AVMST outperforms existing state-of-the-art methods by a significant margin. The code and pre-trained models are available at https://github.com/liwr-hit/ICME23_AVMST.
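The abstract's key mechanism is a spiking module whose firing threshold is modulated by semantic cues. The following is a minimal NumPy sketch of that idea only, not the paper's implementation (see the linked repository for the real AVMST code): a leaky integrate-and-fire (LIF) neuron whose threshold is raised or lowered by a hypothetical scalar `semantic_cue`, so that semantically stronger evidence gates how readily spikes are emitted. All names and constants here are illustrative assumptions.

```python
import numpy as np

def lif_dynamic_threshold(inputs, semantic_cue, base_threshold=1.0,
                          decay=0.9, alpha=0.5):
    """Leaky integrate-and-fire layer with a semantically modulated threshold.

    inputs:       array of shape (T, N) -- input current per time step and neuron
    semantic_cue: scalar cue (hypothetical stand-in for AVMST's semantic signal)
    Returns a binary spike train of shape (T, N).
    """
    threshold = base_threshold + alpha * semantic_cue  # dynamic threshold
    v = np.zeros(inputs.shape[1])                      # membrane potentials
    spikes = []
    for x_t in inputs:
        v = decay * v + x_t                       # leaky integration
        s = (v >= threshold).astype(float)        # fire where threshold is crossed
        v = v * (1.0 - s)                         # hard reset after a spike
        spikes.append(s)
    return np.stack(spikes)
```

With a constant input current, a larger semantic cue raises the threshold and thus produces a sparser spike train, which is the qualitative behavior the abstract attributes to the dynamic-threshold SNN module.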
Pages: 426-431 (6 pages)