Modality-Fusion Spiking Transformer Network for Audio-Visual Zero-Shot Learning

Cited by: 5
Authors
Li, Wenrui [1 ]
Ma, Zhengyu [2 ]
Deng, Liang-Jian [3 ]
Man, Hengyu [1 ]
Fan, Xiaopeng [1 ]
Affiliations
[1] Harbin Inst Technol, Harbin, Peoples R China
[2] Peng Cheng Lab, Shenzhen, Peoples R China
[3] Univ Elect Sci & Technol China, Chengdu, Peoples R China
Source
2023 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, ICME | 2023
Funding
National Key R&D Program of China; National Natural Science Foundation of China;
Keywords
audio-visual learning; spiking neural network; transformer;
DOI
10.1109/ICME55011.2023.00080
CLC classification
TP18 [Artificial Intelligence Theory]
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Audio-visual zero-shot learning (ZSL), which aims to classify video data from classes not observed during training, is challenging. In audio-visual ZSL, the semantic and temporal information of the two modalities are closely interrelated, yet effectively extracting and fusing information from the audio and visual streams remains an open challenge. In this work, we propose an Audio-Visual Modality-fusion Spiking Transformer network (AVMST) for audio-visual ZSL. Specifically, AVMST comprises a spiking neural network (SNN) module that extracts salient temporal information from each modality, a cross-attention block that fuses the temporal and semantic information, and a transformer reasoning module that further explores the interrelationships of the fused features. To provide robust temporal features, the spiking threshold of the SNN module is adjusted dynamically based on the semantic cues of the different modalities. Because the proposed spiking transformer combines the robustness of SNN feature extraction with the precision of transformer feature inference, the generated feature maps are well suited to the zero-shot setting. Extensive experiments on three benchmark audio-visual datasets (i.e., VGGSound, UCF and ActivityNet) validate that the proposed AVMST outperforms existing state-of-the-art methods by a significant margin. The code and pre-trained models are available at https://github.com/liwr-hit/ICME23_AVMST.
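Illustrative sketch
As a rough aid to reading the abstract, the PyTorch sketch below shows one plausible realization of two mechanisms the paper names: an LIF-style spiking layer whose firing threshold is modulated by a semantic cue, and a cross-attention block that fuses the audio and visual streams. All names, shapes, and design choices here (the sigmoid threshold offset, hard-reset dynamics, bidirectional multi-head attention) are assumptions made for illustration, not the authors' implementation; the released code at the repository above is authoritative.

import torch
import torch.nn as nn

class DynamicThresholdLIF(nn.Module):
    """Leaky integrate-and-fire layer with a semantically modulated firing
    threshold (a hypothetical reading of the paper's dynamic threshold)."""
    def __init__(self, dim, tau=2.0, base_threshold=1.0):
        super().__init__()
        self.tau = tau
        self.base_threshold = base_threshold
        # hypothetical: map a semantic cue to a per-channel threshold offset
        self.cue_to_offset = nn.Linear(dim, dim)

    def forward(self, x, cue):
        # x: (T, B, D) time-major features; cue: (B, D) semantic embedding
        threshold = self.base_threshold + torch.sigmoid(self.cue_to_offset(cue))
        mem = torch.zeros_like(x[0])
        spikes = []
        for t in range(x.shape[0]):
            mem = mem + (x[t] - mem) / self.tau  # leaky integration
            spike = (mem >= threshold).float()   # fire (training needs a surrogate gradient)
            mem = mem * (1.0 - spike)            # hard reset of fired units
            spikes.append(spike)
        return torch.stack(spikes)               # (T, B, D) spike trains

class CrossModalFusion(nn.Module):
    """Generic cross-attention fusion: each modality queries the other,
    then the two attended streams are concatenated and projected."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, audio, visual):
        # audio, visual: (B, T, D) temporal features from the SNN stage
        a, _ = self.a2v(audio, visual, visual)   # audio attends to visual
        v, _ = self.v2a(visual, audio, audio)    # visual attends to audio
        return self.proj(torch.cat([a, v], dim=-1))

# toy usage: 8 time steps, batch 2, feature dim 64
T, B, D = 8, 2, 64
lif, fuse = DynamicThresholdLIF(D), CrossModalFusion(D)
a = lif(torch.randn(T, B, D), torch.randn(B, D)).permute(1, 0, 2)
v = lif(torch.randn(T, B, D), torch.randn(B, D)).permute(1, 0, 2)
fused = fuse(a, v)  # (B, T, D)

The fused (B, T, D) features would then feed the transformer reasoning module the abstract describes; note that the hard spike threshold above is non-differentiable, so any real training loop would substitute a surrogate gradient.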
Pages: 426 - 431
Page count: 6
Related Papers
50 in total (entries [41]-[50] shown)
  • [41] Joint Visual and Semantic Optimization for zero-shot learning
    Wu, Hanrui
    Yan, Yuguang
    Chen, Sentao
    Huang, Xiangkang
    Wu, Qingyao
    Ng, Michael K.
    KNOWLEDGE-BASED SYSTEMS, 2021, 215
  • [42] TransZero: Attribute-Guided Transformer for Zero-Shot Learning
    Chen, Shiming
    Hong, Ziming
    Liu, Yang
    Xie, Guo-Sen
    Sun, Baigui
    Li, Hao
    Peng, Qinmu
    Lu, Ke
    You, Xinge
    THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / THE TWELFTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 330 - 338
  • [43] Dense Modality Interaction Network for Audio-Visual Event Localization
    Liu, Shuo
    Quan, Weize
    Wang, Chaoqun
    Liu, Yuan
    Liu, Bin
    Yan, Dong-Ming
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 2734 - 2748
  • [44] Hyperbolic Visual Embedding Learning for Zero-Shot Recognition
    Liu, Shaoteng
    Chen, Jingjing
    Pan, Liangming
    Ngo, Chong-Wah
    Chua, Tat-Seng
    Jiang, Yu-Gang
    2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2020), 2020, : 9270 - 9278
  • [45] Semantically Grounded Visual Embeddings for Zero-Shot Learning
    Nawaz, Shah
    Cavazza, Jacopo
    Del Bue, Alessio
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS, CVPRW 2022, 2022, : 4588 - 4598
  • [46] Zero-shot recognition with latent visual attributes learning
    Xie, Yurui
    He, Xiaohai
    Zhang, Jing
    Luo, Xiaodong
    MULTIMEDIA TOOLS AND APPLICATIONS, 2020, 79 : 27321 - 27335
  • [47] Learning unseen visual prototypes for zero-shot classification
    Li, Xiao
    Fang, Min
    Feng, Dazheng
    Li, Haikun
    Wu, Jinqiao
    KNOWLEDGE-BASED SYSTEMS, 2018, 160 : 176 - 187
  • [48] Transductive Zero-Shot Learning with Visual Structure Constraint
    Wan, Ziyu
    Chen, Dongdong
    Li, Yan
    Yan, Xingguang
    Zhang, Junge
    Yu, Yizhou
    Liao, Jing
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 32 (NIPS 2019), 2019, 32
  • [49] Learning Modality-Invariant Latent Representations for Generalized Zero-shot Learning
    Li, Jingjing
    Jing, Mengmeng
    Zhu, Lei
    Ding, Zhengming
    Lu, Ke
    Yang, Yang
    MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, 2020, : 1348 - 1356
  • [50] Few-Shot Audio-Visual Learning of Environment Acoustics
    Majumder, Sagnik
    Chen, Changan
    Al-Halah, Ziad
    Grauman, Kristen
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022