Modality-Fusion Spiking Transformer Network for Audio-Visual Zero-Shot Learning

Cited by: 5

Authors:

Li, Wenrui [1]

Ma, Zhengyu [2]

Deng, Liang-Jian [3]

Man, Hengyu [1]

Fan, Xiaopeng [1]

Affiliations:

[1] Harbin Inst Technol, Harbin, Peoples R China

[2] Peng Cheng Lab, Shenzhen, Peoples R China

[3] Univ Elect Sci & Technol China, Hefei, Anhui, Peoples R China

Source:

2023 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, ICME | 2023

Funding:

National Key Research and Development Program of China; National Natural Science Foundation of China

Keywords:

audio-visual learning; spiking neural network; transformer

DOI:

10.1109/ICME55011.2023.00080

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Audio-visual zero-shot learning (ZSL), which learns to classify video data from classes not observed during training, is challenging. In audio-visual ZSL, the semantic and temporal information of the different modalities are closely related. However, effectively extracting and fusing information from the audio and visual modalities remains an open challenge. In this work, we propose an Audio-Visual Modality-fusion Spiking Transformer network (AVMST) for audio-visual ZSL. Specifically, AVMST provides a spiking neural network (SNN) module for extracting conspicuous temporal information from each modality, a cross-attention block to fuse the temporal and semantic information, and a transformer reasoning module to further explore the interrelationships among the fused features. To provide robust temporal features, the spiking threshold of the SNN module is adjusted dynamically based on the semantic cues of the different modalities. The generated feature map accords with the zero-shot learning property because the proposed spiking transformer combines the robustness of SNN feature extraction with the precision of transformer feature inference. Extensive experiments on three benchmark audio-visual datasets (i.e., VGGSound, UCF and ActivityNet) validate that the proposed AVMST outperforms existing state-of-the-art methods by a significant margin. The code and pre-trained models are available at https://github.com/liwr-hit/ICME23_AVMST.
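
The idea of a spiking threshold that shifts with a semantic cue can be illustrated with a toy leaky integrate-and-fire (LIF) neuron. The sketch below is not taken from the AVMST code: the function name `lif_spikes`, the sigmoid modulation rule, and the parameters `base_threshold`, `decay` and `gain` are all assumptions chosen only to show the general mechanism.

```python
import math


def lif_spikes(inputs, semantic_cue, base_threshold=1.0, decay=0.5, gain=0.5):
    """Toy leaky integrate-and-fire neuron with a cue-dependent threshold.

    ASSUMPTION: the threshold is scaled by a sigmoid of `semantic_cue`.
    This rule is illustrative only and is not the rule used in AVMST.
    """
    # Sigmoid maps the cue to (0, 1); a cue of 0 leaves the threshold unchanged.
    cue_gate = 1.0 / (1.0 + math.exp(-semantic_cue))
    threshold = base_threshold * (1.0 + gain * (cue_gate - 0.5))

    membrane = 0.0
    spikes = []
    for x in inputs:
        # Leak the previous potential, then integrate the new input.
        membrane = decay * membrane + x
        if membrane >= threshold:
            spikes.append(1)
            membrane = 0.0  # hard reset after a spike
        else:
            spikes.append(0)
    return spikes, threshold


if __name__ == "__main__":
    signal = [0.6, 0.6, 0.6, 0.6]
    print(lif_spikes(signal, semantic_cue=0.0))
    print(lif_spikes(signal, semantic_cue=10.0))
```

In this toy setting, a neutral cue keeps the threshold at its base value and the neuron fires once the leaky sum of inputs crosses it, while a large positive cue raises the threshold enough that the same input no longer produces a spike.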

Pages: 426-431

Page count: 6
