Audio-visual zero-shot learning (ZSL), which aims to classify videos from classes not observed during training, is a challenging task. In audio-visual ZSL, the semantic and temporal information of the two modalities is closely interrelated, yet effectively extracting and fusing information from the audio and visual streams remains an open challenge. In this work, we propose an Audio-Visual Modality-fusion Spiking Transformer network (AVMST) for audio-visual ZSL. Specifically, AVMST comprises a spiking neural network (SNN) module that extracts salient temporal information from each modality, a cross-attention block that effectively fuses the temporal and semantic information, and a transformer reasoning module that further explores the interrelationships among the fused features. To obtain robust temporal features, the spiking threshold of the SNN module is adjusted dynamically according to the semantic cues of the different modalities. Because the proposed spiking transformer combines the robustness of SNN feature extraction with the precision of transformer feature inference, the generated feature maps are well suited to the zero-shot learning setting. Extensive experiments on three benchmark audio-visual datasets (i.e., VGGSound, UCF, and ActivityNet) show that the proposed AVMST outperforms existing state-of-the-art methods by a significant margin. The code and pre-trained models are available at https://github.com/liwr-hit/ICME23_AVMST.
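To make the described pipeline concrete, the following is a minimal PyTorch-style sketch of the three components mentioned above: a per-modality spiking encoder with a semantically modulated threshold, a cross-attention fusion block, and a transformer reasoning module. All module names, dimensions, and the simplified leaky integrate-and-fire formulation are illustrative assumptions, not the authors' implementation (see the repository linked above for the actual code).

    # Illustrative sketch only; names, dimensions, and the LIF formulation
    # are assumptions, not the authors' AVMST implementation.
    import torch
    import torch.nn as nn


    class DynamicThresholdLIF(nn.Module):
        """Leaky integrate-and-fire layer whose firing threshold is shifted
        by a semantic cue vector (a simplified stand-in for the paper's
        semantically adjusted spiking threshold)."""

        def __init__(self, dim, decay=0.5):
            super().__init__()
            self.decay = decay
            self.threshold_head = nn.Linear(dim, 1)  # semantic cue -> threshold shift

        def forward(self, x, semantic_cue):
            # x: (batch, time, dim); semantic_cue: (batch, dim)
            threshold = 1.0 + torch.sigmoid(self.threshold_head(semantic_cue))  # (batch, 1)
            membrane = torch.zeros_like(x[:, 0])
            spikes = []
            for t in range(x.size(1)):
                membrane = self.decay * membrane + x[:, t]
                spike = (membrane >= threshold).float()
                membrane = membrane * (1.0 - spike)  # reset neurons that fired
                spikes.append(spike)
            return torch.stack(spikes, dim=1)  # (batch, time, dim)


    class AVMSTSketch(nn.Module):
        """Per-modality SNN encoding, cross-attention fusion of audio and
        visual streams, then transformer reasoning over the fused sequence."""

        def __init__(self, dim=512, heads=8, depth=2):
            super().__init__()
            self.audio_snn = DynamicThresholdLIF(dim)
            self.visual_snn = DynamicThresholdLIF(dim)
            self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
            self.reasoner = nn.TransformerEncoder(layer, num_layers=depth)
            self.head = nn.Linear(dim, dim)  # projection toward a class-embedding space

        def forward(self, audio, visual, audio_sem, visual_sem):
            # audio/visual: (batch, time, dim); *_sem: (batch, dim) semantic cues
            a = self.audio_snn(audio, visual_sem)   # threshold driven by the other modality
            v = self.visual_snn(visual, audio_sem)
            fused, _ = self.cross_attn(query=v, key=a, value=a)  # visual attends to audio
            out = self.reasoner(fused)
            return self.head(out.mean(dim=1))       # pooled embedding for ZSL matching


    if __name__ == "__main__":
        b, t, d = 2, 10, 512
        model = AVMSTSketch(dim=d)
        emb = model(torch.randn(b, t, d), torch.randn(b, t, d),
                    torch.randn(b, d), torch.randn(b, d))
        print(emb.shape)  # torch.Size([2, 512])

In a zero-shot setting, the pooled embedding would then be matched against class-level semantic embeddings of unseen categories; that matching step is omitted here for brevity.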