Modality-Fusion Spiking Transformer Network for Audio-Visual Zero-Shot Learning

Cited by: 5
Authors
Li, Wenrui [1 ]
Ma, Zhengyu [2 ]
Deng, Liang-Jian [3 ]
Man, Hengyu [1 ]
Fan, Xiaopeng [1 ]
Affiliations
[1] Harbin Inst Technol, Harbin, Peoples R China
[2] Peng Cheng Lab, Shenzhen, Peoples R China
[3] Univ Elect Sci & Technol China, Chengdu, Sichuan, Peoples R China
Source
2023 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2023
Funding
National Key R&D Program of China; National Natural Science Foundation of China
Keywords
audio-visual learning; spiking neural network; transformer;
DOI
10.1109/ICME55011.2023.00080
CLC Number
TP18 [Theory of Artificial Intelligence]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Audio-visual zero-shot learning (ZSL), which learns to classify video data from classes not observed during training, is challenging. In audio-visual ZSL, the semantic and temporal information carried by the different modalities is mutually relevant. However, effectively extracting and fusing information from the audio and visual modalities remains an open challenge. In this work, we propose an Audio-Visual Modality-fusion Spiking Transformer network (AVMST) for audio-visual ZSL. More specifically, AVMST comprises a spiking neural network (SNN) module that extracts conspicuous temporal information from each modality, a cross-attention block that effectively fuses the temporal and semantic information, and a transformer reasoning module that further explores the interrelationships of the fused features. To provide robust temporal features, the spiking threshold of the SNN module is adjusted dynamically based on the semantic cues of the different modalities. Because the proposed spiking transformer combines the robustness of SNN feature extraction with the precision of transformer feature inference, the generated feature map accords with the zero-shot learning property. Extensive experiments on three benchmark audio-visual datasets (i.e., VGGSound, UCF and ActivityNet) validate that the proposed AVMST outperforms existing state-of-the-art methods by a significant margin. The code and pre-trained models are available at https://github.com/liwr-hit/ICME23_AVMST.
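For readers who want a concrete picture of the pipeline the abstract describes, below is a minimal PyTorch sketch of an AVMST-style forward pass: a leaky integrate-and-fire (LIF) layer whose firing threshold is shifted by a semantic cue, a cross-attention fusion step, and a small transformer reasoning stage. All module names, dimensions, and the exact threshold rule here are assumptions made for illustration, not the authors' implementation; the real code is at the GitHub link above.

# Minimal sketch of an AVMST-style forward pass. Module names, dimensions,
# and the threshold rule are hypothetical; see the repository above for the
# authors' actual implementation.
import torch
import torch.nn as nn

class DynamicThresholdLIF(nn.Module):
    # Leaky integrate-and-fire layer whose firing threshold is modulated by a
    # semantic cue vector, mirroring the abstract's "dynamically adjusted
    # spiking threshold" (exact rule assumed). Training would need a surrogate
    # gradient for the hard step function; this inference-only sketch omits it.
    def __init__(self, dim, tau=2.0):
        super().__init__()
        self.tau = tau
        self.to_shift = nn.Linear(dim, 1)  # semantic cue -> threshold shift

    def forward(self, x, cue):
        # x: (batch, time, dim) per-frame features; cue: (batch, dim) semantics.
        threshold = 1.0 + torch.sigmoid(self.to_shift(cue))  # (batch, 1)
        mem = torch.zeros_like(x[:, 0])
        spikes = []
        for t in range(x.size(1)):
            mem = mem + (x[:, t] - mem) / self.tau   # leaky integration
            spike = (mem >= threshold).float()       # fire above threshold
            mem = mem * (1.0 - spike)                # hard reset after firing
            spikes.append(spike)
        return torch.stack(spikes, dim=1)            # binary spike trains

class AVMSTSketch(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.snn_audio = DynamicThresholdLIF(dim)
        self.snn_video = DynamicThresholdLIF(dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.reasoner = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True),
            num_layers=2)

    def forward(self, audio, video, cue_audio, cue_video):
        sa = self.snn_audio(audio, cue_audio)   # spiking temporal audio feats
        sv = self.snn_video(video, cue_video)   # spiking temporal visual feats
        fused, _ = self.cross_attn(sa, sv, sv)  # audio queries attend to video
        return self.reasoner(fused)             # reasoning over fused features

audio = torch.randn(2, 8, 256)   # (batch, frames, dim) toy inputs
video = torch.randn(2, 8, 256)
cue = torch.randn(2, 256)        # e.g. class-text embeddings as semantic cues
out = AVMSTSketch()(audio, video, cue, cue)
print(out.shape)                 # torch.Size([2, 8, 256])

One caveat on the fusion step: whether cross-attention runs audio-to-video, video-to-audio, or both directions is a detail of the paper this sketch does not settle; the single direction above is just the simplest instance.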
Pages: 426-431
Number of pages: 6