Multi-level attention network: Mixed time-frequency channel attention and multi-scale self-attentive standard deviation pooling for speaker recognition

Times Cited: 3
Authors
Deng, Lihong [1]
Deng, Fei [2]
Zhou, Kepeng [2]
Jiang, Peifan [2]
Zhang, Gexiang [3]
Yang, Qiang [3]
Affiliations
[1] Southwest Jiaotong Univ, Sch Comp & Artificial Intelligence, Chengdu, Peoples R China
[2] Chengdu Univ Technol, Coll Comp Sci & Cyber Secur, Chengdu, Peoples R China
[3] Chengdu Univ Informat Technol, Sch Automat, Chengdu, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Speaker recognition; Attention mechanism; Aggregation method; Multi-level attention; Architecture
DOI
10.1016/j.engappai.2023.107439
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Subject Classification Code
0812
Abstract
In this paper, we propose an efficient lightweight speaker recognition network, the multi-level attention network (MANet). MANet aims to generate more robust and discriminative speaker features by using multi-level attention to emphasize features at different levels of the network. The multi-level attention comprises mixed time-frequency channel (MTFC) attention and multi-scale self-attentive standard deviation pooling (MSSDP). MTFC attention combines channel, time, and frequency information to capture global features and model long-term context. MSSDP captures the variation in frame-level features and aggregates them at different scales, producing long-term, robust, and discriminative utterance-level features. We performed extensive experiments on two popular datasets, VoxCeleb and CN-Celeb, comparing the proposed method with current state-of-the-art speaker recognition methods. MANet achieved EER/minDCF of 1.82%/0.1965, 1.94%/0.2059, 3.69%/0.3626, and 11.98%/0.4814 on the VoxCeleb1-O, VoxCeleb1-E, VoxCeleb1-H, and CN-Celeb test sets, respectively. It outperforms most large speaker recognition networks and all lightweight speaker recognition networks tested, improving on the baseline ThinResNet-34 by 64%. Compared with the lightest model, EfficientTDNN-Small, it has only 0.6 million more parameters yet performs 63% better, and its performance is within 4% of the state-of-the-art large model LE-Conformer. In ablation experiments on VoxCeleb1-O, the proposed attention method and aggregation model achieved the best performance, with EER/minDCF of 2.46%/0.2708 and 2.39%/0.2417, respectively, indicating that both are significant improvements over previous methods.
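Since this record gives only the abstract and not the paper's implementation, the following is a minimal PyTorch sketch of what self-attentive standard-deviation pooling applied at multiple temporal scales could look like, in the spirit of MSSDP. The class names (AttentiveStdPool, MultiScalePool), the bottleneck size, the scale set (1, 2, 4), and the average-pool downsampling used to form coarser scales are all hypothetical assumptions, not the authors' design.

# Hypothetical sketch of attention-weighted mean + standard-deviation
# pooling over frame-level features, with multiple temporal scales
# approximated by average-pool downsampling. Written against the abstract
# only; the paper's actual MSSDP may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentiveStdPool(nn.Module):
    """Attention-weighted mean and standard deviation over the time axis."""

    def __init__(self, channels: int, bottleneck: int = 128):
        super().__init__()
        # Small self-attention scorer: one weight per frame and channel.
        self.attention = nn.Sequential(
            nn.Conv1d(channels, bottleneck, kernel_size=1),
            nn.Tanh(),
            nn.Conv1d(bottleneck, channels, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        w = torch.softmax(self.attention(x), dim=2)      # frame weights
        mean = torch.sum(w * x, dim=2)                   # weighted mean
        var = torch.sum(w * x ** 2, dim=2) - mean ** 2   # weighted variance
        std = torch.sqrt(var.clamp(min=1e-8))            # weighted std
        return torch.cat([mean, std], dim=1)             # (batch, 2*channels)


class MultiScalePool(nn.Module):
    """Applies attentive std pooling at several temporal scales."""

    def __init__(self, channels: int, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.pools = nn.ModuleList(AttentiveStdPool(channels) for _ in scales)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        outs = []
        for s, pool in zip(self.scales, self.pools):
            # Downsample the time axis to create a coarser scale.
            xs = F.avg_pool1d(x, kernel_size=s) if s > 1 else x
            outs.append(pool(xs))
        return torch.cat(outs, dim=1)  # (batch, 2*channels*len(scales))


if __name__ == "__main__":
    feats = torch.randn(8, 256, 200)    # 8 utterances, 256 channels, 200 frames
    pooled = MultiScalePool(256)(feats)
    print(pooled.shape)                 # torch.Size([8, 1536])

The weighted standard deviation term is what distinguishes this family of aggregators from plain attentive average pooling: it summarizes how much the frame-level features vary over the utterance, which matches the abstract's claim that MSSDP "captures the variation in frame-level features".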
Pages: 14