Multi-level attention network: Mixed time-frequency channel attention and multi-scale self-attentive standard deviation pooling for speaker recognition

Times Cited: 3
Authors
Deng, Lihong [1]
Deng, Fei [2]
Zhou, Kepeng [2]
Jiang, Peifan [2]
Zhang, Gexiang [3]
Yang, Qiang [3]
Affiliations
[1] Southwest Jiaotong Univ, Sch Comp & Artificial Intelligence, Chengdu, Peoples R China
[2] Chengdu Univ Technol, Coll Comp Sci & Cyber Secur, Chengdu, Peoples R China
[3] Chengdu Univ Informat Technol, Sch Automat, Chengdu, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Speaker recognition; Attention mechanism; Aggregation method; Multi-level attention; Architecture
DOI
10.1016/j.engappai.2023.107439
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Subject Classification Code
0812
Abstract
In this paper, we propose an efficient lightweight speaker recognition network, the multi-level attention network (MANet). MANet aims to generate more robust and discriminative speaker features by using multi-level attention to emphasize features at different levels of the network. The multi-level attention comprises mixed time-frequency channel (MTFC) attention and multi-scale self-attentive standard deviation pooling (MSSDP). MTFC attention combines channel, time, and frequency information to capture global features and model long-term context. MSSDP captures the variation in frame-level features and aggregates them at different scales, producing long-term, robust, and discriminative utterance-level features. We performed extensive experiments on two popular datasets, VoxCeleb and CN-Celeb, comparing the proposed method with current state-of-the-art speaker recognition methods. MANet achieved EER/minDCF of 1.82%/0.1965, 1.94%/0.2059, 3.69%/0.3626, and 11.98%/0.4814 on the VoxCeleb1-O, VoxCeleb1-E, VoxCeleb1-H, and CN-Celeb test sets, respectively. It outperforms most large speaker recognition networks and all lightweight speaker recognition networks tested, improving on the baseline ThinResNet-34 by 64%. Compared with the lightest model, EfficientTDNN-Small, it has only 0.6 million more parameters yet performs 63% better, and its performance is within 4% of the state-of-the-art large model LE-Conformer. In ablation experiments on VoxCeleb1-O, the proposed attention method and aggregation model achieved the best performance, with EER/minDCF of 2.46%/0.2708 and 2.39%/0.2417, respectively, indicating that both are significant improvements over previous methods.
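Since this record gives only the abstract and not the paper's implementation, the following is a minimal PyTorch sketch of what self-attentive standard-deviation pooling applied at multiple temporal scales could look like, in the spirit of MSSDP. The class names (AttentiveStdPool, MultiScalePool), the bottleneck size, the scale set (1, 2, 4), and the average-pool downsampling used to form coarser scales are all hypothetical assumptions, not the authors' design.

# Hypothetical sketch of attention-weighted mean + standard-deviation
# pooling over frame-level features, with multiple temporal scales
# approximated by average-pool downsampling. Written against the abstract
# only; the paper's actual MSSDP may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentiveStdPool(nn.Module):
    """Attention-weighted mean and standard deviation over the time axis."""

    def __init__(self, channels: int, bottleneck: int = 128):
        super().__init__()
        # Small self-attention scorer: one weight per frame and channel.
        self.attention = nn.Sequential(
            nn.Conv1d(channels, bottleneck, kernel_size=1),
            nn.Tanh(),
            nn.Conv1d(bottleneck, channels, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        w = torch.softmax(self.attention(x), dim=2)      # frame weights
        mean = torch.sum(w * x, dim=2)                   # weighted mean
        var = torch.sum(w * x ** 2, dim=2) - mean ** 2   # weighted variance
        std = torch.sqrt(var.clamp(min=1e-8))            # weighted std
        return torch.cat([mean, std], dim=1)             # (batch, 2*channels)


class MultiScalePool(nn.Module):
    """Applies attentive std pooling at several temporal scales."""

    def __init__(self, channels: int, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.pools = nn.ModuleList(AttentiveStdPool(channels) for _ in scales)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        outs = []
        for s, pool in zip(self.scales, self.pools):
            # Downsample the time axis to create a coarser scale.
            xs = F.avg_pool1d(x, kernel_size=s) if s > 1 else x
            outs.append(pool(xs))
        return torch.cat(outs, dim=1)  # (batch, 2*channels*len(scales))


if __name__ == "__main__":
    feats = torch.randn(8, 256, 200)    # 8 utterances, 256 channels, 200 frames
    pooled = MultiScalePool(256)(feats)
    print(pooled.shape)                 # torch.Size([8, 1536])

The weighted standard deviation term is what distinguishes this family of aggregators from plain attentive average pooling: it summarizes how much the frame-level features vary over the utterance, which matches the abstract's claim that MSSDP "captures the variation in frame-level features".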
Pages: 14