Rep-MCA-former: An efficient multi-scale convolution attention encoder for text-independent speaker verification

Cited by: 3
Authors
Liu, Xiaohu [1]
Chen, Defu [1]
Wang, Xianbao [1]
Xiang, Sheng [1]
Zhou, Xuwen [1]
Affiliations
[1] Zhejiang Univ Technol, Informat Engineer Coll, Hangzhou 310023, Zhejiang, Peoples R China
Source
Keywords
Speaker verification; Transformer encoder; Multi-scale convolution; Re-parameterization;
DOI
10.1016/j.csl.2023.101600
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In many speaker verification tasks, the quality of the speaker embedding is an important factor affecting system performance. Advanced speaker embedding extraction networks aim to capture richer speaker features through multi-branch network architectures. Recently, speaker verification systems based on transformer encoders (e.g., MFA-Conformer) have received much attention and achieved satisfactory results, because transformer encoders can efficiently extract global speaker features. However, these approaches commonly suffer from a large number of model parameters and high computational latency, which makes them difficult to deploy on resource-constrained edge terminals. To address this issue, this paper proposes an effective, lightweight transformer model (MCA-former) with multi-scale convolutional self-attention (MCA), which performs multi-scale modeling and channel modeling along the temporal direction of the input at low computational cost. In addition, for the inference phase, we develop a systematic re-parameterization method that converts the multi-branch network structure into a single-path topology, effectively improving inference speed. We evaluate the MCA-former for speaker verification on the VoxCeleb1 test set. The results show that the MCA-based transformer model has advantages in parameter count and inference efficiency. Applying the re-parameterization increases the inference speed of the model by about 30% and significantly reduces memory consumption.
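The abstract's idea of folding a multi-branch structure into a single path at inference time can be illustrated with a minimal, generic sketch. It assumes parallel 1-D convolution branches with odd kernel sizes, stride 1, zero "same" padding, and summed outputs; the paper's actual MCA-former branches and merging procedure are not described in this record, and all function names and numeric values below are hypothetical.

```python
# Illustrative sketch only: generic structural re-parameterization of parallel
# 1-D convolution branches. This is NOT the paper's actual procedure.

def conv1d_same(signal, kernel):
    """Cross-correlate signal with an odd-length kernel, zero-padded to keep length."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [
        sum(kernel[j] * padded[i + j] for j in range(len(kernel)))
        for i in range(len(signal))
    ]

def pad_kernel(kernel, size):
    """Centre an odd-length kernel inside a zero kernel of odd length size."""
    extra = (size - len(kernel)) // 2
    return [0.0] * extra + list(kernel) + [0.0] * extra

def merge_branches(kernels):
    """Fold parallel branches (whose outputs are summed) into one kernel."""
    size = max(len(k) for k in kernels)
    padded = [pad_kernel(k, size) for k in kernels]
    return [sum(vals) for vals in zip(*padded)]

def multi_branch(signal, kernels):
    """Training-time view: run every branch separately and sum the outputs."""
    outputs = [conv1d_same(signal, k) for k in kernels]
    return [sum(vals) for vals in zip(*outputs)]

# Hypothetical branch kernels of sizes 1, 3 and 5, and a hypothetical input.
kernels = [[1.0], [0.5, -1.0, 0.5], [0.1, 0.2, 0.3, 0.2, 0.1]]
signal = [1.0, 2.0, 0.0, -1.0, 3.0, 0.5]

merged = merge_branches(kernels)           # single inference-time kernel
multi_out = multi_branch(signal, kernels)  # three convolutions
single_out = conv1d_same(signal, merged)   # one convolution
max_diff = max(abs(a - b) for a, b in zip(multi_out, single_out))
```

Because convolution is linear, the single merged kernel reproduces the summed branch outputs up to floating-point rounding, so `max_diff` should be negligible. Replacing several branches with one operation is the general reason such merging can reduce inference latency and memory use.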
Pages: 13