Rep-MCA-former: An efficient multi-scale convolution attention encoder for text-independent speaker verification

Cited by: 3

Authors:

Liu, Xiaohu [1]

Chen, Defu [1]

Wang, Xianbao [1]

Xiang, Sheng [1]

Zhou, Xuwen [1]

Affiliation:

[1] Zhejiang Univ Technol, Informat Engineer Coll, Hangzhou 310023, Zhejiang, Peoples R China

Source:

COMPUTER SPEECH AND LANGUAGE | 2024 / Vol. 85

Keywords:

Speaker verification; Transformer encoder; Multi-scale convolution; Re-parameterization

DOI:

10.1016/j.csl.2023.101600

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

In many speaker verification tasks, the quality of the speaker embedding is an important factor affecting system performance. Advanced speaker embedding extraction networks aim to capture richer speaker features through multi-branch network architectures. Recently, speaker verification systems based on transformer encoders (e.g., MFA-Conformer) have received much attention and achieved satisfactory results, because transformer encoders can efficiently extract global features of the speaker. However, these approaches commonly suffer from a large number of model parameters and high computational latency, which makes them difficult to deploy on resource-constrained edge terminals. To address this issue, this paper proposes an effective, lightweight transformer model (MCA-former) with multi-scale convolutional self-attention (MCA), which performs multi-scale modeling and channel modeling along the temporal direction of the input at low computational cost. In addition, for the inference phase, we develop a systematic re-parameterization method that converts the multi-branch network structure into a single-path topology, effectively improving inference speed. We evaluate the MCA-former for speaker verification on the VoxCeleb1 test set. The results show that the MCA-based transformer model is advantageous in terms of the number of parameters and inference efficiency. With re-parameterization applied, the inference speed of the model increases by about 30%, and memory consumption is significantly reduced.
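The abstract does not spell out the re-parameterization procedure, so the following is only a minimal sketch of the general idea of folding parallel convolution branches into a single path, not the authors' method. It assumes three parallel 1-D convolutions with odd kernel widths 1, 3, and 5, zero "same" padding, stride 1, and outputs that are summed; the names `conv1d_same` and `merge_branches` are hypothetical. Under those assumptions, the smaller kernels can be zero-padded to the largest width and added together, so one convolution reproduces the multi-branch output.

```python
import numpy as np

def conv1d_same(x, w, b):
    """1-D cross-correlation with zero 'same' padding and stride 1.
    x: (C_in, T), w: (C_out, C_in, K) with odd K, b: (C_out,)."""
    c_out, c_in, k = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))  # zero padding on the time axis
    t = x.shape[1]
    y = np.empty((c_out, t))
    for i in range(t):
        window = xp[:, i:i + k]  # (C_in, K) slice centred on frame i
        y[:, i] = np.tensordot(w, window, axes=([1, 2], [0, 1])) + b
    return y

def merge_branches(branches, k_max):
    """Fold parallel (weight, bias) branches into one kernel of width k_max.
    Each smaller kernel is centred inside a zero kernel of width k_max."""
    c_out, c_in, _ = branches[0][0].shape
    w_merged = np.zeros((c_out, c_in, k_max))
    b_merged = np.zeros(c_out)
    for w, b in branches:
        k = w.shape[2]
        off = (k_max - k) // 2  # centre alignment, valid for odd widths
        w_merged[:, :, off:off + k] += w
        b_merged += b
    return w_merged, b_merged

# Illustrative shapes only (assumed, not taken from the paper).
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 20))  # 4 channels, 20 frames
branches = [(rng.standard_normal((6, 4, k)), rng.standard_normal(6))
            for k in (1, 3, 5)]

# Training-time view: three branches evaluated separately, then summed.
y_multi = sum(conv1d_same(x, w, b) for w, b in branches)

# Inference-time view: a single merged convolution.
w_merged, b_merged = merge_branches(branches, k_max=5)
y_single = conv1d_same(x, w_merged, b_merged)

assert np.allclose(y_multi, y_single)
```

The merge works because convolution is linear in its weights. Practical re-parameterization schemes usually also fold normalization layers into the kernels before merging, which this sketch leaves out.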


Pages: 13

30 items in total

[1] MFA: TDNN WITH MULTI-SCALE FREQUENCY-CHANNEL ATTENTION FOR TEXT-INDEPENDENT SPEAKER VERIFICATION WITH SHORT UTTERANCES
Liu, Tianchi
Das, Rohan Kumar
Lee, Kong Aik
Li, Haizhou
2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022 : 7517 - 7521
[2] CNN WITH PHONETIC ATTENTION FOR TEXT-INDEPENDENT SPEAKER VERIFICATION
Zhou, Tianyan
Zhao, Yong
Li, Jinyu
Gong, Yifan
Wu, Jian
2019 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU 2019), 2019 : 718 - 725
[3] Text-Independent Speaker Verification with Dual Attention Network
Li, Jingyu
Lee, Tan
INTERSPEECH 2020, 2020 : 956 - 960
[4] Self-Attention Networks for Text-Independent Speaker Verification
Bian, Tengyue
Chen, Fangzhou
Xu, Li
PROCEEDINGS OF THE 2019 31ST CHINESE CONTROL AND DECISION CONFERENCE (CCDC 2019), 2019 : 3955 - 3960
[5] Context-adaptive Gaussian Attention for Text-independent Speaker Verification
Peng, Junyi
Gu, Rongzhi
Zhang, Haoran
Zou, Yuexian
2020 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2020 : 595 - 599
[6] DeltaVLAD: An efficient optimization algorithm to discriminate speaker embedding for text-independent speaker verification
Guo, Xin
Luo, Chengfang
Deng, Aiwen
Deng, Feiqi
AIMS MATHEMATICS, 2022, 7 (04) : 6381 - 6395
[7] Deep multi-metric learning for text-independent speaker verification
Xu, Jiwei
Wang, Xinggang
Feng, Bin
Liu, Wenyu
NEUROCOMPUTING, 2020, 410 : 394 - 400
[8] ADAPTATION OF PLDA FOR MULTI-SOURCE TEXT-INDEPENDENT SPEAKER VERIFICATION
Chen, Liping
Lee, Kong Aik
Ma, Bin
Ma, Long
Li, Haizhou
Dai, Li-Rong
2017 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2017 : 5380 - 5384
[9] DEEP SPEAKER EMBEDDING LEARNING WITH MULTI-LEVEL POOLING FOR TEXT-INDEPENDENT SPEAKER VERIFICATION
Tang, Yun
Ding, Guohong
Huang, Jing
He, Xiaodong
Zhou, Bowen
2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019 : 6116 - 6120
[10] An efficient text-independent speaker verification for short utterance data from Mobile devices
Sanghamitra V. Arora
Rekha Vig
Multimedia Tools and Applications, 2020, 79 : 3049 - 3074
