Bayesian Self-Attentive Speaker Embeddings for Text-Independent Speaker Verification

Cited by: 8

Authors:

Zhu, Yingke [1]

Mak, Brian [1]

Affiliations:

[1] Hong Kong Univ Sci & Technol, Comp Sci & Engn, Hong Kong, Peoples R China

Source:

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2023 / Vol. 31

Keywords:

Speaker verification; deep neural network; self-attention; speaker embedding; x-vectors

DOI:

10.1109/TASLP.2023.3244502

Chinese Library Classification number:

O42 [Acoustics]

Discipline classification codes:

070206; 082403

Abstract:

Learning effective and discriminative speaker embeddings is a crucial task in speaker verification. Usually, speaker embeddings are extracted from a speaker-classification DNN that averages the hidden vectors over all the spoken frames of a speaker; the hidden vectors produced from all the frames are assumed to be equally important. In our previous work, we relaxed this assumption and computed the speaker embedding as a weighted average of a speaker's frame-level hidden vectors, with the weights automatically determined by a self-attention mechanism. The effect of multiple attention heads has also been investigated to capture different aspects of a speaker's input speech. One challenge for multi-head attention is information redundancy: if there is no constraint during the training of multi-head attention, different heads may extract similar attentive features. In this paper, we generalize the deterministic multi-head attention to a Bayesian attention framework, and provide a new understanding of multi-head attention from a Bayesian perspective. Under the Bayesian framework, we adopt a recently developed sampling method in optimization, which explicitly enforces repulsiveness among the multiple heads. Systematic evaluation of the proposed Bayesian self-attentive speaker embeddings is performed on the VoxCeleb and SITW evaluation sets. Significant and consistent improvements over other multi-head attention systems are achieved on all the evaluation datasets. The best Bayesian system with eight heads improves the EER by around 26% on VoxCeleb and 9% on SITW over the single-head baseline.
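
The weighted-average pooling described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the scoring form `softmax(w2 · tanh(w1 · h_t))`, the parameter names `w1` and `w2`, the toy dimensions, and the random inputs are all assumptions chosen for illustration, and the Bayesian sampling step that enforces repulsiveness between heads is not shown.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attentive_pooling(frames, w1, w2):
    """Pool T frame-level vectors into one vector per attention head.

    frames: T x D list of hidden vectors (hypothetical input)
    w1:     A x D projection matrix (hypothetical parameter)
    w2:     K x A scoring matrix, one row per head (hypothetical parameter)
    Returns (weights, pooled): K x T attention weights and K x D head outputs.
    """
    dim = len(frames[0])
    # Hidden projection of every frame: tanh(w1 @ h_t), shape T x A.
    proj = [[math.tanh(sum(w * x for w, x in zip(row, h))) for row in w1]
            for h in frames]
    weights, pooled = [], []
    for head in w2:
        # One scalar score per frame, normalized over time.
        scores = [sum(v * p for v, p in zip(head, p_t)) for p_t in proj]
        alpha = softmax(scores)
        weights.append(alpha)
        # Weighted average of the frames under this head's weights.
        pooled.append([sum(a * h[d] for a, h in zip(alpha, frames))
                       for d in range(dim)])
    return weights, pooled

# Toy example: 6 frames, 4-dim hidden vectors, 3-dim projection, 2 heads.
rng = random.Random(0)
T, D, A, K = 6, 4, 3, 2
frames = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(T)]
w1 = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(A)]
w2 = [[rng.gauss(0, 1) for _ in range(A)] for _ in range(K)]

weights, pooled = attentive_pooling(frames, w1, w2)
# Concatenate the head outputs into a single utterance-level vector.
embedding = [x for head_vec in pooled for x in head_vec]
```

In this sketch, a head whose scoring row is all zeros assigns every frame the same weight, which reduces to the plain averaging that the abstract describes as the usual baseline. Nothing here prevents two heads from learning nearly identical weights, which is the redundancy problem the paper addresses.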


Pages: 1000-1012

Page count: 13
