Bayesian Self-Attentive Speaker Embeddings for Text-Independent Speaker Verification

Cited by: 8
Authors
Zhu, Yingke [1]
Mak, Brian [1]
Affiliations
[1] Hong Kong Univ Sci & Technol, Comp Sci & Engn, Hong Kong, Peoples R China
Keywords
Speaker verification; deep neural network; self-attention; speaker embedding; x-vectors
DOI
10.1109/TASLP.2023.3244502
Chinese Library Classification
O42 [Acoustics]
Discipline classification code
070206; 082403
Abstract
Learning effective and discriminative speaker embeddings is a crucial task in speaker verification. Usually, speaker embeddings are extracted from a speaker-classification DNN that averages the hidden vectors over all the spoken frames of a speaker; the hidden vectors produced from all the frames are assumed to be equally important. In our previous work, we relaxed this assumption and computed the speaker embedding as a weighted average of a speaker's frame-level hidden vectors, with the weights determined automatically by a self-attention mechanism. The effect of multiple attention heads has also been investigated to capture different aspects of a speaker's input speech. One challenge for multi-head attention is redundancy: if there is no constraint during the training of multi-head attention, different heads may extract similar attentive features. In this paper, we generalize the deterministic multi-head attention to a Bayesian attention framework, and provide a new understanding of multi-head attention from a Bayesian perspective. Under the Bayesian framework, we adopt a recently developed sampling method in optimization, which explicitly enforces repulsiveness among the multiple heads. Systematic evaluation of the proposed Bayesian self-attentive speaker embeddings is performed on the VoxCeleb and SITW evaluation sets. Significant and consistent improvements over other multi-head attention systems are achieved on all the evaluation datasets. The best Bayesian system with eight heads improves the EER by around 26% on VoxCeleb and 9% on SITW over the single-head baseline.
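The contrast the abstract draws between plain averaging and attention-weighted averaging can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the scoring function `tanh(H W1) W2`, the concatenation of heads, and the names `mean_pooling`, `multi_head_attentive_pooling`, `W1`, and `W2` are all assumptions made for illustration, and the weights here are random rather than trained.

```python
import numpy as np


def softmax(x, axis=0):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def mean_pooling(H):
    """Baseline: every frame-level hidden vector gets equal weight."""
    return H.mean(axis=0)


def multi_head_attentive_pooling(H, W1, W2):
    """Weighted average of frame-level vectors, one set of weights per head.

    H  : (T, d)    frame-level hidden vectors for one utterance
    W1 : (d, d_a)  hypothetical projection into an attention space
    W2 : (d_a, K)  hypothetical per-head scoring vectors (K heads)
    """
    scores = np.tanh(H @ W1) @ W2   # (T, K): one score per frame and head
    A = softmax(scores, axis=0)     # normalize over frames, so each column sums to 1
    heads = A.T @ H                 # (K, d): one weighted average per head
    return heads.reshape(-1), A     # concatenate heads (an assumption)


# Toy dimensions, chosen only for the example.
rng = np.random.default_rng(0)
T, d, d_a, K = 50, 8, 4, 2
H = rng.standard_normal((T, d))
W1 = rng.standard_normal((d, d_a))
W2 = rng.standard_normal((d_a, K))

embedding, A = multi_head_attentive_pooling(H, W1, W2)

# With all-zero scoring vectors the attention weights are uniform,
# so every head falls back to plain mean pooling.
uniform_embedding, _ = multi_head_attentive_pooling(H, W1, np.zeros((d_a, K)))
```

Nothing in this sketch stops the columns of `A` from being nearly identical across heads, which is the redundancy the abstract describes; the paper's Bayesian treatment and its repulsive sampling step are not reproduced here.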
Pages: 1000-1012
Page count: 13