Target Speaker Verification With Selective Auditory Attention for Single and Multi-Talker Speech

Cited by: 18
Authors
Xu, Chenglin [1]
Rao, Wei [2]
Wu, Jibin [1]
Li, Haizhou [1]
Affiliations
[1] Natl Univ Singapore, Dept Elect & Comp Engn, Singapore 119077, Singapore
[2] Tencent Ethereal Audio Lab, Shenzhen 518057, Peoples R China
Funding
National Research Foundation, Singapore;
Keywords
Training; Decoding; Convolution; Speech enhancement; Voice activity detection; Time-domain analysis; Task analysis; Target speaker verification; speaker extraction; single- and multi-talker speaker verification; RECOGNITION; DIARIZATION; CHANNEL; SEPARATION;
DOI
10.1109/TASLP.2021.3100682
Chinese Library Classification number
O42 [Acoustics];
Discipline classification codes
070206; 082403;
Abstract
Speaker verification has been studied mostly under the single-talker condition, and its performance degrades in the presence of interfering speakers. Inspired by work on target speaker extraction, e.g., SpEx, we propose a unified speaker verification framework for both single- and multi-talker speech that pays selective auditory attention to the target speaker. This target speaker verification (tSV) framework jointly optimizes a speaker attention module and a speaker representation module via multi-task learning. We study four target speaker embedding schemes under the tSV framework. Experimental results show that all four schemes significantly outperform other competitive solutions for multi-talker speech. Notably, the best tSV speaker embedding scheme achieves 76.0% and 55.3% relative improvements in equal error rate over the baseline system on the WSJ0-2mix-extr and Libri2Mix corpora for 2-talker speech, while the performance of tSV on single-talker speech is on par with that of a traditional speaker verification system trained and evaluated under the same single-talker condition.
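The abstract reports its gains as relative improvements in equal error rate (EER). The following is a minimal sketch of how such figures are commonly computed, not the authors' evaluation code. The function names `equal_error_rate` and `relative_improvement` are hypothetical, the scores are toy values rather than results from the paper, and reading "relative improvement" as relative EER reduction is an assumption.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by trying every observed score as a threshold.

    Assumption: higher scores mean "more likely the target speaker".
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection rate: target trials scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance rate: non-target trials scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the operating point where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


def relative_improvement(baseline_eer, system_eer):
    """Relative EER reduction of a system with respect to a baseline."""
    return (baseline_eer - system_eer) / baseline_eer


# Toy trial scores (illustrative only, not taken from the paper).
toy_targets = [0.9, 0.8, 0.7, 0.4]
toy_nontargets = [0.6, 0.3, 0.2, 0.1]
toy_eer = equal_error_rate(toy_targets, toy_nontargets)

# Toy EER values for a hypothetical baseline and a hypothetical system.
toy_gain = relative_improvement(0.20, 0.05)
```

Under this reading, a 76.0% relative improvement means the system's EER is 24.0% of the baseline's EER on the same trials; the absolute EER values are not stated in the abstract.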
Pages: 2696-2709
Number of pages: 14