Target Speaker Verification With Selective Auditory Attention for Single and Multi-Talker Speech

Cited by: 18

Authors:

Xu, Chenglin [1]

Rao, Wei [2]

Wu, Jibin [1]

Li, Haizhou [1]

Affiliations:

[1] Natl Univ Singapore, Dept Elect & Comp Engn, Singapore 119077, Singapore

[2] Tencent Ethereal Audio Lab, Shenzhen 518057, Peoples R China

Source:

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2021 / Vol. 29

Funding:

National Research Foundation, Singapore;

Keywords:

Training; Decoding; Convolution; Speech enhancement; Voice activity detection; Time-domain analysis; Task analysis; Target speaker verification; speaker extraction; single- and multi-talker speaker verification; RECOGNITION; DIARIZATION; CHANNEL; SEPARATION;

DOI:

10.1109/TASLP.2021.3100682

Chinese Library Classification number:

O42 [Acoustics];

Discipline classification code:

070206 ; 082403 ;

Abstract:

Speaker verification has been studied mostly under the single-talker condition. It is adversely affected by the presence of interfering speakers. Inspired by the study on target speaker extraction, e.g., SpEx, we propose a unified speaker verification framework for both single- and multi-talker speech that is able to pay selective auditory attention to the target speaker. This target speaker verification (tSV) framework jointly optimizes a speaker attention module and a speaker representation module via multi-task learning. We study four different target speaker embedding schemes under the tSV framework. The experimental results show that all four target speaker embedding schemes significantly outperform other competitive solutions for multi-talker speech. Notably, the best tSV speaker embedding scheme achieves 76.0% and 55.3% relative improvements over the baseline system on the WSJ0-2mix-extr and Libri2Mix corpora in terms of equal error rate for 2-talker speech, while the performance of tSV for single-talker speech is on par with that of a traditional speaker verification system that is trained and evaluated under the same single-talker condition.

Pages: 2696-2709

Page count: 14

50 items in total

[41] Target identification using relative level in multi-talker listening
Kitterick, Pádraig T.
Clarke, Emmet
Oshea, Charlotte
Seymour, Josephine
Summerfield, A. Quentin
Journal of the Acoustical Society of America, 2013, 133 (05): 2899 - 2909
[42] The Impact of Speech-Irrelevant Head Movements on Speech Intelligibility in Multi-Talker Environments
Frissen, Ilja
Scherzer, Johannes
Yao, Hsin-Yun
ACTA ACUSTICA UNITED WITH ACUSTICA, 2019, 105 (06): 1286 - 1290
[43] EEG correlates of spatial shifts of attention in a dynamic multi-talker speech perception scenario in younger and older adults
Getzmann, Stephan
Klatt, Laura-Isabelle
Schneider, Daniel
Begau, Alexandra
Wascher, Edmund
HEARING RESEARCH, 2020, 398
[44] SPEAKER CHANGE DETECTION USING FUNDAMENTAL FREQUENCY WITH APPLICATION TO MULTI-TALKER SEGMENTATION
Hogg, Aidan O. T.
Evers, Christine
Naylor, Patrick A.
2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019: 5826 - 5830
[45] A Speaker-Dependent Approach to Single-Channel Joint Speech Separation and Acoustic Modeling Based on Deep Neural Networks for Robust Recognition of Multi-Talker Speech
Yan-Hui Tu
Jun Du
Chin-Hui Lee
Journal of Signal Processing Systems, 2018, 90: 963 - 973
[46] Auditory recognition of Persian digits in presence of speech-spectrum noise and multi-talker babble: a validation study
Ebrahimi, Amin
Mahdavi, Mohammad Ebrahim
Jalilvand, Hamid
AUDITORY AND VESTIBULAR RESEARCH, 2020, 29 (01): 39 - 47
[47] FACE LANDMARK-BASED SPEAKER-INDEPENDENT AUDIO-VISUAL SPEECH ENHANCEMENT IN MULTI-TALKER ENVIRONMENTS
Morrone, Giovanni
Pasa, Luca
Tikhanoff, Vadim
Bergamaschi, Sonia
Fadiga, Luciano
Badino, Leonardo
2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019: 6900 - 6904
[48] A Speaker-Dependent Approach to Single-Channel Joint Speech Separation and Acoustic Modeling Based on Deep Neural Networks for Robust Recognition of Multi-Talker Speech
Tu, Yan-Hui
Du, Jun
Lee, Chin-Hui
JOURNAL OF SIGNAL PROCESSING SYSTEMS FOR SIGNAL IMAGE AND VIDEO TECHNOLOGY, 2018, 90 (07): 963 - 973
[49] The effect of nearby maskers on speech intelligibility in reverberant, multi-talker environments
Westermann, Adam
Buchholz, Joerg M.
JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA, 2017, 141 (03): 2214 - 2223
[50] Peripheral hearing loss reduces the ability of children to direct selective attention during multi-talker listening
Holmes, Emma
Kitterick, Padraig T.
Summerfield, A. Quentin
HEARING RESEARCH, 2017, 350: 160 - 172