Target Speaker Verification With Selective Auditory Attention for Single and Multi-Talker Speech

Cited by: 18

Authors:

Xu, Chenglin ^{[1]}

Rao, Wei ^{[2]}

Wu, Jibin ^{[1]}

Li, Haizhou ^{[1]}

Affiliations:

[1] Natl Univ Singapore, Dept Elect & Comp Engn, Singapore 119077, Singapore

[2] Tencent Ethereal Audio Lab, Shenzhen 518057, Peoples R China

Source:

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2021 / Vol. 29

Funding:

National Research Foundation, Singapore;

Keywords:

Training; Decoding; Convolution; Speech enhancement; Voice activity detection; Time-domain analysis; Task analysis; Target speaker verification; speaker extraction; single- and multi-talker speaker verification; RECOGNITION; DIARIZATION; CHANNEL; SEPARATION;

DOI:

10.1109/TASLP.2021.3100682

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification codes:

070206 ; 082403 ;

Abstract:

Speaker verification has been studied mostly under the single-talker condition, and its performance degrades in the presence of interfering speakers. Inspired by work on target speaker extraction, e.g., SpEx, we propose a unified speaker verification framework for both single- and multi-talker speech that can pay selective auditory attention to the target speaker. This target speaker verification (tSV) framework jointly optimizes a speaker attention module and a speaker representation module via multi-task learning. We study four different target speaker embedding schemes under the tSV framework. The experimental results show that all four target speaker embedding schemes significantly outperform other competitive solutions for multi-talker speech. Notably, the best tSV speaker embedding scheme achieves 76.0% and 55.3% relative improvements over the baseline system on the WSJ0-2mix-extr and Libri2Mix corpora, respectively, in terms of equal error rate for 2-talker speech, while the performance of tSV for single-talker speech is on par with that of a traditional speaker verification system that is trained and evaluated under the same single-talker condition.

Pages: 2696-2709

Page count: 14
