Target Speaker Verification With Selective Auditory Attention for Single and Multi-Talker Speech

Cited by: 18
Authors
Xu, Chenglin [1]
Rao, Wei [2]
Wu, Jibin [1]
Li, Haizhou [1]
Affiliations
[1] National University of Singapore, Department of Electrical and Computer Engineering, Singapore 119077, Singapore
[2] Tencent Ethereal Audio Lab, Shenzhen 518057, People's Republic of China
Funding
National Research Foundation, Singapore
Keywords
Training; Decoding; Convolution; Speech enhancement; Voice activity detection; Time-domain analysis; Task analysis; Target speaker verification; speaker extraction; single- and multi-talker speaker verification; RECOGNITION; DIARIZATION; CHANNEL; SEPARATION;
DOI
10.1109/TASLP.2021.3100682
Chinese Library Classification
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Speaker verification has been studied mostly under the single-talker condition, and its performance is adversely affected by the presence of interfering speakers. Inspired by studies on target speaker extraction, e.g., SpEx, we propose a unified speaker verification framework for both single- and multi-talker speech that is able to pay selective auditory attention to the target speaker. This target speaker verification (tSV) framework jointly optimizes a speaker attention module and a speaker representation module via multi-task learning. We study four different target speaker embedding schemes under the tSV framework. The experimental results show that all four target speaker embedding schemes significantly outperform other competitive solutions for multi-talker speech. Notably, the best tSV speaker embedding scheme achieves 76.0% and 55.3% relative improvements over the baseline system on the WSJ0-2mix-extr and Libri2Mix corpora, respectively, in terms of equal error rate for 2-talker speech, while the performance of tSV for single-talker speech is on par with that of a traditional speaker verification system that is trained and evaluated under the same single-talker condition.
Pages: 2696-2709
Page count: 14