Target Confusion in End-to-end Speaker Extraction: Analysis and Approaches

Cited by: 7
Authors
Zhao, Zifeng [1 ]
Yang, Dongchao [1 ]
Gu, Rongzhi [1 ]
Zhang, Haoran [1 ]
Zou, Yuexian [1 ]
Affiliations
[1] Peking Univ, Sch ECE, ADSPLAB, Shenzhen, People's Republic of China
Source
INTERSPEECH 2022
Funding
National Natural Science Foundation of China
Keywords
speech separation; end-to-end speaker extraction; target confusion problem; metric learning; post-filtering
DOI
10.21437/Interspeech.2022-176
CLC Classification Number
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
Recently, end-to-end speaker extraction has attracted increasing attention and shown promising results. However, its performance is often inferior to that of a blind speech separation (BSS) counterpart with a similar network architecture, because the auxiliary speaker encoder may sometimes generate ambiguous speaker embeddings. Such ambiguous guidance can confuse the separation network and lead to wrong extraction results, which deteriorates the overall performance. We refer to this as the target confusion problem. In this paper, we analyze this issue and address it in two stages. In the training phase, we propose to integrate metric learning methods to improve the distinguishability of the embeddings produced by the speaker encoder. For inference, a novel post-filtering strategy is designed to revise wrong results. Specifically, we first identify confused samples by measuring the similarities between output estimates and enrollment utterances, after which the true target sources are recovered by a subtraction operation. Experiments show that our methods bring a performance improvement of more than 1 dB SI-SDRi, which validates their effectiveness and emphasizes the impact of the target confusion problem.
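The post-filtering strategy described in the abstract can be pictured with a short sketch. The Python snippet below is only a minimal illustration of the general idea, not the authors' implementation: it assumes a two-speaker mixture, a generic speaker_encoder callable that maps a waveform to an embedding, cosine similarity as the confusion measure, and an arbitrary threshold sim_threshold; all names and values are hypothetical.

# Hypothetical sketch: flag a confused extraction when the estimate's speaker
# embedding does not match the enrollment embedding, then recover the target
# by subtracting the (presumably wrong) estimate from the mixture.
import torch
import torch.nn.functional as F

def post_filter(mixture: torch.Tensor,
                estimate: torch.Tensor,
                enroll_emb: torch.Tensor,
                speaker_encoder,
                sim_threshold: float = 0.5) -> torch.Tensor:
    """Return a revised target estimate for one two-speaker mixture.

    mixture:         (T,) mixture waveform
    estimate:        (T,) waveform produced by the extraction network
    enroll_emb:      (D,) embedding of the enrollment utterance
    speaker_encoder: callable mapping a waveform (T,) -> embedding (D,)
    """
    est_emb = speaker_encoder(estimate)
    sim = F.cosine_similarity(est_emb, enroll_emb, dim=-1)

    if sim >= sim_threshold:
        # High similarity: assume the network extracted the right speaker.
        return estimate

    # Low similarity suggests target confusion: the network likely extracted
    # the interfering speaker, so subtracting its estimate from the mixture
    # approximately recovers the target (two-speaker assumption).
    return mixture - estimate

The subtraction step relies on the two-speaker assumption: if the extracted signal is the interferer, the mixture minus that signal is approximately the target, whereas with more interferers the residual would still contain the remaining speakers.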
Pages: 5333-5337
Page count: 5
Related Papers
50 records in total
  • [41] End-to-end speaker segmentation for overlap-aware resegmentation
    Bredin, Herve
    Laurent, Antoine
    INTERSPEECH 2021, 2021, : 3111 - 3115
  • [42] Robust End-to-end Speaker Diarization with Generic Neural Clustering
    Yang, Chenyu
    Wang, Yu
    INTERSPEECH 2022, 2022, : 1471 - 1475
  • [43] Strategies for End-to-End Text-Independent Speaker Verification
    Lin, Weiwei
    Mak, Man-Wai
    Chien, Jen-Tzung
    INTERSPEECH 2020, 2020, : 4308 - 4312
  • [44] End-to-end recurrent denoising autoencoder embeddings for speaker identification
    Rituerto-Gonzalez, Esther
    Pelaez-Moreno, Carmen
    NEURAL COMPUTING & APPLICATIONS, 2021, 33 (21): 14429 - 14439
  • [45] SVSNet: An End-to-End Speaker Voice Similarity Assessment Model
    Hu, Cheng-Hung
    Peng, Yu-Huai
    Yamagishi, Junichi
    Tsao, Yu
    Wang, Hsin-Min
    IEEE SIGNAL PROCESSING LETTERS, 2022, 29 : 767 - 771
  • [46] End-to-end framework for spoof-aware speaker verification
    Kang, Woo Hyun
    Alam, Jahangir
    Fathan, Abderrahim
    INTERSPEECH 2022, 2022, : 4362 - 4366
  • [47] END-TO-END MULTI-SPEAKER SPEECH RECOGNITION WITH TRANSFORMER
    Chang, Xuankai
    Zhang, Wangyou
    Qian, Yanmin
    Le Roux, Jonathan
    Watanabe, Shinji
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 6134 - 6138
  • [48] End-To-End Phonetic Neural Network Approach for Speaker Verification
    Demirbag, Sedat
    Erden, Mustafa
    Arslan, Levent
    2020 28TH SIGNAL PROCESSING AND COMMUNICATIONS APPLICATIONS CONFERENCE (SIU), 2020,
  • [49] END-TO-END NEURAL SPEAKER DIARIZATION WITH SELF-ATTENTION
    Fujita, Yusuke
    Kanda, Naoyuki
    Horiguchi, Shota
    Xue, Yawen
    Nagamatsu, Kenji
    Watanabe, Shinji
    2019 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU 2019), 2019, : 296 - 303
  • [50] End-to-End Multilingual Multi-Speaker Speech Recognition
    Seki, Hiroshi
    Hori, Takaaki
    Watanabe, Shinji
    Le Roux, Jonathan
    Hershey, John R.
    INTERSPEECH 2019, 2019, : 3755 - 3759