SPEAKER-CONDITIONING SINGLE-CHANNEL TARGET SPEAKER EXTRACTION USING CONFORMER-BASED ARCHITECTURES

Cited by: 1
Authors
Sinha, Ragini [1]
Tammen, Marvin [2, 3]
Rollwage, Christian [1]
Doclo, Simon [1, 2, 3]
Affiliations
[1] Fraunhofer Inst Digital Media Technol IDMT, Oldenburg Branch Hearing Speech & Audio Technol H, Ilmenau, Germany
[2] Carl von Ossietzky Univ Oldenburg, Dept Med Phys & Acoust, Oldenburg, Germany
[3] Carl von Ossietzky Univ Oldenburg, Cluster Excellence Hearing4all, Oldenburg, Germany
Keywords
target speaker extraction; multi-task learning; TCN; attention; conformer
DOI
10.1109/IWAENC53105.2022.9914691
Chinese Library Classification (CLC) number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Target speaker extraction aims to extract the target speaker's speech from a mixture of multiple speakers by exploiting auxiliary information about the target speaker. In this paper, we consider a complete time-domain target speaker extraction system consisting of a speaker embedder network and a speaker separator network, which are jointly trained in an end-to-end learning process. We propose two different architectures for the speaker separator network, both based on the convolution-augmented transformer (conformer). The first architecture uses stacks of conformer and external feed-forward blocks (Conformer-FFN), while the second uses stacks of temporal convolutional network (TCN) and conformer blocks (TCN-Conformer). Experimental results for 2-speaker mixtures, 3-speaker mixtures, and noisy 2-speaker mixtures show that, of the two proposed separator networks, the TCN-Conformer significantly improves target speaker extraction performance compared to both the Conformer-FFN and a TCN-based baseline system.
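The data flow described in the abstract (an embedder turns an enrollment signal into a speaker embedding, and a separator built from repeated block stacks processes the mixture conditioned on that embedding) can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration, not the authors' implementation: the function names, the one-dimensional embedding, the toy arithmetic inside each block, the order of blocks within a stack, and the number of stacks are all hypothetical. The blocks are simple placeholders rather than real TCN, conformer, or feed-forward layers, and no training is performed.

```python
from typing import Callable, List, Sequence

# A block maps (features, speaker embedding) to new features of the same length.
Block = Callable[[List[float], List[float]], List[float]]


def speaker_embedder(enrollment: Sequence[float]) -> List[float]:
    """Toy stand-in for the embedder: mean absolute amplitude as a 1-D embedding."""
    return [sum(abs(x) for x in enrollment) / max(1, len(enrollment))]


def tcn_like_block(features: List[float], embedding: List[float]) -> List[float]:
    """Toy local-context placeholder: 3-tap moving average with a speaker-dependent gain."""
    gain = 1.0 + embedding[0]
    padded = [features[0]] + features + [features[-1]]  # edge padding
    return [gain * sum(padded[i:i + 3]) / 3.0 for i in range(len(features))]


def conformer_like_block(features: List[float], embedding: List[float]) -> List[float]:
    """Toy global-context placeholder: shift every frame by the speaker-weighted mean."""
    mean = sum(features) / len(features)
    return [x - embedding[0] * mean for x in features]


def ffn_like_block(features: List[float], embedding: List[float]) -> List[float]:
    """Toy position-wise placeholder with a residual connection."""
    return [x + embedding[0] * max(0.0, x) for x in features]


def build_separator(kind: str, num_stacks: int) -> List[Block]:
    """Repeat one stack layout num_stacks times (block order within a stack is assumed)."""
    if kind == "conformer_ffn":
        per_stack = [conformer_like_block, ffn_like_block]
    elif kind == "tcn_conformer":
        per_stack = [tcn_like_block, conformer_like_block]
    else:
        raise ValueError(f"unknown separator kind: {kind}")
    return per_stack * num_stacks


def extract_target(mixture: Sequence[float],
                   enrollment: Sequence[float],
                   separator: List[Block]) -> List[float]:
    """Embed the enrollment signal, then run the (non-empty) mixture through every block."""
    embedding = speaker_embedder(enrollment)
    features = list(mixture)
    for block in separator:
        features = block(features, embedding)
    return features


# Made-up sample values, used only to exercise the pipeline.
mixture = [0.0, 0.5, -0.25, 1.0, -1.0, 0.25]
enrollment = [0.5, -0.5, 0.25, -0.25]
separator = build_separator("tcn_conformer", num_stacks=2)
estimate = extract_target(mixture, enrollment, separator)
```

Passing "conformer_ffn" instead of "tcn_conformer" to the hypothetical build_separator swaps the stack layout while leaving the embedder and the conditioning path unchanged, which mirrors how the abstract contrasts the two separator architectures within one extraction system.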
Pages: 5