SPEAKER-CONDITIONING SINGLE-CHANNEL TARGET SPEAKER EXTRACTION USING CONFORMER-BASED ARCHITECTURES

Cited by: 1
Authors
Sinha, Ragini [1]
Tammen, Marvin [2, 3]
Rollwage, Christian [1]
Doclo, Simon [1, 2, 3]
Affiliations
[1] Fraunhofer Inst Digital Media Technol IDMT, Oldenburg Branch Hearing Speech & Audio Technol H, Ilmenau, Germany
[2] Carl von Ossietzky Univ Oldenburg, Dept Med Phys & Acoust, Oldenburg, Germany
[3] Carl von Ossietzky Univ Oldenburg, Cluster Excellence Hearing4all, Oldenburg, Germany
Keywords
target speaker extraction; multi-task learning; TCN; attention; conformer;
DOI
10.1109/IWAENC53105.2022.9914691
Chinese Library Classification (CLC) number
O42 [Acoustics];
Discipline classification code
070206; 082403;
Abstract
Target speaker extraction aims to extract the target speaker from a mixture of multiple speakers by exploiting auxiliary information about the target speaker. In this paper, we consider a complete time-domain target speaker extraction system consisting of a speaker embedder network and a speaker separator network, which are jointly trained in an end-to-end learning process. We propose two different architectures for the speaker separator network, both based on the convolution-augmented transformer (conformer). The first architecture uses stacks of conformer and external feed-forward blocks (Conformer-FFN), while the second architecture uses stacks of temporal convolutional network (TCN) and conformer blocks (TCN-Conformer). Experimental results for 2-speaker mixtures, 3-speaker mixtures, and noisy 2-speaker mixtures show that, of the two proposed separator networks, the TCN-Conformer significantly improves target speaker extraction performance compared to both the Conformer-FFN and a TCN-based baseline system.
Pages: 5