SPEAKER-CONDITIONING SINGLE-CHANNEL TARGET SPEAKER EXTRACTION USING CONFORMER-BASED ARCHITECTURES

Cited by: 1

Authors:

Sinha, Ragini [1]

Tammen, Marvin [2,3]

Rollwage, Christian [1]

Doclo, Simon [1,2,3]

Affiliations:

[1] Fraunhofer Inst Digital Media Technol IDMT, Oldenburg Branch Hearing Speech & Audio Technol H, Ilmenau, Germany

[2] Carl von Ossietzky Univ Oldenburg, Dept Med Phys & Acoust, Oldenburg, Germany

[3] Carl von Ossietzky Univ Oldenburg, Cluster Excellence Hearing4all, Oldenburg, Germany

Source:

2022 INTERNATIONAL WORKSHOP ON ACOUSTIC SIGNAL ENHANCEMENT (IWAENC 2022) | 2022

Keywords:

target speaker extraction; multi-task learning; TCN; attention; conformer

DOI:

10.1109/IWAENC53105.2022.9914691

Chinese Library Classification (CLC) number:

O42 [Acoustics]

Discipline classification codes:

070206; 082403

Abstract:

Target speaker extraction aims at extracting the target speaker from a mixture of multiple speakers by exploiting auxiliary information about the target speaker. In this paper, we consider a complete time-domain target speaker extraction system consisting of a speaker embedder network and a speaker separator network, which are jointly trained in an end-to-end learning process. We propose two different architectures for the speaker separator network, both based on the convolution-augmented transformer (conformer). The first architecture uses stacks of conformer and external feed-forward blocks (Conformer-FFN), while the second architecture uses stacks of temporal convolutional network (TCN) and conformer blocks (TCN-Conformer). Experimental results for 2-speaker mixtures, 3-speaker mixtures, and noisy 2-speaker mixtures show that, among the proposed separator networks, the TCN-Conformer significantly improves the target speaker extraction performance compared to the Conformer-FFN and a TCN-based baseline system.


Pages: 5
