ZERO-SHOT PERSONALIZED SPEECH ENHANCEMENT THROUGH SPEAKER-INFORMED MODEL SELECTION

Cited by: 4

Authors:

Sivaraman, Aswin [1]

Kim, Minje [1]

Affiliations:

[1] Indiana Univ, Dept Intelligent Syst Engn, Bloomington, IN 47405 USA

Source:

2021 IEEE WORKSHOP ON APPLICATIONS OF SIGNAL PROCESSING TO AUDIO AND ACOUSTICS (WASPAA) | 2021

Funding:

National Science Foundation (USA);

Keywords:

Speech enhancement; deep learning; adaptive mixture of local experts; model compression by selection; NEURAL-NETWORKS; ADAPTATION;

DOI:

10.1109/WASPAA52581.2021.9632752

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification codes:

070206 ; 082403 ;

Abstract:

This paper presents a novel zero-shot learning approach towards personalized speech enhancement through the use of a sparsely active ensemble model. Optimizing speech denoising systems towards a particular test-time speaker can improve performance and reduce run-time complexity. However, test-time model adaptation may be challenging if collecting data from the test-time speaker is not possible. To this end, we propose using an ensemble model wherein each specialist module denoises noisy utterances from a distinct partition of training set speakers. The gating module inexpensively estimates test-time speaker characteristics in the form of an embedding vector and selects the most appropriate specialist module for denoising the test signal. Grouping the training set speakers into non-overlapping semantically similar groups is non-trivial and ill-defined. To do this, we first train a Siamese network using noisy speech pairs to maximize or minimize the similarity of its output vectors depending on whether the utterances derive from the same speaker or not. Next, we perform k-means clustering on the latent space formed by the averaged embedding vectors per training set speaker. In this way, we designate speaker groups and train specialist modules optimized around partitions of the complete training set. Our experiments show that ensemble models made up of low-capacity specialists can out-perform high-capacity generalist models with greater efficiency and improved adaptation towards unseen test-time speakers.
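The grouping and selection steps described in the abstract can be illustrated with a minimal sketch. It assumes that utterance-level embeddings have already been produced by a trained Siamese network (not shown here), and it uses nearest-centroid Euclidean distance as the gating rule, which is an assumption rather than a detail confirmed by the abstract. All function names (`speaker_means`, `kmeans`, `nearest`) and the toy 2-D data are hypothetical, and the code uses only the Python standard library to stay self-contained.

```python
import math
import random
from collections import defaultdict


def mean_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def speaker_means(utterance_embeddings):
    """Average utterance-level embeddings per training speaker.

    `utterance_embeddings` is a list of (speaker_id, embedding) pairs.
    """
    by_speaker = defaultdict(list)
    for speaker_id, emb in utterance_embeddings:
        by_speaker[speaker_id].append(emb)
    return {s: mean_vector(v) for s, v in by_speaker.items()}


def nearest(point, centroids):
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda j: math.dist(point, centroids[j]))


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns the list of k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Keep the previous centroid if a cluster ends up empty.
        centroids = [mean_vector(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return centroids


# Toy 2-D embeddings (hypothetical): speakers "a*" sit near (0, 0),
# speakers "b*" near (10, 10).
train = [
    ("a1", [0.0, 0.2]), ("a1", [0.2, 0.0]),
    ("a2", [0.5, 0.4]), ("a2", [0.3, 0.6]),
    ("b1", [10.0, 9.8]), ("b1", [9.8, 10.2]),
    ("b2", [10.4, 10.1]), ("b2", [10.2, 9.9]),
]
means = speaker_means(train)                    # one averaged vector per speaker
centroids = kmeans(list(means.values()), k=2)   # k speaker groups
groups = {s: nearest(m, centroids) for s, m in means.items()}

# Test time: embed the unseen speaker's noisy utterance, then pick the
# specialist whose speaker group has the closest centroid.
chosen = nearest([9.5, 10.5], centroids)
print(groups, chosen)
```

In this sketch each cluster index stands in for one specialist denoiser trained on that group's speakers; only the selected specialist would run at test time, which is where the efficiency gain of a sparsely active ensemble comes from.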


Pages: 171-175

Page count: 5

