ZERO-SHOT PERSONALIZED SPEECH ENHANCEMENT THROUGH SPEAKER-INFORMED MODEL SELECTION

Times cited: 4
Authors
Sivaraman, Aswin [1]
Kim, Minje [1]
Affiliations
[1] Indiana Univ, Dept Intelligent Syst Engn, Bloomington, IN 47405 USA
Funding
U.S. National Science Foundation;
Keywords
Speech enhancement; deep learning; adaptive mixture of local experts; model compression by selection; NEURAL-NETWORKS; ADAPTATION;
DOI
10.1109/WASPAA52581.2021.9632752
Chinese Library Classification number
O42 [Acoustics];
Discipline classification code
070206 ; 082403 ;
Abstract
This paper presents a novel zero-shot learning approach towards personalized speech enhancement through the use of a sparsely active ensemble model. Optimizing speech denoising systems towards a particular test-time speaker can improve performance and reduce run-time complexity. However, test-time model adaptation may be challenging if collecting data from the test-time speaker is not possible. To this end, we propose using an ensemble model wherein each specialist module denoises noisy utterances from a distinct partition of training set speakers. The gating module inexpensively estimates test-time speaker characteristics in the form of an embedding vector and selects the most appropriate specialist module for denoising the test signal. Grouping the training set speakers into non-overlapping semantically similar groups is non-trivial and ill-defined. To do this, we first train a Siamese network using noisy speech pairs to maximize or minimize the similarity of its output vectors depending on whether the utterances derive from the same speaker or not. Next, we perform k-means clustering on the latent space formed by the averaged embedding vectors per training set speaker. In this way, we designate speaker groups and train specialist modules optimized around partitions of the complete training set. Our experiments show that ensemble models made up of low-capacity specialists can outperform high-capacity generalist models with greater efficiency and improved adaptation towards unseen test-time speakers.
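The grouping and selection steps described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes the utterance embeddings have already been produced by some trained speaker encoder, uses plain Euclidean distance, and picks the specialist by nearest centroid, which is only one plausible reading of "selects the most appropriate specialist module". All function names here (`average_speaker_embeddings`, `kmeans`, `select_specialist`) are hypothetical.

```python
import numpy as np


def average_speaker_embeddings(utt_embeddings, speaker_ids):
    """Average utterance-level embeddings to get one vector per speaker."""
    ids = np.asarray(speaker_ids)
    speakers = sorted(set(speaker_ids))
    means = np.stack([utt_embeddings[ids == s].mean(axis=0) for s in speakers])
    return speakers, means


def nearest_centroid(points, centroids):
    """Index of the closest centroid (Euclidean) for each point."""
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=-1)
    return dists.argmin(axis=1)


def kmeans(points, k, n_iters=50, seed=0):
    """Basic Lloyd's k-means; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Assumption: initialise centroids from k distinct data points.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        labels = nearest_centroid(points, centroids)
        # Keep the old centroid if a cluster happens to be empty.
        updated = np.stack([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return centroids, nearest_centroid(points, centroids)


def select_specialist(test_embedding, centroids):
    """Gating step (assumed): choose the specialist whose centroid is nearest."""
    return int(nearest_centroid(test_embedding[None, :], centroids)[0])


# Toy 2-D "embeddings" for four training speakers, two utterances each.
toy_embeddings = np.array([
    [0.0, 0.1], [0.1, 0.0],      # speaker "a"
    [0.2, 0.1], [0.1, 0.2],      # speaker "b"
    [10.0, 10.1], [10.1, 10.0],  # speaker "c"
    [9.9, 10.0], [10.0, 9.9],    # speaker "d"
])
toy_speakers = ["a", "a", "b", "b", "c", "c", "d", "d"]

speakers, speaker_means = average_speaker_embeddings(toy_embeddings, toy_speakers)
centroids, groups = kmeans(speaker_means, k=2)
chosen = select_specialist(np.array([9.5, 9.5]), centroids)
print(dict(zip(speakers, groups.tolist())), "-> specialist", chosen)
```

In this toy setup each cluster index would correspond to one specialist denoiser trained only on that group's speakers; at test time only the selected specialist runs, which is where the efficiency claim in the abstract comes from.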
Pages: 171-175
Number of pages: 5