ZERO-SHOT PERSONALIZED SPEECH ENHANCEMENT THROUGH SPEAKER-INFORMED MODEL SELECTION

Cited by: 4
Authors
Sivaraman, Aswin [1 ]
Kim, Minje [1 ]
Affiliations
[1] Indiana Univ, Dept Intelligent Syst Engn, Bloomington, IN 47405 USA
Funding
US National Science Foundation
Keywords
Speech enhancement; deep learning; adaptive mixture of local experts; model compression by selection; neural networks; adaptation
DOI
10.1109/WASPAA52581.2021.9632752
CLC Classification Number
O42 [Acoustics]
Discipline Classification Codes
070206; 082403
Abstract
This paper presents a novel zero-shot learning approach to personalized speech enhancement through the use of a sparsely active ensemble model. Optimizing a speech denoising system towards a particular test-time speaker can improve performance and reduce run-time complexity. However, test-time model adaptation may be challenging if collecting data from the test-time speaker is not possible. To this end, we propose an ensemble model wherein each specialist module denoises noisy utterances from a distinct partition of training-set speakers. A gating module inexpensively estimates test-time speaker characteristics in the form of an embedding vector and selects the most appropriate specialist module for denoising the test signal. Grouping the training-set speakers into non-overlapping, semantically similar groups is non-trivial and ill-defined. To do this, we first train a Siamese network using noisy speech pairs to maximize or minimize the similarity of its output vectors depending on whether the utterances derive from the same speaker or not. Next, we perform k-means clustering on the latent space formed by the averaged embedding vectors per training-set speaker. In this way, we designate speaker groups and train specialist modules optimized around partitions of the complete training set. Our experiments show that ensemble models made up of low-capacity specialists can outperform high-capacity generalist models with greater efficiency and improved adaptation towards unseen test-time speakers.
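For illustration, the following is a minimal Python sketch of the speaker-informed selection pipeline described in the abstract: a Siamese-style encoder produces utterance-level speaker embeddings, per-speaker averaged embeddings are clustered with k-means to define specialist partitions, and at test time the gating step embeds the noisy input and routes it to the specialist of the nearest cluster. The module names (SpeakerEncoder, SpecialistDenoiser), layer sizes, feature dimension, and the use of scikit-learn's KMeans are assumptions for illustration only, not the authors' exact architecture or training recipe; the Siamese contrastive training itself is omitted.

# Sketch only: architectures, dimensions, and KMeans usage are illustrative
# placeholders, not the configuration reported in the paper.
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

EMB_DIM, FEAT_DIM, N_CLUSTERS = 64, 257, 4  # hypothetical sizes

class SpeakerEncoder(nn.Module):
    """Stand-in for the Siamese-trained gating encoder."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(FEAT_DIM, 128), nn.ReLU(),
                                 nn.Linear(128, EMB_DIM))
    def forward(self, x):            # x: (frames, FEAT_DIM) magnitude spectra
        return self.net(x).mean(0)   # utterance-level embedding, shape (EMB_DIM,)

class SpecialistDenoiser(nn.Module):
    """Stand-in for one low-capacity specialist (a masking-based denoiser)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(FEAT_DIM, 128), nn.ReLU(),
                                 nn.Linear(128, FEAT_DIM), nn.Sigmoid())
    def forward(self, x):
        return x * self.net(x)       # masked (denoised) spectrogram

def cluster_speakers(encoder, speaker_utts):
    """Cluster training speakers by their averaged embeddings.
    speaker_utts: dict speaker_id -> list of (frames, FEAT_DIM) tensors."""
    ids, mean_embs = list(speaker_utts), []
    with torch.no_grad():
        for sid in ids:
            embs = torch.stack([encoder(u) for u in speaker_utts[sid]])
            mean_embs.append(embs.mean(0).numpy())
    km = KMeans(n_clusters=N_CLUSTERS, n_init=10).fit(mean_embs)
    return km, dict(zip(ids, km.labels_))   # cluster model + speaker -> group id

def denoise(encoder, kmeans, specialists, noisy):
    """Gate the noisy utterance to one specialist and run only that module."""
    with torch.no_grad():
        emb = encoder(noisy).numpy()[None, :]
        k = int(kmeans.predict(emb)[0])      # closest speaker group
        return specialists[k](noisy)

After clustering, one SpecialistDenoiser per group would be trained only on that group's speakers; at test time only the selected specialist is executed, which is where the run-time savings over a single high-capacity generalist come from.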
Pages: 171-175
Number of pages: 5