ZERO-SHOT PERSONALIZED SPEECH ENHANCEMENT THROUGH SPEAKER-INFORMED MODEL SELECTION

Cited by: 4
Authors
Sivaraman, Aswin [1 ]
Kim, Minje [1 ]
Affiliations
[1] Indiana Univ, Dept Intelligent Syst Engn, Bloomington, IN 47405 USA
Funding
US National Science Foundation
Keywords
Speech enhancement; deep learning; adaptive mixture of local experts; model compression by selection; neural networks; adaptation
DOI
10.1109/WASPAA52581.2021.9632752
CLC Classification Number
O42 [Acoustics]
Discipline Classification Codes
070206; 082403
Abstract
This paper presents a novel zero-shot learning approach to personalized speech enhancement through the use of a sparsely active ensemble model. Optimizing a speech denoising system towards a particular test-time speaker can improve performance and reduce run-time complexity. However, test-time model adaptation may be challenging if collecting data from the test-time speaker is not possible. To this end, we propose an ensemble model wherein each specialist module denoises noisy utterances from a distinct partition of training-set speakers. A gating module inexpensively estimates test-time speaker characteristics in the form of an embedding vector and selects the most appropriate specialist module for denoising the test signal. Grouping the training-set speakers into non-overlapping, semantically similar groups is non-trivial and ill-defined. To do this, we first train a Siamese network using noisy speech pairs to maximize or minimize the similarity of its output vectors depending on whether the utterances derive from the same speaker or not. Next, we perform k-means clustering on the latent space formed by the averaged embedding vectors per training-set speaker. In this way, we designate speaker groups and train specialist modules optimized around partitions of the complete training set. Our experiments show that ensemble models made up of low-capacity specialists can outperform high-capacity generalist models with greater efficiency and improved adaptation towards unseen test-time speakers.
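For illustration, the following is a minimal Python sketch of the speaker-informed selection pipeline described in the abstract: a Siamese-style encoder produces utterance-level speaker embeddings, per-speaker averaged embeddings are clustered with k-means to define specialist partitions, and at test time the gating step embeds the noisy input and routes it to the specialist of the nearest cluster. The module names (SpeakerEncoder, SpecialistDenoiser), layer sizes, feature dimension, and the use of scikit-learn's KMeans are assumptions for illustration only, not the authors' exact architecture or training recipe; the Siamese contrastive training itself is omitted.

# Sketch only: architectures, dimensions, and KMeans usage are illustrative
# placeholders, not the configuration reported in the paper.
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

EMB_DIM, FEAT_DIM, N_CLUSTERS = 64, 257, 4  # hypothetical sizes

class SpeakerEncoder(nn.Module):
    """Stand-in for the Siamese-trained gating encoder."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(FEAT_DIM, 128), nn.ReLU(),
                                 nn.Linear(128, EMB_DIM))
    def forward(self, x):            # x: (frames, FEAT_DIM) magnitude spectra
        return self.net(x).mean(0)   # utterance-level embedding, shape (EMB_DIM,)

class SpecialistDenoiser(nn.Module):
    """Stand-in for one low-capacity specialist (a masking-based denoiser)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(FEAT_DIM, 128), nn.ReLU(),
                                 nn.Linear(128, FEAT_DIM), nn.Sigmoid())
    def forward(self, x):
        return x * self.net(x)       # masked (denoised) spectrogram

def cluster_speakers(encoder, speaker_utts):
    """Cluster training speakers by their averaged embeddings.
    speaker_utts: dict speaker_id -> list of (frames, FEAT_DIM) tensors."""
    ids, mean_embs = list(speaker_utts), []
    with torch.no_grad():
        for sid in ids:
            embs = torch.stack([encoder(u) for u in speaker_utts[sid]])
            mean_embs.append(embs.mean(0).numpy())
    km = KMeans(n_clusters=N_CLUSTERS, n_init=10).fit(mean_embs)
    return km, dict(zip(ids, km.labels_))   # cluster model + speaker -> group id

def denoise(encoder, kmeans, specialists, noisy):
    """Gate the noisy utterance to one specialist and run only that module."""
    with torch.no_grad():
        emb = encoder(noisy).numpy()[None, :]
        k = int(kmeans.predict(emb)[0])      # closest speaker group
        return specialists[k](noisy)

After clustering, one SpecialistDenoiser per group would be trained only on that group's speakers; at test time only the selected specialist is executed, which is where the run-time savings over a single high-capacity generalist come from.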
Pages: 171-175
Number of pages: 5