Normalization Driven Zero-shot Multi-Speaker Speech Synthesis

Cited by: 7

Authors:

Kumar, Neeraj [1,2]

Goel, Srishti [1]

Narang, Ankur [1]

Lall, Brejesh [2]

Affiliations:

[1] Hike Private Ltd, New Delhi, India

[2] Indian Inst Technol, Delhi, India

Source:

INTERSPEECH 2021 | 2021

Keywords:

Speech synthesis; normalization; transfer learning; wav2vec2.0 based speaker encoder; angular softmax

DOI:

10.21437/Interspeech.2021-441

Chinese Library Classification (CLC) codes:

R36 [Pathology]; R76 [Otorhinolaryngology]

Discipline classification codes:

100104; 100213

Abstract:

In this paper, we present a novel zero-shot multi-speaker speech synthesis approach (ZSM-SS) that leverages the normalization architecture and speaker encoder with non-autoregressive multi-head attention driven encoder-decoder architecture. Given an input text and a reference speech sample of an unseen person, ZSM-SS can generate speech in that person's style in a zero-shot manner. Additionally, we demonstrate how the affine parameters of normalization help in capturing the prosodic features such as energy and fundamental frequency in a disentangled fashion and can be used to generate morphed speech output. We demonstrate the efficacy of our proposed architecture on multi-speaker VCTK [1] and LibriTTS [2] datasets, using multiple quantitative metrics that measure generated speech distortion and MOS, along with speaker embedding analysis of the proposed speaker encoder model.


Pages: 1354-1358

Page count: 5

50 items in total

[1] Zero-Shot Normalization Driven Multi-Speaker Text to Speech Synthesis
Kumar, Neeraj
Narang, Ankur
Lall, Brejesh
IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2022, 30 : 1679 - 1693
[2] Towards Zero-Shot Multi-Speaker Multi-Accent Text-to-Speech Synthesis
Zhang, Mingyang
Zhou, Xuehao
Wu, Zhizheng
Li, Haizhou
IEEE SIGNAL PROCESSING LETTERS, 2023, 30 : 947 - 951
[3] Enhancing Zero-Shot Multi-Speaker TTS with Negated Speaker Representations
Jeon, Yejin
Kim, Yunsu
Lee, Gary Geunbae
THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 16, 2024, : 18336 - 18344
[4] YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for Everyone
Casanova, Edresson
Weber, Julian
Shulby, Christopher
Candido Junior, Arnaldo
Goelge, Eren
Ponti, Moacir Antonelli
INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 162, 2022,
[5] ZERO-SHOT MULTI-SPEAKER TEXT-TO-SPEECH WITH STATE-OF-THE-ART NEURAL SPEAKER EMBEDDINGS
Cooper, Erica
Lai, Cheng-I
Yasuda, Yusuke
Fang, Fuming
Wang, Xin
Chen, Nanxin
Yamagishi, Junichi
2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 6184 - 6188
[6] Pruning Self-Attention for Zero-Shot Multi-Speaker Text-to-Speech
Yoon, Hyungchan
Kim, Changhwan
Song, Eunwoo
Yoon, Hyun-Wook
Kang, Hong-Goo
INTERSPEECH 2023, 2023, : 4299 - 4303
[7] Effective Zero-Shot Multi-Speaker Text-to-Speech Technique Using Information Perturbation and a Speaker Encoder
Bang, Chae-Woon
Chun, Chanjun
SENSORS, 2023, 23 (23)
[8] SC-GlowTTS: an Efficient Zero-Shot Multi-Speaker Text-To-Speech Model
Casanova, Edresson
Shulby, Christopher
Golge, Eren
Muller, Nicolas Michael
de Oliveira, Frederico Santos
Candido Junior, Arnaldo
Soares, Anderson da Silva
Aluisio, Sandra Maria
Ponti, Moacir Antonelli
INTERSPEECH 2021, 2021, : 3645 - 3649
[9] NNSPEECH: SPEAKER-GUIDED CONDITIONAL VARIATIONAL AUTOENCODER FOR ZERO-SHOT MULTI-SPEAKER TEXT-TO-SPEECH
Zhao, Botao
Zhang, Xulong
Wang, Jianzong
Cheng, Ning
Xiao, Jing
2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 4293 - 4297
[10] Multi-Scale Speaker Vectors for Zero-Shot Speech Synthesis
Cory, Tristin
Iqbal, Razib
2022 IEEE 46TH ANNUAL COMPUTERS, SOFTWARE, AND APPLICATIONS CONFERENCE (COMPSAC 2022), 2022, : 496 - 501
