CopyCat: Many-to-Many Fine-Grained Prosody Transfer for Neural Text-to-Speech

Cited by: 43

Authors:

Karlapati, Sri [1]

Moinet, Alexis [1]

Joly, Arnaud [1]

Klimkov, Viacheslav [1]

Sciez-Trigueros, Daniel [1]

Drugman, Thomas [1]

Affiliations:

[1] Amazon Res, Cambridge, England

Source:

INTERSPEECH 2020 | 2020

Keywords:

Neural text-to-speech; fine-grained prosody transfer; many-to-many prosody transfer

DOI:

10.21437/Interspeech.2020-1251

Chinese Library Classification (CLC) number:

R36 [Pathology]; R76 [Otorhinolaryngology]

Discipline classification code:

100104; 100213

Abstract:

Prosody Transfer (PT) is a technique that aims to use the prosody from a source audio as a reference while synthesising speech. Fine-grained PT aims at capturing prosodic aspects like rhythm, emphasis, melody, duration, and loudness, from a source audio at a very granular level and transferring them when synthesising speech in a different target speaker's voice. Current approaches for fine-grained PT suffer from source speaker leakage, where the synthesised speech has the voice identity of the source speaker as opposed to the target speaker. In order to mitigate this issue, they compromise on the quality of PT. In this paper, we propose CopyCat, a novel, many-to-many PT system that is robust to source speaker leakage, without using parallel data. We achieve this through a novel reference encoder architecture capable of capturing temporal prosodic representations which are robust to source speaker leakage. We compare CopyCat against a state-of-the-art fine-grained PT model through various subjective evaluations, where we show a relative improvement of 47% in the quality of prosody transfer and 14% in preserving the target speaker identity, while still maintaining the same naturalness.


Pages: 4387-4391

Number of pages: 5
