Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron

Cited by: 0

Authors:

Skerry-Ryan, R. J. ^{[1]}

Battenberg, Eric ^{[1]}

Xiao, Ying ^{[1]}

Wang, Yuxuan ^{[1]}

Stanton, Daisy ^{[1]}

Shor, Joel ^{[1]}

Weiss, Ron J. ^{[1]}

Clark, Rob ^{[1]}

Saurous, Rif A. ^{[1]}

Affiliation:

[1] Google Inc, Mountain View, CA 94043 USA

Source:

INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 80 | 2018 / Vol. 80

Keywords:

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

We present an extension to the Tacotron speech synthesis architecture that learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody. We show that conditioning Tacotron on this learned embedding space results in synthesized audio that matches the prosody of the reference signal with fine time detail even when the reference and synthesis speakers are different. Additionally, we show that a reference prosody embedding can be used to synthesize text that is different from that of the reference utterance. We define several quantitative and subjective metrics for evaluating prosody transfer, and report results with accompanying audio samples from single-speaker and 44-speaker Tacotron models on a prosody transfer task.


Pages: 10
