Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron

Cited by: 0
Authors
Skerry-Ryan, R. J. [1 ]
Battenberg, Eric [1 ]
Xiao, Ying [1 ]
Wang, Yuxuan [1 ]
Stanton, Daisy [1 ]
Shor, Joel [1 ]
Weiss, Ron J. [1 ]
Clark, Rob [1 ]
Saurous, Rif A. [1 ]
Affiliations
[1] Google Inc, Mountain View, CA 94043 USA
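Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial Intelligence Theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract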
We present an extension to the Tacotron speech synthesis architecture that learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody. We show that conditioning Tacotron on this learned embedding space results in synthesized audio that matches the prosody of the reference signal with fine time detail even when the reference and synthesis speakers are different. Additionally, we show that a reference prosody embedding can be used to synthesize text that is different from that of the reference utterance. We define several quantitative and subjective metrics for evaluating prosody transfer, and report results with accompanying audio samples from single-speaker and 44-speaker Tacotron models on a prosody transfer task.
Pages: 10
Related papers
50 in total
  • [21] PROSODIC CLUSTERING FOR PHONEME-LEVEL PROSODY CONTROL IN END-TO-END SPEECH SYNTHESIS
    Vioni, Alexandra
    Christidou, Myrsini
    Ellinas, Nikolaos
    Vamvoukakis, Georgios
    Kakoulidis, Panos
    Kim, Taehoon
    Sung, June Sig
    Park, Hyoungmin
    Chalamandaris, Aimilios
    Tsiakoulis, Pirros
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 5719 - 5723
  • [22] Myanmar Text-to-Speech System based on Tacotron (End-to-End Generative Model)
    Win, Yuzana
    Lwin, Htoo Pyae
    Masada, Tomonari
    11TH INTERNATIONAL CONFERENCE ON ICT CONVERGENCE: DATA, NETWORK, AND AI IN THE AGE OF UNTACT (ICTC 2020), 2020, : 572 - 577
  • [23] BI-LEVEL STYLE AND PROSODY DECOUPLING MODELING FOR PERSONALIZED END-TO-END SPEECH SYNTHESIS
    Fu, Ruibo
    Tao, Jianhua
    Wen, Zhengqi
    Yi, Jiangyan
    Wang, Tao
    Qiang, Chunyu
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 6568 - 6572
  • [24] IMPROVING PROSODY MODELLING WITH CROSS-UTTERANCE BERT EMBEDDINGS FOR END-TO-END SPEECH SYNTHESIS
    Xu, Guanghui
    Song, Wei
    Zhang, Zhengchen
    Zhang, Chao
    He, Xiaodong
    Zhou, Bowen
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 6079 - 6083
  • [25] Towards End-to-End Synthetic Speech Detection
    Hua, Guang
    Teoh, Andrew Beng Jin
    Zhang, Haijian
    IEEE SIGNAL PROCESSING LETTERS, 2021, 28 (28) : 1265 - 1269
  • [26] End-to-End Binaural Speech Synthesis
    Huang, Wen-Chin
    Markovic, Dejan
    Gebru, Israel D.
    Menon, Anjali
    Richard, Alexander
    INTERSPEECH 2022, 2022, : 1218 - 1222
  • [27] TOWARDS END-TO-END UNSUPERVISED SPEECH RECOGNITION
    Liu, Alexander H.
    Hsu, Wei-Ning
    Auli, Michael
    Baevski, Alexei
    2022 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT, 2022, : 221 - 228
  • [28] Speaker Adaptation Experiments with Limited Data for End-to-End Text-To-Speech Synthesis using Tacotron2
    Mandeel, Ali Raheem
    Al-Radhi, Mohammed Salah
    Csapo, Tamas Gabor
    INFOCOMMUNICATIONS JOURNAL, 2022, 14 (03) : 55 - 62
  • [29] SpeechTripleNet: End-to-End Disentangled Speech Representation Learning for Content, Timbre and Prosody
    Lu, Hui
    Wu, Xixin
    Wu, Zhiyong
    Meng, Helen
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 2829 - 2837
  • [30] Towards End-to-End Speech-to-Text Summarization
    Monteiro, Raul
    Pernes, Diogo
    TEXT, SPEECH, AND DIALOGUE, TSD 2023, 2023, 14102 : 304 - 316