Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron

Cited by: 0
Authors
Skerry-Ryan, R. J. [1 ]
Battenberg, Eric [1 ]
Xiao, Ying [1 ]
Wang, Yuxuan [1 ]
Stanton, Daisy [1 ]
Shor, Joel [1 ]
Weiss, Ron J. [1 ]
Clark, Rob [1 ]
Saurous, Rif A. [1 ]
Affiliations
[1] Google Inc, Mountain View, CA 94043 USA
Source
INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 80, 2018, 80
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We present an extension to the Tacotron speech synthesis architecture that learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody. We show that conditioning Tacotron on this learned embedding space results in synthesized audio that matches the prosody of the reference signal with fine time detail even when the reference and synthesis speakers are different. Additionally, we show that a reference prosody embedding can be used to synthesize text that is different from that of the reference utterance. We define several quantitative and subjective metrics for evaluating prosody transfer, and report results with accompanying audio samples from single-speaker and 44-speaker Tacotron models on a prosody transfer task.
Pages: 10
Related papers
50 in total
  • [31] End-to-End Speech Synthesis for Tibetan Multidialect
    Xu, Xiaona
    Yang, Li
    Zhao, Yue
    Wang, Hui
    COMPLEXITY, 2021, 2021
  • [32] Improving transfer of expressivity for end-to-end multispeaker text-to-speech synthesis
    Kulkarni, Ajinkya
    Colotte, Vincent
    Jouvet, Denis
    29TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO 2021), 2021, : 31 - 35
  • [33] Cross-Speaker Emotion Disentangling and Transfer for End-to-End Speech Synthesis
    Li, Tao
    Wang, Xinsheng
    Xie, Qicong
    Wang, Zhichao
    Xie, Lei
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2022, 30 : 1448 - 1460
  • [34] End-to-end Indonesian Speech Synthesis Based On Transfer Learning And Alternate Training
    Lu, Yu
    Yang, Jian
    Yang, Ruolin
    2021 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING (IALP), 2021, : 30 - 35
  • [35] LEARNING LATENT REPRESENTATIONS FOR STYLE CONTROL AND TRANSFER IN END-TO-END SPEECH SYNTHESIS
    Zhang, Ya-Jie
    Pan, Shifeng
    He, Lei
    Ling, Zhen-Hua
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 6945 - 6949
  • [36] End-to-end Tibetan emotional speech synthesis based on Mandarin emotions transfer
    Zhang, Weizhao
    Zhang, Wenxuan
    2024 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING, IALP 2024, 2024, : 320 - 325
  • [37] Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis
    Wang, Yuxuan
    Stanton, Daisy
    Zhang, Yu
    Skerry-Ryan, R. J.
    Battenberg, Eric
    Shor, Joel
    Xiao, Ying
    Ren, Fei
    Jia, Ye
    Saurous, Rif A.
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 80, 2018, 80
  • [38] TOWARDS LANGUAGE-UNIVERSAL END-TO-END SPEECH RECOGNITION
    Kim, Suyoun
    Seltzer, Michael L.
    2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 4914 - 4918
  • [39] Towards a Deep Understanding of Multilingual End-to-End Speech Translation
    Sun, Haoran
    Zhao, Xiaohu
    Lei, Yikun
    Zhu, Shaolin
    Xiong, Deyi
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EMNLP 2023), 2023, : 14332 - 14348
  • [40] Towards End-to-End Speech Recognition with Recurrent Neural Networks
    Graves, Alex
    Jaitly, Navdeep
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 32 (CYCLE 2), 2014, 32 : 1764 - 1772