Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron

Cited by: 0

Authors:

Skerry-Ryan, R. J. [1]

Battenberg, Eric [1]

Xiao, Ying [1]

Wang, Yuxuan [1]

Stanton, Daisy [1]

Shor, Joel [1]

Weiss, Ron J. [1]

Clark, Rob [1]

Saurous, Rif A. [1]

Affiliation:

[1] Google Inc, Mountain View, CA 94043 USA

Source:

INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 80 | 2018 / Vol. 80

Keywords:

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

We present an extension to the Tacotron speech synthesis architecture that learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody. We show that conditioning Tacotron on this learned embedding space results in synthesized audio that matches the prosody of the reference signal with fine time detail even when the reference and synthesis speakers are different. Additionally, we show that a reference prosody embedding can be used to synthesize text that is different from that of the reference utterance. We define several quantitative and subjective metrics for evaluating prosody transfer, and report results with accompanying audio samples from single-speaker and 44-speaker Tacotron models on a prosody transfer task.

Pages: 10

Total: 50 items

[31] End-to-End Speech Synthesis for Tibetan Multidialect
Xu, Xiaona
Yang, Li
Zhao, Yue
Wang, Hui
COMPLEXITY, 2021, 2021
[32] Improving transfer of expressivity for end-to-end multispeaker text-to-speech synthesis
Kulkarni, Ajinkya
Colotte, Vincent
Jouvet, Denis
29TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO 2021), 2021, : 31 - 35
[33] Cross-Speaker Emotion Disentangling and Transfer for End-to-End Speech Synthesis
Li, Tao
Wang, Xinsheng
Xie, Qicong
Wang, Zhichao
Xie, Lei
IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2022, 30 : 1448 - 1460
[34] End-to-end Indonesian Speech Synthesis Based On Transfer Learning And Alternate Training
Lu, Yu
Yang, Jian
Yang, Ruolin
2021 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING (IALP), 2021, : 30 - 35
[35] LEARNING LATENT REPRESENTATIONS FOR STYLE CONTROL AND TRANSFER IN END-TO-END SPEECH SYNTHESIS
Zhang, Ya-Jie
Pan, Shifeng
He, Lei
Ling, Zhen-Hua
2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 6945 - 6949
[36] End-to-end Tibetan emotional speech synthesis based on Mandarin emotions transfer
Zhang, Weizhao
Zhang, Wenxuan
2024 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING, IALP 2024, 2024, : 320 - 325
[37] Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis
Wang, Yuxuan
Stanton, Daisy
Zhang, Yu
Skerry-Ryan, R. J.
Battenberg, Eric
Shor, Joel
Xiao, Ying
Ren, Fei
Jia, Ye
Saurous, Rif A.
INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 80, 2018, 80
[38] TOWARDS LANGUAGE-UNIVERSAL END-TO-END SPEECH RECOGNITION
Kim, Suyoun
Seltzer, Michael L.
2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 4914 - 4918
[39] Towards a Deep Understanding of Multilingual End-to-End Speech Translation
Sun, Haoran
Zhao, Xiaohu
Lei, Yikun
Zhu, Shaolin
Xiong, Deyi
FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EMNLP 2023), 2023, : 14332 - 14348
[40] Towards End-to-End Speech Recognition with Recurrent Neural Networks
Graves, Alex
Jaitly, Navdeep
INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 32 (CYCLE 2), 2014, 32 : 1764 - 1772
