PROSODIC REPRESENTATION LEARNING AND CONTEXTUAL SAMPLING FOR NEURAL TEXT-TO-SPEECH

Cited by: 10

Authors:

Karlapati, Sri [1]

Abbas, Ammar [1]

Hodari, Zack [2]

Moinet, Alexis [1]

Joly, Arnaud [1]

Karanasou, Penny [1]

Drugman, Thomas [1]

Affiliations:

[1] Amazon Res, Cambridge, England

[2] Univ Edinburgh, Ctr Speech Technol Res, Edinburgh, Midlothian, Scotland

Source:

2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021) | 2021

Keywords:

TTS; prosody modelling; contextual prosody

DOI:

10.1109/ICASSP39728.2021.9413696

Chinese Library Classification number:

O42 [Acoustics]

Discipline classification codes:

070206; 082403

Abstract:

In this paper, we introduce Kathaka, a model trained with a novel two-stage training process for neural speech synthesis with contextually appropriate prosody. In Stage I, we learn a prosodic distribution at the sentence level from mel-spectrograms available during training. In Stage II, we propose a novel method to sample from this learnt prosodic distribution using the contextual information available in text. To do this, we use BERT on text, and graph-attention networks on parse trees extracted from text. We show a statistically significant relative improvement of 13.2% in naturalness over a strong baseline when compared to recordings. We also conduct an ablation study on variations of our sampling technique, and show a statistically significant improvement over the baseline in each case.

Pages: 6573-6577

Page count: 5

