PROSODIC REPRESENTATION LEARNING AND CONTEXTUAL SAMPLING FOR NEURAL TEXT-TO-SPEECH

Cited by: 10

Authors:

Karlapati, Sri ^{[1]}

Abbas, Ammar ^{[1]}

Hodari, Zack ^{[2]}

Moinet, Alexis ^{[1]}

Joly, Arnaud ^{[1]}

Karanasou, Penny ^{[1]}

Drugman, Thomas ^{[1]}

Affiliations:

[1] Amazon Res, Cambridge, England

[2] Univ Edinburgh, Ctr Speech Technol Res, Edinburgh, Midlothian, Scotland

Source:

2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021) | 2021

Keywords:

TTS; prosody modelling; contextual prosody;

DOI:

10.1109/ICASSP39728.2021.9413696

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification code:

070206 ; 082403 ;

Abstract:

In this paper, we introduce Kathaka, a model trained with a novel two-stage training process for neural speech synthesis with contextually appropriate prosody. In Stage I, we learn a prosodic distribution at the sentence level from mel-spectrograms available during training. In Stage II, we propose a novel method to sample from this learnt prosodic distribution using the contextual information available in text. To do this, we use BERT on text, and graph-attention networks on parse trees extracted from text. We show a statistically significant relative improvement of 13.2% in naturalness over a strong baseline when compared to recordings. We also conduct an ablation study on variations of our sampling technique, and show a statistically significant improvement over the baseline in each case.


Pages: 6573 - 6577

Number of pages: 5

50 items in total

[1] A prosodic Turkish text-to-speech synthesizer
Vural, E
Oflazer, K
PROCEEDINGS OF THE IEEE 12TH SIGNAL PROCESSING AND COMMUNICATIONS APPLICATIONS CONFERENCE, 2004, : 458 - 460
[2] Controllable neural text-to-speech synthesis using intuitive prosodic features
Raitio, Tuomo
Rasipuram, Ramya
Castellani, Dan
INTERSPEECH 2020, 2020, : 4432 - 4436
[3] A prosodic model for text-to-speech synthesis in French
Di Cristo, A
Di Cristo, P
Campione, E
Véronis, J
INTONATION: ANALYSIS, MODELLING AND TECHNOLOGY, 2000, 15 : 321 - 355
[4] A Prosodic Text-to-Speech System for Yoruba Language
Akinwonmi, Akintoba Emmanuel
Alese, Boniface Kayode
2013 8TH INTERNATIONAL CONFERENCE FOR INTERNET TECHNOLOGY AND SECURED TRANSACTIONS (ICITST), 2013, : 630 - 635
[5] Prosodic annotation in a Thai Text-to-speech system
Department of Electrical and Computer Engineering, Citadel, Military College of South Carolina, 171 Moultrie Street, Charleston, SC 29409, United States
PACLIC - Pacific Asia Conf. Lang., Inf. Comput., Proc., 2007, : 405 - 414
[6] ON GRANULARITY OF PROSODIC REPRESENTATIONS IN EXPRESSIVE TEXT-TO-SPEECH
Babianski, Mikolaj
Pokora, Kamil
Shah, Raahil
Sienkiewicz, Rafal
Korzekwa, Daniel
Klimkov, Viacheslav
2022 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT, 2022, : 892 - 899
[7] Prosodic Annotation in a Thai Text-to-speech System
Potisuk, Siripong
PACLIC 21: THE 21ST PACIFIC ASIA CONFERENCE ON LANGUAGE, INFORMATION AND COMPUTATION, PROCEEDINGS, 2007, : 405 - 414
[8] Increasing Prosodic Variability of Text-To-Speech Synthesizers
Nemeth, Geza
Fek, Mark
Csapo, Tamas Gabor
INTERSPEECH 2007: 8TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION, VOLS 1-4, 2007, : 1981 - 1984
[9] Speech synthesis for text-to-speech alignment and prosodic feature extraction
Malfrere, F
Dutoit, T
ISCAS '97 - PROCEEDINGS OF 1997 IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS, VOLS I - IV: CIRCUITS AND SYSTEMS IN THE INFORMATION AGE, 1997, : 2637 - 2640
[10] Prosodic boundary prediction model for Vietnamese text-to-speech
Trang, Nguyen Thi Thu
Ky, Nguyen Hoang
Rilliard, Albert
D'Alessandro, Christophe
Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2021, 5 : 3366 - 3370
