PROSODIC REPRESENTATION LEARNING AND CONTEXTUAL SAMPLING FOR NEURAL TEXT-TO-SPEECH

Cited by: 10
Authors
Karlapati, Sri [1]
Abbas, Ammar [1]
Hodari, Zack [2]
Moinet, Alexis [1]
Joly, Arnaud [1]
Karanasou, Penny [1]
Drugman, Thomas [1]
Affiliations
[1] Amazon Res, Cambridge, England
[2] Univ Edinburgh, Ctr Speech Technol Res, Edinburgh, Midlothian, Scotland
Keywords
TTS; prosody modelling; contextual prosody
DOI
10.1109/ICASSP39728.2021.9413696
Chinese Library Classification number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
In this paper, we introduce Kathaka, a model trained with a novel two-stage training process for neural speech synthesis with contextually appropriate prosody. In Stage I, we learn a prosodic distribution at the sentence level from mel-spectrograms available during training. In Stage II, we propose a novel method to sample from this learnt prosodic distribution using the contextual information available in text. To do this, we use BERT on text, and graph-attention networks on parse trees extracted from text. We show a statistically significant relative improvement of 13.2% in naturalness over a strong baseline when compared to recordings. We also conduct an ablation study on variations of our sampling technique, and show a statistically significant improvement over the baseline in each case.
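The abstract does not spell out how the relative improvement "when compared to recordings" is computed. One common reading, assumed here and not confirmed by this record, is the fraction of the baseline-to-recordings gap that the new system closes. The sketch below illustrates that reading; the function name `relative_gap_reduction` and all scores are hypothetical and are not taken from the paper.

```python
def relative_gap_reduction(baseline: float, system: float, reference: float) -> float:
    """Return the fraction of the baseline-to-reference gap closed by `system`.

    Assumed definition (not stated in the abstract):
        (system - baseline) / (reference - baseline)
    """
    gap = reference - baseline
    if gap <= 0:
        raise ValueError("reference score must exceed the baseline score")
    return (system - baseline) / gap


# Hypothetical mean listening-test scores on a 0-100 scale (illustrative only).
baseline_score = 60.0   # baseline TTS system
system_score = 62.5     # proposed system
recording_score = 80.0  # human recordings, used as the upper reference

improvement = relative_gap_reduction(baseline_score, system_score, recording_score)
print(f"Gap closed relative to recordings: {improvement:.1%}")
```

Under this reading, a figure such as 13.2% would mean the system recovers roughly one eighth of the distance between the baseline and human recordings, not that its raw score is 13.2% higher than the baseline's.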
Pages: 6573 - 6577
Number of pages: 5
Related papers
50 in total
  • [1] A prosodic Turkish text-to-speech synthesizer
    Vural, E
    Oflazer, K
    PROCEEDINGS OF THE IEEE 12TH SIGNAL PROCESSING AND COMMUNICATIONS APPLICATIONS CONFERENCE, 2004, : 458 - 460
  • [2] Controllable neural text-to-speech synthesis using intuitive prosodic features
    Raitio, Tuomo
    Rasipuram, Ramya
    Castellani, Dan
    INTERSPEECH 2020, 2020, : 4432 - 4436
  • [3] A prosodic model for text-to-speech synthesis in French
    Di Cristo, A
    Di Cristo, P
    Campione, E
    Véronis, J
    INTONATION: ANALYSIS, MODELLING AND TECHNOLOGY, 2000, 15 : 321 - 355
  • [4] A Prosodic Text-to-Speech System for Yoruba Language
    Akinwonmi, Akintoba Emmanuel
    Alese, Boniface Kayode
    2013 8TH INTERNATIONAL CONFERENCE FOR INTERNET TECHNOLOGY AND SECURED TRANSACTIONS (ICITST), 2013, : 630 - 635
  • [5] Prosodic annotation in a Thai Text-to-speech system
    Department of Electrical and Computer Engineering, Citadel, Military College of South Carolina, 171 Moultrie Street, Charleston, SC 29409, United States
    PACLIC - Pacific Asia Conf. Lang., Inf. Comput., Proc., 2007, : 405 - 414
  • [6] ON GRANULARITY OF PROSODIC REPRESENTATIONS IN EXPRESSIVE TEXT-TO-SPEECH
    Babianski, Mikolaj
    Pokora, Kamil
    Shah, Raahil
    Sienkiewicz, Rafal
    Korzekwa, Daniel
    Klimkov, Viacheslav
    2022 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT, 2022, : 892 - 899
  • [7] Prosodic Annotation in a Thai Text-to-speech System
    Potisuk, Siripong
    PACLIC 21: THE 21ST PACIFIC ASIA CONFERENCE ON LANGUAGE, INFORMATION AND COMPUTATION, PROCEEDINGS, 2007, : 405 - 414
  • [8] Increasing Prosodic Variability of Text-To-Speech Synthesizers
    Nemeth, Geza
    Fek, Mark
    Csapo, Tamas Gabor
    INTERSPEECH 2007: 8TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION, VOLS 1-4, 2007, : 1981 - 1984
  • [9] Speech synthesis for text-to-speech alignment and prosodic feature extraction
    Malfrere, F
    Dutoit, T
    ISCAS '97 - PROCEEDINGS OF 1997 IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS, VOLS I - IV: CIRCUITS AND SYSTEMS IN THE INFORMATION AGE, 1997, : 2637 - 2640
  • [10] Prosodic boundary prediction model for Vietnamese text-to-speech
    Trang, Nguyen Thi Thu
    Ky, Nguyen Hoang
    Rilliard, Albert
    D'Alessandro, Christophe
    Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2021, 5 : 3366 - 3370