PROSODIC REPRESENTATION LEARNING AND CONTEXTUAL SAMPLING FOR NEURAL TEXT-TO-SPEECH

被引:10
|
作者
Karlapati, Sri [1 ]
Abbas, Ammar [1 ]
Hodari, Zack [2 ]
Moinet, Alexis [1 ]
Joly, Arnaud [1 ]
Karanasou, Penny [1 ]
Drugman, Thomas [1 ]
机构
[1] Amazon Res, Cambridge, England
[2] Univ Edinburgh, Ctr Speech Technol Res, Edinburgh, Midlothian, Scotland
关键词
TTS; prosody modelling; contextual prosody;
D O I
10.1109/ICASSP39728.2021.9413696
中图分类号
O42 [声学];
学科分类号
070206 ; 082403 ;
摘要
In this paper, we introduce Kathaka, a model trained with a novel two-stage training process for neural speech synthesis with contextually appropriate prosody. In Stage I, we learn a prosodic distribution at the sentence level from mel-spectrograms available during training. In Stage II, we propose a novel method to sample from this learnt prosodic distribution using the contextual information available in text. To do this, we use BERT on text, and graph-attention networks on parse trees extracted from text. We show a statistically significant relative improvement of 13.2% in naturalness over a strong baseline when compared to recordings. We also conduct an ablation study on variations of our sampling technique, and show a statistically significant improvement over the baseline in each case.
引用
收藏
页码:6573 / 6577
页数:5
相关论文
共 50 条
  • [31] An RNN-based prosodic information synthesizer for Mandarin text-to-speech
    Chen, SH
    Hwang, SH
    Wang, YR
    IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, 1998, 6 (03): : 226 - 239
  • [32] Neural networks for text-to-speech phoneme recognition
    Embrechts, MJ
    Arciniegas, F
    SMC 2000 CONFERENCE PROCEEDINGS: 2000 IEEE INTERNATIONAL CONFERENCE ON SYSTEMS, MAN & CYBERNETICS, VOL 1-5, 2000, : 3582 - 3587
  • [33] CATOTRON - A Neural Text-to-Speech System in Catalan
    Kulebi, Baybars
    Oktem, Alp
    Peiro-Lilja, Alex
    Pascual, Santiago
    Farrus, Mireia
    INTERSPEECH 2020, 2020, : 490 - 491
  • [34] A tree-based model of prosodic phrasing for Chinese text-to-speech systems
    Chen, WJ
    Lin, FZ
    Li, JM
    Zhang, B
    ADVANCES IN MUTLIMEDIA INFORMATION PROCESSING - PCM 2001, PROCEEDINGS, 2001, 2195 : 1054 - 1059
  • [35] FedSpeech: Federated Text-to-Speech with Continual Learning
    Jiang, Ziyue
    Ren, Yi
    Lei, Ming
    Zhao, Zhou
    PROCEEDINGS OF THE THIRTIETH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2021, 2021, : 3829 - 3835
  • [36] Learning Speaker Embedding from Text-to-Speech
    Cho, Jaejin
    Zelasko, Piotr
    Villalba, Jesus
    Watanabe, Shinji
    Dehak, Najim
    INTERSPEECH 2020, 2020, : 3256 - 3260
  • [37] New Method for Delexicalization and its Application to Prosodic Tagging for Text-to-Speech Synthesis
    Vainio, Martti
    Suni, Antti
    Raitio, Tuomo
    Nurminen, Jani
    Jarvikivi, Juhani
    Alku, Paavo
    INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, VOLS 1-5, 2009, : 1671 - 1674
  • [38] Decoding Knowledge Transfer for Neural Text-to-Speech Training
    Liu, Rui
    Sisman, Berrak
    Gao, Guanglai
    Li, Haizhou
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2022, 30 : 1789 - 1802
  • [39] Neural networks in text-to-speech systems for the Greek language
    Falas, T
    Stafylopatis, AG
    MELECON 2000: INFORMATION TECHNOLOGY AND ELECTROTECHNOLOGY FOR THE MEDITERRANEAN COUNTRIES, VOLS 1-3, PROCEEDINGS, 2000, : 574 - 577
  • [40] Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation
    Tu, Tao
    Chen, Yuan-Jui
    Liu, Alexander H.
    Lee, Hung-yi
    INTERSPEECH 2020, 2020, : 3191 - 3195