PROSODIC REPRESENTATION LEARNING AND CONTEXTUAL SAMPLING FOR NEURAL TEXT-TO-SPEECH

Cited by: 10
Authors
Karlapati, Sri [1]
Abbas, Ammar [1]
Hodari, Zack [2]
Moinet, Alexis [1]
Joly, Arnaud [1]
Karanasou, Penny [1]
Drugman, Thomas [1]
Affiliations
[1] Amazon Res, Cambridge, England
[2] Univ Edinburgh, Ctr Speech Technol Res, Edinburgh, Midlothian, Scotland
Keywords
TTS; prosody modelling; contextual prosody
DOI
10.1109/ICASSP39728.2021.9413696
Chinese Library Classification (CLC) number
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
In this paper, we introduce Kathaka, a model trained with a novel two-stage training process for neural speech synthesis with contextually appropriate prosody. In Stage I, we learn a prosodic distribution at the sentence level from mel-spectrograms available during training. In Stage II, we propose a novel method to sample from this learnt prosodic distribution using the contextual information available in text. To do this, we use BERT on text, and graph-attention networks on parse trees extracted from text. We show a statistically significant relative improvement of 13.2% in naturalness over a strong baseline when compared to recordings. We also conduct an ablation study on variations of our sampling technique, and show a statistically significant improvement over the baseline in each case.
Pages: 6573-6577
Page count: 5