PROSODIC REPRESENTATION LEARNING AND CONTEXTUAL SAMPLING FOR NEURAL TEXT-TO-SPEECH

Cited by: 10
Authors
Karlapati, Sri [1]
Abbas, Ammar [1]
Hodari, Zack [2]
Moinet, Alexis [1]
Joly, Arnaud [1]
Karanasou, Penny [1]
Drugman, Thomas [1]
Affiliations
[1] Amazon Res, Cambridge, England
[2] Univ Edinburgh, Ctr Speech Technol Res, Edinburgh, Midlothian, Scotland
Keywords
TTS; prosody modelling; contextual prosody
DOI
10.1109/ICASSP39728.2021.9413696
Chinese Library Classification number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
In this paper, we introduce Kathaka, a model trained with a novel two-stage training process for neural speech synthesis with contextually appropriate prosody. In Stage I, we learn a prosodic distribution at the sentence level from mel-spectrograms available during training. In Stage II, we propose a novel method to sample from this learnt prosodic distribution using the contextual information available in text. To do this, we use BERT on text, and graph-attention networks on parse trees extracted from text. We show a statistically significant relative improvement of 13.2% in naturalness over a strong baseline when compared to recordings. We also conduct an ablation study on variations of our sampling technique, and show a statistically significant improvement over the baseline in each case.
Pages: 6573 - 6577
Page count: 5
Related papers
50 in total
  • [21] Trainable prosodic model for standard Chinese Text-to-Speech system
    TAO Jianhua
    Chinese Journal of Acoustics, 2001, (03) : 257 - 265
  • [22] Emotion-Aware Prosodic Phrasing for Expressive Text-to-Speech
    Liu, Bin
    Liu, Rui
    Li, Haizhou
    MAN-MACHINE SPEECH COMMUNICATION, NCMMSC 2024, 2025, 2312 : 326 - 337
  • [23] TIME-DOMAIN PROSODIC MODIFICATIONS FOR TEXT-TO-SPEECH SYNTHESIZER
    Lopatka, Kuba
    Suchomski, Piotr
    Czyzewski, Andrzej
    SPA 2010: SIGNAL PROCESSING ALGORITHMS, ARCHITECTURES, ARRANGEMENTS, AND APPLICATIONS CONFERENCE PROCEEDINGS, 2010 : 73 - 77
  • [24] A prosodic phrasing model for a Korean text-to-speech synthesis system
    Yoon, K
    COMPUTER SPEECH AND LANGUAGE, 2006, 20 (01) : 69 - 79
  • [25] Derivation of prosody for text-to-speech from prosodic sentence structure
    Quene, Hugo
    Kager, Rene
    Computer Speech and Language, 1992, 6 (01) : 77 - 98
  • [26] A method for estimating prosodic symbol from text for Japanese text-to-speech synthesis
    Magata, K
    Hamagami, T
    Komura, M
    ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, 1996 : 1373 - 1376
  • [27] SegINR: Segment-Wise Implicit Neural Representation for Sequence Alignment in Neural Text-to-Speech
    Kim, Minchan
    Jeong, Myeonghun
    Lee, Joun Yeop
    Kim, Nam Soo
    IEEE SIGNAL PROCESSING LETTERS, 2025, 32 : 646 - 650
  • [28] Learning to Maximize Speech Quality Directly Using MOS Prediction for Neural Text-to-Speech
    Choi, Yeunju
    Jung, Youngmoon
    Suh, Youngjoo
    Kim, Hoirin
    IEEE ACCESS, 2022, 10 : 52621 - 52629
  • [29] Research on prosodic features and their prediction issues in Uyghur Text-to-Speech System
    Hamdulla, Askar
    Rozi, Askar
    Eli, Gulnar
    Tursun, Dilmurat
    PROCEEDINGS OF THE 2009 PACIFIC-ASIA CONFERENCE ON CIRCUITS, COMMUNICATIONS AND SYSTEM, 2009 : 257 - 260
  • [30] Prosodic rules for schwa-deletion in hindi text-to-speech synthesis
    Tyson, Na'im
    Nagar, Ila
    INTERNATIONAL JOURNAL OF SPEECH TECHNOLOGY, 2009, 12 (01) : 15 - 25