Narrator or Character: Voice Modulation in an Expressive Multi-speaker TTS

Cited by: 1
Authors: Kalyan, T. Pavan [1]; Rao, Preeti [1]; Jyothi, Preethi [1]; Bhattacharyya, Pushpak [1]
Affiliation: [1] Indian Institute of Technology, Mumbai, Maharashtra, India
Keywords: Expressive TTS; speech synthesis; new TTS corpus; prosody modelling
DOI: 10.21437/Interspeech.2023-2469
CLC (Chinese Library Classification): O42 [Acoustics]
Subject classification codes: 070206; 082403
Abstract:
Current Text-to-Speech (TTS) systems are trained on audiobook data and perform well in synthesizing read-style speech. In this work, we are interested in synthesizing audio stories as narrated to children. The storytelling style is more expressive and requires perceptible changes of voice between the narrator and the story characters. To address these challenges, we present a new TTS corpus of English audio stories for children with 32.7 hours of speech by a single female speaker with a UK accent. We provide evidence of the salient differences in the suprasegmentals of the narrator and character utterances in the dataset, motivating the use of a multi-speaker TTS for our application. We use a fine-tuned BERT model to label each sentence as narrator or character speech; this label subsequently conditions the TTS output. Experiments show our new TTS system is superior in expressiveness, in both A-B preference and MOS testing, compared to reading-style TTS and single-speaker TTS.
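The pipeline the abstract describes, labeling each sentence as narrator or character and conditioning a multi-speaker TTS on that label, can be sketched as below. Note this is a minimal illustration: a crude quote-detection heuristic stands in for the paper's fine-tuned BERT classifier, and the function names and speaker-ID mapping are hypothetical, not taken from the paper.

```python
# Illustrative sketch of the labeling-then-conditioning pipeline.
# A quote heuristic stands in for the paper's fine-tuned BERT sentence
# classifier; the speaker-ID mapping is a hypothetical TTS interface.

def label_sentence(sentence: str) -> str:
    """Crude stand-in for the BERT classifier: quoted speech -> character."""
    has_quote = '"' in sentence or "\u201c" in sentence  # straight or curly quote
    return "character" if has_quote else "narrator"

# Hypothetical speaker IDs for a multi-speaker TTS backend.
SPEAKER_ID = {"narrator": 0, "character": 1}

def plan_synthesis(sentences):
    """Yield (speaker_id, sentence) pairs to feed a multi-speaker TTS."""
    for s in sentences:
        yield SPEAKER_ID[label_sentence(s)], s

story = [
    "The fox crept closer to the henhouse.",
    '"Who goes there?" clucked the hen.',
]
for speaker_id, text in plan_synthesis(story):
    print(speaker_id, text)  # narrator lines get ID 0, character lines ID 1
```

In the paper itself, the narrator/character decision is made by a fine-tuned BERT model rather than surface punctuation, and the label conditions the synthesis network directly rather than selecting a fixed speaker ID.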
Pages: 4808-4812 (5 pages)
Related papers (showing items 41-50 of 50):
  • [41] JOINTLY RECOGNIZING MULTI-SPEAKER CONVERSATIONS
    Ji, Gang
    Bilmes, Jeff
    2010 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2010, : 5110 - 5113
  • [42] Multi-Speaker Meeting Audio Segmentation
    Nwe, Tin Lay
    Dong, Minghui
    Khine, Swe Zin Kalayar
    Li, Haizhou
    INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, 2008, : 2522 - 2525
  • [43] Speaker conditioned acoustic modeling for multi-speaker conversational ASR
    Chetupalli, Srikanth Raj
    Ganapathy, Sriram
    INTERSPEECH 2022, 2022, : 3834 - 3838
  • [44] ENERGY-BASED MULTI-SPEAKER VOICE ACTIVITY DETECTION WITH AN AD HOC MICROPHONE ARRAY
    Bertrand, Alexander
    Moonen, Marc
    2010 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2010, : 85 - 88
  • [45] Fast ICA for Multi-speaker Recognition System
    Zhou, Yan
    Zhao, Zhiqiang
    ADVANCED INTELLIGENT COMPUTING THEORIES AND APPLICATIONS, 2010, 93 : 507 - 513
  • [46] Multi-Speaker Dialogue for Vehicular Navigation and Assistance
    Hsien-Chang Wang
    Jhing-Fa Wang
    International Journal of Speech Technology, 2004, 7 (2-3) : 231 - 244
  • [47] AN INVESTIGATION OF MULTI-SPEAKER TRAINING FOR WAVENET VOCODER
    Hayashi, Tomoki
    Tamamori, Akira
    Kobayashi, Kazuhiro
    Takeda, Kazuya
    Toda, Tomoki
    2017 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU), 2017, : 712 - 718
  • [48] MultiSpeech: Multi-Speaker Text to Speech with Transformer
    Chen, Mingjian
    Tan, Xu
    Ren, Yi
    Xu, Jin
    Sun, Hao
    Zhao, Sheng
    Qin, Tao
    INTERSPEECH 2020, 2020, : 4024 - 4028
  • [49] Evolutive HMM for multi-speaker tracking system
    Meignier, S
    Bonastre, JF
    Fredouille, C
    Merlin, T
    2000 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, PROCEEDINGS, VOLS I-VI, 2000, : 1201 - 1204
  • [50] Multi-speaker Recognition in Cocktail Party Problem
    Wang, Yiqian
    Sun, Wensheng
    COMMUNICATIONS, SIGNAL PROCESSING, AND SYSTEMS, 2019, 463 : 2116 - 2123