Narrator or Character: Voice Modulation in an Expressive Multi-speaker TTS

Cited by: 1
Authors
Kalyan, T. Pavan [1 ]
Rao, Preeti [1 ]
Jyothi, Preethi [1 ]
Bhattacharyya, Pushpak [1 ]
Affiliations
[1] Indian Inst Technol, Mumbai, Maharashtra, India
Source
INTERSPEECH 2023
Keywords
Expressive TTS; speech synthesis; new TTS corpus; prosody modelling
DOI
10.21437/Interspeech.2023-2469
CLC number
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
Current Text-to-Speech (TTS) systems are trained on audiobook data and perform well in synthesizing read-style speech. In this work, we are interested in synthesizing audio stories as narrated to children. The storytelling style is more expressive and requires perceptible voice changes across the narrator and story characters. To address these challenges, we present a new TTS corpus of English audio stories for children, with 32.7 hours of speech by a single female speaker with a UK accent. We provide evidence of salient differences in the suprasegmentals of the narrator and character utterances in the dataset, motivating the use of a multi-speaker TTS for our application. We use a fine-tuned BERT model to label each sentence as spoken by the narrator or a character; this label is then used to condition the TTS output. Experiments show that our new TTS system outperforms reading-style and single-speaker TTS baselines in expressiveness, in both A-B preference and MOS testing.
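The pipeline the abstract describes can be sketched as follows. This is a minimal illustration of its shape only: the paper fine-tunes BERT for the narrator/character labelling step, whereas here a toy quotation heuristic stands in for that classifier, and the speaker IDs `NARRATOR` and `CHARACTER` are hypothetical names for the two TTS conditioning labels.

```python
# Sketch: label each sentence as narrator or character speech, then pair it
# with the speaker ID that would condition a multi-speaker TTS.
# The real system uses a fine-tuned BERT classifier for label_sentence().

NARRATOR, CHARACTER = 0, 1  # hypothetical speaker IDs for the TTS


def label_sentence(sentence: str) -> int:
    """Placeholder for the fine-tuned BERT classifier: treat sentences
    containing quotation marks as character speech, else narration."""
    return CHARACTER if '"' in sentence or "\u201c" in sentence else NARRATOR


def condition_tts(sentences: list[str]) -> list[tuple[str, int]]:
    """Pair each sentence with the speaker ID used to condition synthesis."""
    return [(s, label_sentence(s)) for s in sentences]


story = [
    "The fox crept closer to the henhouse.",
    '"Who goes there?" clucked the hen.',
]
print(condition_tts(story))
```

In the actual system, the per-sentence label selects the speaker embedding of a multi-speaker TTS, so narration and character lines are rendered with perceptibly different voices.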
Pages: 4808 - 4812 (5 pages)
Related papers (50 in total)
  • [31] A hybrid approach to speaker recognition in multi-speaker environment
    Trivedi, J
    Maitra, A
    Mitra, SK
    PATTERN RECOGNITION AND MACHINE INTELLIGENCE, PROCEEDINGS, 2005, 3776 : 272 - 275
  • [32] Multi-array multi-speaker tracking
    Potamitis, I
    Tremoulis, G
    Fakotakis, N
    TEXT, SPEECH AND DIALOGUE, PROCEEDINGS, 2003, 2807 : 206 - 213
  • [33] UNSUPERVISED CLUSTERING OF EMOTION AND VOICE STYLES FOR EXPRESSIVE TTS
    Eyben, Florian
    Buchholz, Sabine
    Braunschweiler, Norbert
    Latorre, Javier
    Wan, Vincent
    Gales, Mark J. F.
    Knill, Kate
    2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2012, : 4009 - 4012
  • [34] Automatic speaker clustering from multi-speaker utterances
    MIT Lincoln Lab, Lexington, United States
    ICASSP IEEE Int Conf Acoust Speech Signal Process Proc: 817 - 820
  • [35] Automatic speaker clustering from multi-speaker utterances
    McLaughlin, J
    Reynolds, D
    Singer, E
    O'Leary, GC
    ICASSP '99: 1999 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, PROCEEDINGS VOLS I-VI, 1999, : 817 - 820
  • [36] Multi-Speaker Voice Activity Detection Using a Camera-assisted Microphone Array
    Bergh, Trond E.
    Hafizovicz, Ines
    Holm, Sverre
    PROCEEDINGS OF THE 23RD INTERNATIONAL CONFERENCE ON SYSTEMS, SIGNALS AND IMAGE PROCESSING, (IWSSIP 2016), 2016, : 327 - 330
  • [37] MULTI-SPEAKER AND MULTI-DOMAIN EMOTIONAL VOICE CONVERSION USING FACTORIZED HIERARCHICAL VARIATIONAL AUTOENCODER
    Elgaar, Mohamed
    Park, Jungbae
    Lee, Sang Wan
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 7769 - 7773
  • [38] Target-Speaker Voice Activity Detection: a Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario
    Medennikov, Ivan
    Korenevsky, Maxim
    Prisyach, Tatiana
    Khokhlov, Yuri
    Korenevskaya, Mariya
    Sorokin, Ivan
    Timofeeva, Tatiana
    Mitrofanov, Anton
    Andrusenko, Andrei
    Podluzhny, Ivan
    Laptev, Aleksandr
    Romanenko, Aleksei
    INTERSPEECH 2020, 2020, : 274 - 278
  • [39] Adversarial Training of Denoising Diffusion Model Using Dual Discriminators for High-Fidelity Multi-Speaker TTS
    Ko, Myeongjin
    Kim, Euiyeon
    Choi, Yong-Hoon
    IEEE OPEN JOURNAL OF SIGNAL PROCESSING, 2024, 5 : 577 - 587
  • [40] Speaker Clustering with Penalty Distance for Speaker Verification with Multi-Speaker Speech
    Das, Rohan Kumar
    Yang, Jichen
    Li, Haizhou
    2019 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2019, : 1630 - 1635