Narrator or Character: Voice Modulation in an Expressive Multi-speaker TTS

Cited by: 1
Authors
Kalyan, T. Pavan [1 ]
Rao, Preeti [1 ]
Jyothi, Preethi [1 ]
Bhattacharyya, Pushpak [1 ]
Affiliations
[1] Indian Institute of Technology, Mumbai, Maharashtra, India
Source
INTERSPEECH 2023
Keywords
Expressive TTS; speech synthesis; new TTS corpus; prosody modelling
DOI
10.21437/Interspeech.2023-2469
CLC Classification Number
O42 [Acoustics]
Discipline Codes
070206; 082403
Abstract
Current Text-to-Speech (TTS) systems are trained on audiobook data and perform well at synthesizing read-style speech. In this work, we are interested in synthesizing audio stories as narrated to children. The storytelling style is more expressive and requires perceptible changes of voice between the narrator and the story characters. To address these challenges, we present a new TTS corpus of English audio stories for children, comprising 32.7 hours of speech by a single female speaker with a UK accent. We provide evidence of salient differences in the suprasegmentals of narrator and character utterances in the dataset, motivating the use of a multi-speaker TTS for our application. We use a fine-tuned BERT model to label each sentence as spoken by either the narrator or a character, and this label is then used to condition the TTS output. Experiments show our new TTS system is superior in expressiveness, in both A-B preference and MOS testing, to reading-style TTS and single-speaker TTS.
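The narrator-vs-character conditioning described in the abstract can be illustrated with a minimal sketch. The paper uses a fine-tuned BERT classifier for this step; the stand-in below instead uses a simple quotation-based heuristic (the function name, threshold, and story text are all illustrative, not from the paper), producing per-sentence speaker labels that a multi-speaker TTS could be conditioned on:

```python
import re

def label_sentences(sentences):
    """Assign a speaker label to each sentence: 'character' if the
    sentence is mostly quoted dialogue, else 'narrator'.
    A rule-based stand-in for the paper's fine-tuned BERT classifier."""
    labels = []
    for s in sentences:
        # Total length of text inside double quotes in this sentence.
        quoted_len = sum(len(q) for q in re.findall(r'"([^"]*)"', s))
        # Heuristic: majority-quoted sentences are treated as character speech.
        labels.append("character" if quoted_len > len(s) / 2 else "narrator")
    return labels

story = [
    'The wolf crept closer to the little house.',
    '"Little pig, little pig, let me come in!"',
    '"Not by the hair on my chinny chin chin," said the pig.',
]
# Each label would be mapped to a speaker ID conditioning the TTS output.
print(label_sentences(story))  # ['narrator', 'character', 'character']
```

In the actual system the labels come from a sentence-level classifier rather than surface punctuation, which matters for mixed sentences like the third one above (part quote, part narration), where a learned model can use lexical context instead of a character-count threshold.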
Pages: 4808 - 4812
Page count: 5
Related Papers
50 items in total
  • [1] LIMMITS'24: Multi-Speaker, Multi-Lingual INDIC TTS With Voice Cloning
    Udupa, Sathvik
    Bandekar, Jesuraja
    Singh, Abhayjeet
    Deekshitha, G.
    Kumar, Saurabh
    Badiger, Sandhya
    Nagireddi, Amala
    Roopa, R.
    Ghosh, Prasanta Kumar
    Murthy, Hema A.
    Kumar, Pranaw
    Tokuda, Keiichi
    Hasegawa-Johnson, Mark
    Olbrich, Philipp
    IEEE OPEN JOURNAL OF SIGNAL PROCESSING, 2025, 6 : 293 - 302
  • [2] CAN WE USE COMMON VOICE TO TRAIN A MULTI-SPEAKER TTS SYSTEM?
    Ogun, Sewade
    Colotte, Vincent
    Vincent, Emmanuel
    2022 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT, 2022, : 900 - 905
  • [3] LIMMITS'24: MULTI-SPEAKER, MULTI-LINGUAL INDIC TTS WITH VOICE CLONING
    Singh, Abhayjeet
    Nagireddi, Amala
    Deekshitha, G.
    Bandekar, Jesuraja
    Roopa, R.
    Badiger, Sandhya
    Udupa, Sathvik
    Ghosh, Prasanta Kumar
    Murthy, Hema A.
    Kumar, Pranaw
    Tokuda, Keiichi
    Hasegawa-Johnson, Mark
    Olbrich, Philipp
    2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING WORKSHOPS, ICASSPW 2024, 2024, : 61 - 62
  • [4] Unsupervised Speaker and Expression Factorization for Multi-Speaker Expressive Synthesis of Ebooks
    Chen, Langzhou
    Braunschweiler, Norbert
    14TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2013), VOLS 1-5, 2013, : 1041 - 1045
  • [5] MULTI-SPEAKER MODELING AND SPEAKER ADAPTATION FOR DNN-BASED TTS SYNTHESIS
    Fan, Yuchen
    Qian, Yao
    Soong, Frank K.
    He, Lei
    2015 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP), 2015, : 4475 - 4479
  • [6] Can Speaker Augmentation Improve Multi-Speaker End-to-End TTS?
    Cooper, Erica
    Lai, Cheng-I
    Yasuda, Yusuke
    Yamagishi, Junichi
    INTERSPEECH 2020, 2020, : 3979 - 3983
  • [7] Enhancing Zero-Shot Multi-Speaker TTS with Negated Speaker Representations
    Jeon, Yejin
    Kim, Yunsu
    Lee, Gary Geunbae
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 16, 2024, : 18336 - 18344
  • [8] Comparative Study for Multi-Speaker Mongolian TTS with a New Corpus
    Liang, Kailin
    Liu, Bin
    Hu, Yifan
    Liu, Rui
    Bao, Feilong
    Gao, Guanglai
    APPLIED SCIENCES-BASEL, 2023, 13 (07):
  • [9] Multi-speaker voice cryptographic key generation
    Paola Garcia-Perera, L.
    Carlos Mex-Perera, J.
    Nolazco-Flores, Juan A.
    3RD ACS/IEEE INTERNATIONAL CONFERENCE ON COMPUTER SYSTEMS AND APPLICATIONS, 2005
  • [10] Multi-speaker Beamforming for Voice Activity Classification
    Tran, Thuy N.
    Cowley, William
    Pollok, Andre
    2013 AUSTRALIAN COMMUNICATIONS THEORY WORKSHOP (AUSCTW), 2013, : 116 - 121