SPLGAN-TTS: Learning Semantic and Prosody to Enhance the Text-to-Speech Quality of Lightweight GAN Models

Authors
Chang, Ding-Chi [1]
Li, Shiou-Chi [2]
Huang, Jen-Wei [1]
Affiliations
[1] Natl Cheng Kung Univ, Dept Elect Engn, Tainan, Taiwan
[2] Natl Cheng Kung Univ, Inst Comp & Commun Engn, Dept Elect Engn, Tainan, Taiwan
Keywords
speech synthesis; non-autoregressive; tree-based architecture; generative adversarial networks
DOI
10.1007/978-981-96-2071-5_5
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Autoregressive models have proven effective in speech synthesis; however, their large parameter counts and slow inference limit their applicability. Non-autoregressive models can resolve these issues, but their speech synthesis quality is unsatisfactory. This study employs a tree-based structure to enhance the learning of semantic and prosody information in a lightweight model. The generator is built on a variational autoencoder (VAE), and a novel normalizing-flow module increases the complexity of the VAE-generated distribution. We also developed a speech discriminator with a multi-length architecture to reduce computational overhead, along with multiple auxiliary losses to assist in model training. The proposed model is smaller than existing state-of-the-art models and synthesizes speech faster, particularly on longer texts. Although the proposed model is roughly 30% smaller than FastSpeech2 [1], its mean opinion score surpasses that of FastSpeech2 and of the other models compared.
Pages: 58-70
Page count: 13