SPLGAN-TTS: Learning Semantic and Prosody to Enhance the Text-to-Speech Quality of Lightweight GAN Models

Cited by: 0
Authors
Chang, Ding-Chi [1 ]
Li, Shiou-Chi [2 ]
Huang, Jen-Wei [1 ]
Affiliations
[1] Natl Cheng Kung Univ, Dept Elect Engn, Tainan, Taiwan
[2] Natl Cheng Kung Univ, Inst Comp & Commun Engn, Dept Elect Engn, Tainan, Taiwan
Keywords
speech synthesis; non-autoregressive; tree-based architecture; generative adversarial networks
DOI
10.1007/978-981-96-2071-5_5
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Autoregressive models have proven effective for speech synthesis; however, their large parameter counts and slow inference limit their applicability. Non-autoregressive models resolve these issues, but their synthesis quality is often unsatisfactory. This study employs a tree-based structure to enhance the learning of semantic and prosody information in a lightweight model. The generator is built on a Variational Autoencoder (VAE), and a novel normalizing-flow module is used to increase the complexity of the VAE-generated distribution. We also developed a speech discriminator with a multi-length architecture to reduce computational overhead, together with multiple auxiliary losses to assist model training. The proposed model is smaller than existing state-of-the-art models and synthesizes speech faster, particularly for longer texts. Although the proposed model is roughly 30% smaller than FastSpeech2 [1], its mean opinion score surpasses that of FastSpeech2 as well as other models.
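The abstract describes enriching a VAE posterior with a normalizing-flow module inside the generator. The following is a minimal, hypothetical PyTorch sketch of that general idea only; it is not the authors' SPLGAN-TTS implementation, and all module names, layer sizes, and the choice of affine-coupling flows are illustrative assumptions.

```python
# Sketch (assumed, not from the paper): a Gaussian VAE posterior whose sample
# is passed through a few normalizing-flow steps, making the latent
# distribution more expressive than a plain diagonal Gaussian.
import torch
import torch.nn as nn


class AffineCouplingFlow(nn.Module):
    """One affine-coupling step: transforms half of z conditioned on the other half."""

    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.half, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.half)),
        )

    def forward(self, z: torch.Tensor):
        z_a, z_b = z[:, :self.half], z[:, self.half:]
        log_scale, shift = self.net(z_a).chunk(2, dim=-1)
        log_scale = torch.tanh(log_scale)           # keep the transform numerically stable
        z_b = z_b * torch.exp(log_scale) + shift
        log_det = log_scale.sum(dim=-1)             # log|det J| of this step
        return torch.cat([z_a, z_b], dim=-1), log_det


class FlowVAEEncoder(nn.Module):
    """Toy encoder: Gaussian posterior followed by K flow steps (illustrative sizes)."""

    def __init__(self, in_dim: int = 80, latent_dim: int = 16, n_flows: int = 4):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU())
        self.to_mu = nn.Linear(128, latent_dim)
        self.to_logvar = nn.Linear(128, latent_dim)
        self.flows = nn.ModuleList(AffineCouplingFlow(latent_dim) for _ in range(n_flows))

    def forward(self, x: torch.Tensor):
        h = self.backbone(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization
        total_log_det = torch.zeros(x.size(0))
        for flow in self.flows:                      # enrich the posterior with flows
            z, log_det = flow(z)
            total_log_det = total_log_det + log_det
        return z, mu, logvar, total_log_det          # log-det would enter the KL/ELBO term


if __name__ == "__main__":
    enc = FlowVAEEncoder()
    z, mu, logvar, log_det = enc(torch.randn(2, 80))
    print(z.shape, log_det.shape)                    # torch.Size([2, 16]) torch.Size([2])
```

In a GAN-based TTS pipeline such as the one summarized above, a latent produced this way would feed the decoder/generator, with the flow's log-determinant contributing to the KL term alongside the adversarial and auxiliary losses the abstract mentions.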
Pages: 58-70
Number of pages: 13