Speaker Adaptive Text-to-Speech With Timbre-Normalized Vector-Quantized Feature

Cited by: 2
Authors
Du, Chenpeng [1 ]
Guo, Yiwei [1 ]
Chen, Xie [1 ]
Yu, Kai [1 ]
Affiliations
[1] Shanghai Jiao Tong Univ, AI Inst, Dept Comp Sci & Engn, X LANCE Lab,MoE Key Lab of Artificial Intelligence, Shanghai 200240, Peoples R China
Keywords
Speech synthesis; speaker adaptation; timbre normalization; vector quantization; pitch
DOI
10.1109/TASLP.2023.3308374
Chinese Library Classification
O42 [Acoustics]
Discipline Classification Codes
070206; 082403
Abstract
Achieving high fidelity and speaker similarity in text-to-speech speaker adaptation with a limited amount of data is a challenging task. Most existing methods only adapt to the timbre of the target speakers but fail to capture their speaking styles from little data. In this work, we propose a novel TTS system, TN-VQTTS, which leverages timbre-normalized vector-quantized (TN-VQ) acoustic features for speaker adaptation with little data. With the TN-VQ features, speaking style and timbre can be effectively decomposed and controlled separately by the acoustic model and the vocoder of VQTTS. Such decomposition enables us to closely mimic both characteristics of the target speaker when adapting with little data. Specifically, we first reduce the dimensionality of self-supervised VQ acoustic features via PCA and normalize their timbre with a normalizing flow model. The features are then quantized with k-means and used as the TN-VQ features of a multi-speaker VQ-TTS system. Furthermore, we optimize timbre-independent style embeddings of the training speakers jointly with the acoustic model and store them in a lookup table. The embedding table later serves as a selectable codebook, or a set of basis vectors, for representing the style of unseen speakers. Our experiments on the LibriTTS dataset first show that the proposed model architecture for VQ features achieves better performance in multi-speaker text-to-speech synthesis than several existing methods. We also find that the reconstruction performance and the naturalness remain almost unchanged after applying timbre normalization and k-means quantization. Finally, we show that TN-VQTTS achieves better speaker similarity in adaptation than both a speaker-embedding-based adaptation method and the fine-tuning-based baseline AdaSpeech.
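The TN-VQ feature extraction described in the abstract (PCA dimensionality reduction, timbre normalization, then k-means quantization) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the paper normalizes timbre with a trained normalizing flow, for which simple per-speaker standardization stands in here as a crude placeholder, and the dimensionalities and codebook size are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA


def extract_tn_vq_features(feats, speaker_ids, n_dims=8, n_codes=16):
    """Sketch of the TN-VQ pipeline: PCA -> timbre normalization -> k-means.

    feats:       (n_frames, d) self-supervised VQ acoustic features
    speaker_ids: (n_frames,) speaker label per frame
    Returns discrete TN-VQ code indices and the k-means codebook.
    """
    # Step 1: reduce feature dimensionality with PCA.
    reduced = PCA(n_components=n_dims).fit_transform(feats)

    # Step 2: timbre normalization. The paper uses a normalizing flow;
    # per-speaker standardization is only a stand-in for illustration.
    normed = np.empty_like(reduced)
    for spk in np.unique(speaker_ids):
        mask = speaker_ids == spk
        mu = reduced[mask].mean(axis=0)
        sigma = reduced[mask].std(axis=0) + 1e-8
        normed[mask] = (reduced[mask] - mu) / sigma

    # Step 3: quantize the normalized features with k-means; the cluster
    # indices serve as the discrete TN-VQ feature sequence.
    km = KMeans(n_clusters=n_codes, n_init=10, random_state=0).fit(normed)
    return km.predict(normed), km.cluster_centers_
```

In the full system, the acoustic model would predict these discrete indices from text (carrying speaking style), while the vocoder reintroduces the target speaker's timbre during waveform generation.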
Pages: 3446-3456
Page count: 11