Speaker Adaptive Text-to-Speech With Timbre-Normalized Vector-Quantized Feature

Cited by: 2
Authors
Du, Chenpeng [1 ]
Guo, Yiwei [1 ]
Chen, Xie [1 ]
Yu, Kai [1 ]
Affiliations
[1] Shanghai Jiao Tong Univ, AI Inst, Dept Comp Sci & Engn, X LANCE Lab,MoE Key Lab of Artificial Intelligence, Shanghai 200240, Peoples R China
Keywords
Speech synthesis; speaker adaptation; timbre normalization; vector quantization; pitch
DOI
10.1109/TASLP.2023.3308374
Chinese Library Classification
O42 [Acoustics]
Discipline Classification Codes
070206; 082403
Abstract
Achieving high fidelity and speaker similarity in text-to-speech speaker adaptation with a limited amount of data is a challenging task. Most existing methods only adapt to the timbre of the target speakers but fail to capture their speaking styles from little data. In this work, we propose a novel TTS system, TN-VQTTS, which leverages timbre-normalized vector-quantized (TN-VQ) acoustic features for speaker adaptation with little data. With the TN-VQ features, speaking style and timbre can be effectively decomposed and controlled separately by the acoustic model and the vocoder of VQTTS. Such decomposition enables us to closely mimic both characteristics of the target speaker when adapting with little data. Specifically, we first reduce the dimensionality of self-supervised VQ acoustic features via PCA and normalize their timbre with a normalizing flow model. The features are then quantized with k-means and used as the TN-VQ features of a multi-speaker VQ-TTS system. Furthermore, we optimize timbre-independent style embeddings of the training speakers jointly with the acoustic model and store them in a lookup table. The embedding table later serves as a selectable codebook, or a set of basis vectors, for representing the style of unseen speakers. Our experiments on the LibriTTS dataset first show that the proposed model architecture for VQ features achieves better performance in multi-speaker text-to-speech synthesis than several existing methods. We also find that the reconstruction performance and the naturalness remain almost unchanged after applying timbre normalization and k-means quantization. Finally, we show that TN-VQTTS achieves better speaker similarity in adaptation than both a speaker-embedding-based adaptation method and the fine-tuning-based baseline AdaSpeech.
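The TN-VQ feature extraction described in the abstract (PCA dimensionality reduction, timbre normalization, then k-means quantization) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the paper normalizes timbre with a trained normalizing flow, for which simple per-speaker standardization stands in here as a crude placeholder, and the dimensionalities and codebook size are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA


def extract_tn_vq_features(feats, speaker_ids, n_dims=8, n_codes=16):
    """Sketch of the TN-VQ pipeline: PCA -> timbre normalization -> k-means.

    feats:       (n_frames, d) self-supervised VQ acoustic features
    speaker_ids: (n_frames,) speaker label per frame
    Returns discrete TN-VQ code indices and the k-means codebook.
    """
    # Step 1: reduce feature dimensionality with PCA.
    reduced = PCA(n_components=n_dims).fit_transform(feats)

    # Step 2: timbre normalization. The paper uses a normalizing flow;
    # per-speaker standardization is only a stand-in for illustration.
    normed = np.empty_like(reduced)
    for spk in np.unique(speaker_ids):
        mask = speaker_ids == spk
        mu = reduced[mask].mean(axis=0)
        sigma = reduced[mask].std(axis=0) + 1e-8
        normed[mask] = (reduced[mask] - mu) / sigma

    # Step 3: quantize the normalized features with k-means; the cluster
    # indices serve as the discrete TN-VQ feature sequence.
    km = KMeans(n_clusters=n_codes, n_init=10, random_state=0).fit(normed)
    return km.predict(normed), km.cluster_centers_
```

In the full system, the acoustic model would predict these discrete indices from text (carrying speaking style), while the vocoder reintroduces the target speaker's timbre during waveform generation.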
Pages: 3446-3456
Page count: 11