Variable-Length Speaker Conditioning in Flow-Based Text-to-Speech

Cited by: 0
Authors
Choi, Byoung Jin [1 ]
Jeong, Myeonghun [1 ]
Kim, Minchan [1 ]
Kim, Nam Soo [1 ]
Affiliations
[1] Seoul Natl Univ, Inst New Media & Commun, Dept Elect & Comp Engn, Seoul 08826, South Korea
Keywords
Speech synthesis; variable-length reference embedding sequence; zero-shot multi-speaker text-to-speech
DOI
10.1109/LSP.2024.3377588
CLC classification
TM [Electrical Engineering]; TN [Electronics and Communication Technology]
Discipline codes
0808; 0809
Abstract
In this letter, we propose a novel speaker conditioning technique that leverages a variable-length reference embedding sequence for a flow-based text-to-speech (TTS) architecture in the context of zero-shot multi-speaker text-to-speech (ZSM-TTS). Unlike conventional ZSM-TTS methods, which usually rely on a single fixed-dimensional vector to represent the entire reference speech, our approach extracts a variable-length embedding sequence for more flexible and efficient conditioning. We enhance the affine coupling function of the flow-based TTS architecture by introducing attentive speaker conditioning, which allows the speaker condition to vary locally over time. Our experiments demonstrate the effectiveness of the proposed method, showing improvements in speaker similarity, speech naturalness, and speech intelligibility over the baseline methods.
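To make the idea in the abstract concrete, the following is a minimal PyTorch sketch of what an affine coupling layer with attentive speaker conditioning could look like: the first half of the channels cross-attends to a variable-length reference embedding sequence before predicting the per-time-step scale and shift applied to the second half. This is not the paper's implementation; the module names, dimensions, two-head attention, and residual fusion are illustrative assumptions.

```python
# Sketch only: cross-attention-conditioned affine coupling (assumed design, not the authors' code).
import torch
import torch.nn as nn


class AttentiveAffineCoupling(nn.Module):
    """Affine coupling layer conditioned on a variable-length speaker embedding sequence."""

    def __init__(self, channels: int, hidden: int = 192, spk_dim: int = 256):
        super().__init__()
        self.half = channels // 2
        self.pre = nn.Conv1d(self.half, hidden, kernel_size=3, padding=1)
        # Cross-attention: queries from the latent/text side, keys/values from the reference sequence.
        self.attn = nn.MultiheadAttention(hidden, num_heads=2, batch_first=True)
        self.spk_proj = nn.Linear(spk_dim, hidden)
        # Predicts per-frame log-scale and shift for the second half of the channels.
        self.post = nn.Conv1d(hidden, self.half * 2, kernel_size=3, padding=1)

    def forward(self, x, spk_seq, reverse: bool = False):
        # x:       (B, channels, T)  latent/acoustic features
        # spk_seq: (B, L, spk_dim)   variable-length reference embeddings
        x_a, x_b = x[:, : self.half], x[:, self.half :]

        h = self.pre(x_a)                        # (B, hidden, T)
        q = h.transpose(1, 2)                    # (B, T, hidden)
        kv = self.spk_proj(spk_seq)              # (B, L, hidden)
        ctx, _ = self.attn(q, kv, kv)            # speaker context per time step
        h = (q + ctx).transpose(1, 2)            # residual fusion, back to (B, hidden, T)

        log_s, t = self.post(h).chunk(2, dim=1)  # per-step scale/shift

        if not reverse:
            y_b = x_b * torch.exp(log_s) + t     # forward pass used during training
            logdet = log_s.sum(dim=[1, 2])
            return torch.cat([x_a, y_b], dim=1), logdet
        y_b = (x_b - t) * torch.exp(-log_s)      # inverse pass used at synthesis time
        return torch.cat([x_a, y_b], dim=1)


if __name__ == "__main__":
    layer = AttentiveAffineCoupling(channels=80)
    x = torch.randn(2, 80, 120)                  # batch of latent frames
    spk = torch.randn(2, 37, 256)                # reference sequences of length 37
    y, logdet = layer(x, spk)
    x_rec = layer(y, spk, reverse=True)
    print(y.shape, logdet.shape, torch.allclose(x, x_rec, atol=1e-5))
```

Because the scale and shift depend only on the untouched half and the reference sequence, the transform stays exactly invertible, which is what lets the speaker condition vary per frame without breaking the flow's likelihood computation.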
Pages: 899-903
Page count: 5