Variable-Length Speaker Conditioning in Flow-Based Text-to-Speech

Cited by: 0
Authors
Choi, Byoung Jin [1 ]
Jeong, Myeonghun [1 ]
Kim, Minchan [1 ]
Kim, Nam Soo [1 ]
Affiliations
[1] Seoul Natl Univ, Inst New Media & Commun, Dept Elect & Comp Engn, Seoul 08826, South Korea
Keywords
Speech synthesis; variable-length reference embedding sequence; zero-shot multi-speaker text-to-speech
DOI
10.1109/LSP.2024.3377588
CLC classification
TM [Electrical Engineering]; TN [Electronics and Communication Technology]
Discipline codes
0808; 0809
Abstract
In this letter, we propose a novel speaker conditioning technique that leverages a variable-length reference embedding sequence for a flow-based text-to-speech (TTS) architecture in the context of zero-shot multi-speaker text-to-speech (ZSM-TTS). Unlike conventional ZSM-TTS methods, which usually rely on a single fixed-dimensional vector to represent the entire reference speech, our approach extracts a variable-length embedding sequence for more flexible and efficient conditioning. We enhance the affine coupling function of the flow-based TTS architecture by introducing attentive speaker conditioning, which allows the speaker condition to vary locally over time. Our experiments demonstrate the effectiveness of the proposed method, showing improvements in speaker similarity, speech naturalness, and speech intelligibility over the baseline methods.
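To make the idea in the abstract concrete, the following is a minimal PyTorch sketch of what an affine coupling layer with attentive speaker conditioning could look like: the first half of the channels cross-attends to a variable-length reference embedding sequence before predicting the per-time-step scale and shift applied to the second half. This is not the paper's implementation; the module names, dimensions, two-head attention, and residual fusion are illustrative assumptions.

```python
# Sketch only: cross-attention-conditioned affine coupling (assumed design, not the authors' code).
import torch
import torch.nn as nn


class AttentiveAffineCoupling(nn.Module):
    """Affine coupling layer conditioned on a variable-length speaker embedding sequence."""

    def __init__(self, channels: int, hidden: int = 192, spk_dim: int = 256):
        super().__init__()
        self.half = channels // 2
        self.pre = nn.Conv1d(self.half, hidden, kernel_size=3, padding=1)
        # Cross-attention: queries from the latent/text side, keys/values from the reference sequence.
        self.attn = nn.MultiheadAttention(hidden, num_heads=2, batch_first=True)
        self.spk_proj = nn.Linear(spk_dim, hidden)
        # Predicts per-frame log-scale and shift for the second half of the channels.
        self.post = nn.Conv1d(hidden, self.half * 2, kernel_size=3, padding=1)

    def forward(self, x, spk_seq, reverse: bool = False):
        # x:       (B, channels, T)  latent/acoustic features
        # spk_seq: (B, L, spk_dim)   variable-length reference embeddings
        x_a, x_b = x[:, : self.half], x[:, self.half :]

        h = self.pre(x_a)                        # (B, hidden, T)
        q = h.transpose(1, 2)                    # (B, T, hidden)
        kv = self.spk_proj(spk_seq)              # (B, L, hidden)
        ctx, _ = self.attn(q, kv, kv)            # speaker context per time step
        h = (q + ctx).transpose(1, 2)            # residual fusion, back to (B, hidden, T)

        log_s, t = self.post(h).chunk(2, dim=1)  # per-step scale/shift

        if not reverse:
            y_b = x_b * torch.exp(log_s) + t     # forward pass used during training
            logdet = log_s.sum(dim=[1, 2])
            return torch.cat([x_a, y_b], dim=1), logdet
        y_b = (x_b - t) * torch.exp(-log_s)      # inverse pass used at synthesis time
        return torch.cat([x_a, y_b], dim=1)


if __name__ == "__main__":
    layer = AttentiveAffineCoupling(channels=80)
    x = torch.randn(2, 80, 120)                  # batch of latent frames
    spk = torch.randn(2, 37, 256)                # reference sequences of length 37
    y, logdet = layer(x, spk)
    x_rec = layer(y, spk, reverse=True)
    print(y.shape, logdet.shape, torch.allclose(x, x_rec, atol=1e-5))
```

Because the scale and shift depend only on the untouched half and the reference sequence, the transform stays exactly invertible, which is what lets the speaker condition vary per frame without breaking the flow's likelihood computation.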
Pages: 899-903
Page count: 5