Variable-Length Speaker Conditioning in Flow-Based Text-to-Speech

Cited by: 0
Authors
Choi, Byoung Jin [1]
Jeong, Myeonghun [1]
Kim, Minchan [1]
Kim, Nam Soo [1]
Affiliations
[1] Seoul Natl Univ, Inst New Media & Commun, Dept Elect & Comp Engn, Seoul 08826, South Korea
Keywords
Speech synthesis; variable-length reference embedding sequence; zero-shot multi-speaker text-to-speech
DOI
10.1109/LSP.2024.3377588
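Chinese Library Classification (CLC) number
TM [Electrical engineering]; TN [Electronics and communication technology]
Discipline classification code
0808; 0809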
Abstract
In this letter, we propose a novel speaker conditioning technique that leverages a variable-length reference embedding sequence for flow-based text-to-speech (TTS) architectures in the context of zero-shot multi-speaker TTS (ZSM-TTS). Unlike conventional ZSM-TTS methods, which usually rely on a single fixed-dimensional vector to represent the entire reference speech, our approach aims to extract a variable-length embedding sequence for more flexible and efficient conditioning. We enhance the affine coupling function used in flow-based TTS architectures by introducing attentive speaker conditioning, which allows the speaker conditioning to vary locally. Our experiments demonstrate the effectiveness of the proposed method, showing improvements in speaker similarity, speech naturalness, and speech intelligibility over the baseline methods.
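The abstract gives no architectural details, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a toy affine coupling layer in which each frame of the untouched half of the input attends over a reference embedding sequence of arbitrary length, and the resulting per-frame context is used to predict the scale and shift. The class name `AttentiveAffineCoupling`, the single-head dot-product attention, the random linear projections, and the `tanh` bound on the log-scale are all illustrative assumptions.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    e = [math.exp(v - m) for v in row]
    s = sum(e)
    return [v / s for v in e]


def rand_matrix(rng, rows, cols, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class AttentiveAffineCoupling:
    """Toy coupling layer (hypothetical): scale and shift depend on a
    per-frame attention context over a variable-length reference sequence."""

    def __init__(self, channels, ref_dim, attn_dim=8, seed=0):
        if channels % 2:
            raise ValueError("channels must be even")
        rng = random.Random(seed)
        self.half = channels // 2
        self.attn_dim = attn_dim
        # Random weights stand in for trained parameters (assumption).
        self.w_q = rand_matrix(rng, self.half, attn_dim)
        self.w_k = rand_matrix(rng, ref_dim, attn_dim)
        self.w_v = rand_matrix(rng, ref_dim, attn_dim)
        self.w_out = rand_matrix(rng, self.half + attn_dim, 2 * self.half)

    def _scale_and_shift(self, x_a, ref):
        q = matmul(x_a, self.w_q)                  # T x attn_dim
        k = matmul(ref, self.w_k)                  # L x attn_dim
        v = matmul(ref, self.w_v)                  # L x attn_dim
        k_t = [list(col) for col in zip(*k)]       # attn_dim x L
        norm = math.sqrt(self.attn_dim)
        attn = [softmax([s / norm for s in row]) for row in matmul(q, k_t)]
        context = matmul(attn, v)                  # T x attn_dim, one per frame
        h = matmul([a + c for a, c in zip(x_a, context)], self.w_out)
        log_s = [[math.tanh(u) for u in row[:self.half]] for row in h]
        shift = [row[self.half:] for row in h]
        return log_s, shift

    def forward(self, x, ref):
        """x: T x channels frames; ref: L x ref_dim embeddings, any L >= 1."""
        x_a = [row[:self.half] for row in x]
        x_b = [row[self.half:] for row in x]
        log_s, shift = self._scale_and_shift(x_a, ref)
        y = [a + [b * math.exp(s) + t for b, s, t in zip(br, sr, tr)]
             for a, br, sr, tr in zip(x_a, x_b, log_s, shift)]
        log_det = sum(sum(row) for row in log_s)   # log |det Jacobian|
        return y, log_det

    def inverse(self, y, ref):
        y_a = [row[:self.half] for row in y]
        y_b = [row[self.half:] for row in y]
        # y_a equals x_a, so the same scale and shift are recomputed exactly.
        log_s, shift = self._scale_and_shift(y_a, ref)
        return [a + [(b - t) * math.exp(-s) for b, s, t in zip(br, sr, tr)]
                for a, br, sr, tr in zip(y_a, y_b, log_s, shift)]
```

Because the attention weights are computed per input frame, the conditioning signal can differ from frame to frame, and the layer's parameter shapes do not depend on the number of reference embeddings. The layer remains invertible because the scale and shift depend only on the half of the input that passes through unchanged.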
Pages: 899-903
Number of pages: 5