Variable-Length Speaker Conditioning in Flow-Based Text-to-Speech

Authors
Choi, Byoung Jin [1]
Jeong, Myeonghun [1]
Kim, Minchan [1]
Kim, Nam Soo [1]
Affiliations
[1] Seoul Natl Univ, Inst New Media & Commun, Dept Elect & Comp Engn, Seoul 08826, South Korea
Keywords
Speech synthesis; variable-length reference embedding sequence; zero-shot multi-speaker text-to-speech;
DOI
10.1109/LSP.2024.3377588
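Chinese Library Classification
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology]
Discipline classification codes
0808; 0809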
Abstract
In this letter, we propose a novel speaker conditioning technique that leverages a variable-length reference embedding sequence for flow-based text-to-speech (TTS) architectures in the context of zero-shot multi-speaker text-to-speech (ZSM-TTS). Unlike conventional ZSM-TTS methods, which usually rely on a single fixed-dimensional vector to represent the entire reference speech, our approach extracts a variable-length embedding sequence for more flexible and efficient conditioning. We enhance the affine coupling function used in flow-based TTS architectures by introducing attentive speaker conditioning, which allows the speaker conditioning to vary locally. Our experiments demonstrate the effectiveness of the proposed method, showing improvements in speaker similarity, speech naturalness, and speech intelligibility over the baseline methods.
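The abstract gives no architectural details, so the following is only a minimal NumPy sketch of the general idea, not the authors' implementation. It shows an affine coupling layer whose scale and shift are computed from an attention read-out over a variable-length reference embedding sequence. The class name `AttentiveCoupling`, the single-head dot-product attention, the random weights, and all dimensions are assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend(query, ref):
    """Scaled dot-product attention (assumed form).

    query: (T, D) one query per speech frame
    ref:   (L, D) reference embedding sequence; L may differ per utterance
    Returns a (T, D) context, so each frame gets its own speaker summary.
    """
    scores = query @ ref.T / np.sqrt(query.shape[-1])  # (T, L)
    return softmax(scores, axis=-1) @ ref              # (T, D)


class AttentiveCoupling:
    """Hypothetical affine coupling layer with attentive conditioning."""

    def __init__(self, channels, ref_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.half = channels // 2
        rest = channels - self.half
        # Random projections stand in for trained networks (assumption).
        self.w_q = rng.normal(0.0, 0.1, (self.half, ref_dim))
        self.w_out = rng.normal(0.0, 0.1, (self.half + ref_dim, 2 * rest))

    def _params(self, xa, ref):
        # Scale and shift depend only on the untouched half and the reference.
        ctx = attend(xa @ self.w_q, ref)                      # (T, ref_dim)
        h = np.tanh(np.concatenate([xa, ctx], axis=-1))
        log_s, t = np.split(h @ self.w_out, 2, axis=-1)
        return log_s, t

    def forward(self, x, ref):
        xa, xb = x[:, :self.half], x[:, self.half:]
        log_s, t = self._params(xa, ref)
        yb = xb * np.exp(log_s) + t
        # The log-determinant of the Jacobian is the sum of the log-scales.
        return np.concatenate([xa, yb], axis=-1), log_s.sum()

    def inverse(self, y, ref):
        ya, yb = y[:, :self.half], y[:, self.half:]
        log_s, t = self._params(ya, ref)
        xb = (yb - t) * np.exp(-log_s)
        return np.concatenate([ya, xb], axis=-1)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    layer = AttentiveCoupling(channels=8, ref_dim=6)
    x = rng.normal(size=(5, 8))        # 5 frames, 8 channels
    for ref_len in (3, 11):            # reference length is free to vary
        ref = rng.normal(size=(ref_len, 6))
        y, logdet = layer.forward(x, ref)
        print(ref_len, np.allclose(layer.inverse(y, ref), x))
```

Because the attention output has one row per speech frame regardless of the reference length, the same layer accepts references of any length, and the conditioning can differ from frame to frame while the coupling stays exactly invertible.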
Pages: 899-903
Number of pages: 5