Variable-Length Speaker Conditioning in Flow-Based Text-to-Speech

Cited by: 0

Authors:

Choi, Byoung Jin [1]

Jeong, Myeonghun [1]

Kim, Minchan [1]

Kim, Nam Soo [1]

Affiliation:

[1] Seoul Natl Univ, Inst New Media & Commun, Dept Elect & Comp Engn, Seoul 08826, South Korea

Source:

IEEE SIGNAL PROCESSING LETTERS | 2024 / Vol. 31

Keywords:

Speech synthesis; variable-length reference embedding sequence; zero-shot multi-speaker text-to-speech

DOI:

10.1109/LSP.2024.3377588

Chinese Library Classification (CLC) codes:

TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology]

Discipline classification codes:

0808; 0809

Abstract:

In this letter, we propose a novel speaker conditioning technique that leverages a variable-length reference embedding sequence for flow-based text-to-speech (TTS) architectures in the context of zero-shot multi-speaker text-to-speech (ZSM-TTS). Unlike conventional ZSM-TTS methods, which usually rely on a single fixed-dimensional vector to represent the entire reference speech, our approach aims to extract a variable-length embedding sequence for more flexible and efficient conditioning. We enhance the affine coupling function used in flow-based TTS architectures by introducing attentive speaker conditioning, which allows the speaker conditioning to vary locally. Our experiments demonstrate the effectiveness of the proposed method, showing improvements in speaker similarity, speech naturalness, and speech intelligibility over the baseline methods.
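The idea of an affine coupling function conditioned on a variable-length reference through attention can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the toy `scale_and_shift` rule (a stand-in for a learned network), the plain dot-product attention, and the assumption that reference vectors have the same dimension as half a frame are all hypothetical choices made for illustration.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, reference):
    """Dot-product attention of one frame over a reference of any length."""
    scale = math.sqrt(len(query))
    weights = softmax([sum(q * r for q, r in zip(query, ref)) / scale
                       for ref in reference])
    dim = len(reference[0])
    return [sum(w * ref[d] for w, ref in zip(weights, reference))
            for d in range(dim)]

def scale_and_shift(xa, context):
    # Toy stand-in (assumption) for the learned network that would
    # predict the log-scale and shift of the coupling transform.
    h = [math.tanh(a + c) for a, c in zip(xa, context)]
    log_s = [0.5 * v for v in h]
    shift = [a - c for a, c in zip(xa, context)]
    return log_s, shift

def coupling_forward(x, reference):
    half = len(x[0]) // 2
    y = []
    for frame in x:
        xa, xb = frame[:half], frame[half:]
        # Each frame gets its own attention context: local conditioning.
        log_s, shift = scale_and_shift(xa, attend(xa, reference))
        yb = [b * math.exp(s) + t for b, s, t in zip(xb, log_s, shift)]
        y.append(xa + yb)  # first half passes through unchanged
    return y

def coupling_inverse(y, reference):
    half = len(y[0]) // 2
    x = []
    for frame in y:
        ya, yb = frame[:half], frame[half:]
        # ya equals xa, so the same scale and shift can be recomputed.
        log_s, shift = scale_and_shift(ya, attend(ya, reference))
        xb = [(b - t) * math.exp(-s) for b, s, t in zip(yb, log_s, shift)]
        x.append(ya + xb)
    return x

# Three frames of 4 channels; reference vectors have 2 channels.
x = [[0.1, -0.2, 0.3, 0.4], [0.5, 0.0, -0.1, 0.2], [-0.3, 0.2, 0.1, -0.4]]
short_ref = [[1.0, 0.0], [0.0, 1.0]]
long_ref = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5], [-1.0, 0.3], [0.2, 0.2]]

for ref in (short_ref, long_ref):
    restored = coupling_inverse(coupling_forward(x, ref), ref)
    err = max(abs(a - b) for fa, fb in zip(x, restored) for a, b in zip(fa, fb))
    print(len(ref), "reference frames, max round-trip error:", err)
```

Because the scale and shift depend only on the untouched half of each frame and on the reference, the transform stays invertible regardless of how many reference frames are supplied, while the attention weights let the conditioning differ from frame to frame.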


Pages: 899-903

Number of pages: 5


