Variable-Length Speaker Conditioning in Flow-Based Text-to-Speech

Cited by: 0

Authors:

Choi, Byoung Jin [1]

Jeong, Myeonghun [1]

Kim, Minchan [1]

Kim, Nam Soo [1]

Affiliations:

[1] Seoul Natl Univ, Inst New Media & Commun, Dept Elect & Comp Engn, Seoul 08826, South Korea

Source:

IEEE SIGNAL PROCESSING LETTERS | 2024 / Vol. 31

Keywords:

Speech synthesis; variable-length reference embedding sequence; zero-shot multi-speaker text-to-speech

DOI:

10.1109/LSP.2024.3377588

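Chinese Library Classification (CLC) codes:

TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology]

Discipline classification codes:

0808; 0809

Abstract: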

In this letter, we propose a novel speaker conditioning technique that leverages a variable-length reference embedding sequence for flow-based text-to-speech (TTS) architectures in the context of zero-shot multi-speaker TTS (ZSM-TTS). Unlike conventional ZSM-TTS methods, which usually rely on a single fixed-dimensional vector to represent the entire reference speech, our approach aims to extract a variable-length embedding sequence for more flexible and efficient conditioning. We enhance the affine coupling function currently used in flow-based TTS architectures by introducing attentive speaker conditioning, which allows the speaker conditioning to vary locally. Our experiments demonstrate the effectiveness of the proposed method, showing improvements in speaker similarity, speech naturalness, and speech intelligibility over the baseline methods.
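
The idea summarized in the abstract, an affine coupling layer whose scale and shift are conditioned through attention over a variable-length reference sequence, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the single-head dot-product attention, the one-layer `tanh` network, and every name and size used here (`cross_attention`, `coupling_forward`, `coupling_inverse`, `init_params`, `hidden`, `ref_dim`) are assumptions chosen only to show the mechanism. Each target frame queries the reference embedding sequence, so the conditioning can differ from frame to frame and the reference may have any length.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, ref_seq, w_q, w_k, w_v):
    # Each of the T target frames attends over the L reference embeddings.
    q = queries @ w_q                                   # (T, hidden)
    k = ref_seq @ w_k                                   # (L, hidden)
    v = ref_seq @ w_v                                   # (L, hidden)
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (T, L)
    return weights @ v                                  # (T, hidden)

def coupling_params(x_a, ref_seq, p):
    # Hypothetical one-layer network producing log-scale and shift from the
    # untouched half x_a plus the attended speaker context.
    ctx = cross_attention(x_a, ref_seq, p["w_q"], p["w_k"], p["w_v"])
    h = np.tanh(x_a @ p["w_in"] + ctx)
    log_s = np.tanh(h @ p["w_s"])    # bounded log-scale for stability
    b = h @ p["w_b"]
    return log_s, b

def coupling_forward(x, ref_seq, p):
    # Affine coupling: the first half passes through, the second is transformed.
    x_a, x_b = np.split(x, 2, axis=-1)
    log_s, b = coupling_params(x_a, ref_seq, p)
    y_b = x_b * np.exp(log_s) + b
    return np.concatenate([x_a, y_b], axis=-1), log_s.sum()

def coupling_inverse(y, ref_seq, p):
    # Exact inverse: the parameters depend only on the unchanged half.
    y_a, y_b = np.split(y, 2, axis=-1)
    log_s, b = coupling_params(y_a, ref_seq, p)
    x_b = (y_b - b) * np.exp(-log_s)
    return np.concatenate([y_a, x_b], axis=-1)

def init_params(channels, ref_dim, hidden, seed=0):
    # Small random weights; channels must be even.
    rng = np.random.default_rng(seed)
    half = channels // 2
    shapes = {"w_q": (half, hidden), "w_k": (ref_dim, hidden),
              "w_v": (ref_dim, hidden), "w_in": (half, hidden),
              "w_s": (hidden, half), "w_b": (hidden, half)}
    return {name: rng.normal(scale=0.1, size=s) for name, s in shapes.items()}
```

Because the attention output always has one row per target frame, the same parameters accept a reference of 5 frames or 500 frames, whereas a single fixed-dimensional speaker vector would apply identical conditioning to every frame.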


Pages: 899-903

Number of pages: 5
