A Controllable Multi-Lingual Multi-Speaker Multi-Style Text-to-Speech Synthesis With Multivariate Information Minimization

Cited by: 1

Authors:

Cheon, Sung Jun [1,2]

Choi, Byoung Jin [1,2]

Kim, Minchan [1,2]

Lee, Hyeonseung [1,2]

Kim, Nam Soo [1,2]

Affiliations:

[1] Seoul Natl Univ, Dept Elect & Comp Engn, Seoul 08826, South Korea

[2] Seoul Natl Univ, Inst New Media & Commun, Seoul 08826, South Korea

Source:

IEEE SIGNAL PROCESSING LETTERS | 2022 / Vol. 29

Keywords:

Training; Upper bound; Speech synthesis; Correlation; Mutual information; Synthesizers; Estimation; Disentanglement; mutual information; speech synthesis; style modeling; total correlation;

DOI:

10.1109/LSP.2021.3125259

Chinese Library Classification (CLC):

TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology];

Discipline classification codes:

0808 ; 0809 ;

Abstract:

In this letter, we propose a multivariate information minimization method that disentangles three or more latent representations. We show that control factors can be disentangled by minimizing interactive dependency, which can be expressed as a sum of mutual information upper bound terms. Since the upper bound estimate converges from the early training stage, there is little performance degradation due to auxiliary loss. The proposed technique is applied to train a text-to-speech synthesizer with multi-lingual, multi-speaker, and multi-style corpora. Subjective listening tests validate that the proposed method can improve the synthesizer in terms of quality as well as controllability.
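
The abstract describes the dependency among three or more latents as a sum of mutual information terms. As a rough illustration only, and not the letter's actual objective (which relies on learned upper-bound estimates), the sketch below checks the standard chain identity TC(z1, z2, z3) = I(z1; z2) + I(z1, z2; z3) exactly on a small discrete distribution. The toy joint distribution and all function names are hypothetical and chosen for this example.

```python
import itertools
import math


def entropy(pmf):
    """Shannon entropy (in nats) of a dict mapping outcomes to probabilities."""
    return -sum(p * math.log(p) for p in pmf.values() if p > 0)


def marginal(joint, axes):
    """Marginalize a joint pmf (keys are tuples) onto the given axes."""
    out = {}
    for outcome, p in joint.items():
        key = tuple(outcome[a] for a in axes)
        out[key] = out.get(key, 0.0) + p
    return out


def mutual_information(joint, a_axes, b_axes):
    """I(A; B) = H(A) + H(B) - H(A, B) for two groups of variables."""
    h_a = entropy(marginal(joint, a_axes))
    h_b = entropy(marginal(joint, b_axes))
    h_ab = entropy(marginal(joint, a_axes + b_axes))
    return h_a + h_b - h_ab


def total_correlation(joint, n):
    """TC = sum_i H(z_i) - H(z_1, ..., z_n)."""
    singles = sum(entropy(marginal(joint, (i,))) for i in range(n))
    return singles - entropy(joint)


# Hypothetical toy joint over three binary variables (an assumption for
# illustration; it stands in for three dependent control factors).
weights = {
    o: 1.0 + o[0] + 2 * o[1] * o[2] + o[0] * o[2]
    for o in itertools.product((0, 1), repeat=3)
}
total = sum(weights.values())
joint = {o: w / total for o, w in weights.items()}

tc = total_correlation(joint, 3)
chain = mutual_information(joint, (0,), (1,)) + mutual_information(joint, (0, 1), (2,))
print(f"total correlation: {tc:.6f}")
print(f"sum of MI terms:   {chain:.6f}")
```

Because each term in the chain is a mutual information, replacing every term with an upper bound yields an upper bound on the total, which is one way to read the abstract's "sum of mutual information upper bound terms".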


Pages: 55-59

Page count: 5
