A Controllable Multi-Lingual Multi-Speaker Multi-Style Text-to-Speech Synthesis With Multivariate Information Minimization

Cited by: 1

Authors:

Cheon, Sung Jun [1,2]

Choi, Byoung Jin [1,2]

Kim, Minchan [1,2]

Lee, Hyeonseung [1,2]

Kim, Nam Soo [1,2]

Affiliations:

[1] Seoul Natl Univ, Dept Elect & Comp Engn, Seoul 08826, South Korea

[2] Seoul Natl Univ, Inst New Media & Commun, Seoul 08826, South Korea

Source:

IEEE SIGNAL PROCESSING LETTERS | 2022 / Vol. 29

Keywords:

Training; Upper bound; Speech synthesis; Correlation; Mutual information; Synthesizers; Estimation; Disentanglement; mutual information; speech synthesis; style modeling; total correlation;

DOI:

10.1109/LSP.2021.3125259

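Chinese Library Classification:

TM [Electrical Engineering]; TN [Electronics and Communication Technology];

Discipline classification codes: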

0808 ; 0809 ;

Abstract:

In this letter, we propose a multivariate information minimization method that disentangles three or more latent representations. We show that control factors can be disentangled by minimizing interactive dependency, which can be expressed as a sum of mutual information upper bound terms. Since the upper bound estimate converges from the early training stage, there is little performance degradation due to auxiliary loss. The proposed technique is applied to train a text-to-speech synthesizer with multi-lingual, multi-speaker, and multi-style corpora. Subjective listening tests validate that the proposed method can improve the synthesizer in terms of quality as well as controllability.
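
One way to read "a sum of mutual information upper bound terms" is through the standard chain-rule decomposition of total correlation (listed among the keywords). The sketch below is an illustration under stated assumptions, not the letter's own derivation: the three latent variables are written here as z_l, z_s, z_y (language, speaker, style), symbols chosen for this example, and \hat{I}_{ub} stands for any valid mutual information upper bound. The letter's exact definition of interactive dependency and its grouping of terms may differ.

```latex
\begin{align}
% Total correlation of three latent variables (assumed notation)
\mathrm{TC}(z_l, z_s, z_y)
  &= D_{\mathrm{KL}}\!\left( p(z_l, z_s, z_y) \,\|\, p(z_l)\,p(z_s)\,p(z_y) \right) \\
% Chain-rule decomposition into mutual information terms (standard identity)
  &= I(z_l; z_s) + I\big((z_l, z_s); z_y\big) \\
% Replacing each term with an upper bound gives a minimizable objective
  &\le \hat{I}_{ub}(z_l; z_s) + \hat{I}_{ub}\big((z_l, z_s); z_y\big)
\end{align}
```

The second line follows from writing each mutual information term as a difference of entropies: the joint entropy of (z_l, z_s) cancels, leaving the sum of marginal entropies minus the full joint entropy. Driving the right-hand side of the last line toward zero therefore pushes the three representations toward mutual independence, provided each \hat{I}_{ub} really is an upper bound.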


Pages: 55-59

Page count: 5
