A Controllable Multi-Lingual Multi-Speaker Multi-Style Text-to-Speech Synthesis With Multivariate Information Minimization

被引:1
|
作者
Cheon, Sung Jun [1 ,2 ]
Choi, Byoung Jin [1 ,2 ]
Kim, Minchan [1 ,2 ]
Lee, Hyeonseung [1 ,2 ]
Kim, Nam Soo [1 ,2 ]
机构
[1] Seoul Natl Univ, Dept Elect & Comp Engn, Seoul 08826, South Korea
[2] Seoul Natl Univ, Inst New Media & Commun, Seoul 08826, South Korea
关键词
Training; Upper bound; Speech synthesis; Correlation; Mutual information; Synthesizers; Estimation; Disentanglement; mutual information; speech synthesis; style modeling; total correlation;
D O I
10.1109/LSP.2021.3125259
中图分类号
TM [电工技术]; TN [电子技术、通信技术];
学科分类号
0808 ; 0809 ;
摘要
In this letter, we propose a multivariate information minimization method that disentangles three or more latent representations. We show that control factors can be disentangled by minimizing interactive dependency, which can be expressed as a sum of mutual information upper bound terms. Since the upper bound estimate converges from the early training stage, there is little performance degradation due to auxiliary loss. The proposed technique is applied to train a text-to-speech synthesizer with multi-lingual, multi-speaker, and multi-style corpora. Subjective listening tests validate that the proposed method can improve the synthesizer in terms of quality as well as controllability.
引用
收藏
页码:55 / 59
页数:5
相关论文
共 50 条
  • [1] Lightweight, Multi-Speaker, Multi-Lingual Indic Text-to-Speech
    Singh, Abhayjeet
    Nagireddi, Amala
    Jayakumar, Anjali
    Deekshitha, G.
    Bandekar, Jesuraja
    Roopa, R.
    Badiger, Sandhya
    Udupa, Sathvik
    Kumar, Saurabh
    Ghosh, Prasanta Kumar
    Murthy, Hema A.
    Zen, Heiga
    Kumar, Pranaw
    Kant, Kamal
    Bole, Amol
    Singh, Bira Chandra
    Tokuda, Keiichi
    Hasegawa-Johnson, Mark
    Olbrich, Philipp
    IEEE OPEN JOURNAL OF SIGNAL PROCESSING, 2024, 5 : 790 - 798
  • [2] Multi-Lingual Multi-Speaker Text-to-Speech Synthesis for Voice Cloning with Online Speaker Enrollment
    Liu, Zhaoyu
    Mak, Brian
    INTERSPEECH 2020, 2020, : 2932 - 2936
  • [3] LIGHT-TTS: LIGHTWEIGHT MULTI-SPEAKER MULTI-LINGUAL TEXT-TO-SPEECH
    Li, Song
    Ouyang, Beibei
    Li, Lin
    Hong, Qingyang
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 8383 - 8387
  • [4] INVESTIGATING ON INCORPORATING PRETRAINED AND LEARNABLE SPEAKER REPRESENTATIONS FOR MULTI-SPEAKER MULTI-STYLE TEXT-TO-SPEECH
    Chien, Chung-Ming
    Lin, Jheng-Hao
    Huang, Chien-yu
    Hsu, Po-chun
    Lee, Hung-yi
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 8588 - 8592
  • [5] Multi-speaker Multi-style Text-to-speech Synthesis with Single-speaker Single-style Training Data Scenarios
    Xie, Qicong
    Li, Tao
    Wang, Xinsheng
    Wang, Zhichao
    Xie, Lei
    Yu, Guoqiao
    Wan, Guanglu
    2022 13TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), 2022, : 66 - 70
  • [6] A Universal Multi-Speaker Multi-Style Text-to-Speech via Disentangled Representation Learning based on Renyi Divergence Minimization
    Paul, Dipjyoti
    Mukherjee, Sankar
    Pantazis, Yannis
    Stylianou, Yannis
    INTERSPEECH 2021, 2021, : 3625 - 3629
  • [7] Multi-speaker Multi-style Speech Synthesis with Timbre and Style Disentanglement
    Song, Wei
    Yue, Yanghao
    Zhang, Ya-jie
    Zhang, Zhengchen
    Wu, Youzheng
    He, Xiaodong
    MAN-MACHINE SPEECH COMMUNICATION, NCMMSC 2022, 2023, 1765 : 132 - 140
  • [8] Transfer Learning for Low-Resource, Multi-Lingual, and Zero-Shot Multi-Speaker Text-to-Speech
    Jeong, Myeonghun
    Kim, Minchan
    Choi, Byoung Jin
    Yoon, Jaesam
    Jang, Won
    Kim, Nam Soo
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2024, 32 : 1519 - 1530
  • [9] Cross-lingual, Multi-speaker Text-To-Speech Synthesis Using Neural Speaker Embedding
    Chen, Mengnan
    Chen, Minchuan
    Liang, Shuang
    Ma, Jun
    Chen, Lei
    Wang, Shaojun
    Xiao, Jing
    INTERSPEECH 2019, 2019, : 2105 - 2109
  • [10] Multi-speaker Emotional Text-to-speech Synthesizer
    Cho, Sungjae
    Lee, Soo-Young
    INTERSPEECH 2021, 2021, : 2337 - 2338