NNSPEECH: SPEAKER-GUIDED CONDITIONAL VARIATIONAL AUTOENCODER FOR ZERO-SHOT MULTI-SPEAKER TEXT-TO-SPEECH

Cited by: 7
Authors
Zhao, Botao [1 ,2 ]
Zhang, Xulong [1 ]
Wang, Jianzong [1 ]
Cheng, Ning [1 ]
Xiao, Jing [1 ]
Affiliations
[1] Ping An Technol Shenzhen Co Ltd, Shenzhen, Guangdong, Peoples R China
[2] Fudan Univ, Inst Sci & Technol Brain Inspired Intelligence, Shanghai, Peoples R China
Keywords
zero-shot; multi-speaker text-to-speech; conditional variational autoencoder;
DOI
10.1109/ICASSP43922.2022.9746875
Chinese Library Classification number
O42 [Acoustics];
Discipline classification codes
070206 ; 082403 ;
Abstract
Multi-speaker text-to-speech (TTS) using only a small amount of adaptation data is a challenge in practical applications. To address this, we propose a zero-shot multi-speaker TTS model, named nnSpeech, that can synthesize a new speaker's voice without fine-tuning, using only one adaptation utterance. Rather than using a speaker representation module to extract the characteristics of new speakers, our method is based on a speaker-guided conditional variational autoencoder and generates a latent variable Z that contains both speaker characteristics and content information. The distribution of the latent variable Z is approximated by another variable conditioned on the reference mel-spectrogram and phonemes. Experiments on an English corpus, a Mandarin corpus, and a cross-dataset setting show that our model can generate natural speech that resembles the target speaker with only one adaptation utterance.
Pages: 4293 - 4297
Page count: 5
Related papers
50 in total
  • [41] Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation
    Tu, Tao
    Chen, Yuan-Jui
    Liu, Alexander H.
    Lee, Hung-yi
    INTERSPEECH 2020, 2020, : 3191 - 3195
  • [42] XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model
    Casanova, Edresson
    Davis, Kelly
    Goelge, Eren
    Goekncar, Gorkem
    Gulea, Iulian
    Hart, Logan
    Aljafari, Aya
    Meyer, Joshua
    Morais, Reuben
    Olayemi, Samuel
    Weber, Julian
    INTERSPEECH 2024, 2024, : 4978 - 4982
  • [43] EXACT PROSODY CLONING IN ZERO-SHOT MULTISPEAKER TEXT-TO-SPEECH
    Lux, Florian
    Koch, Julia
    Vu, Ngoc Thang
    2022 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT, 2022, : 962 - 969
  • [44] MnTTS2: An Open-Source Multi-speaker Mongolian Text-to-Speech Synthesis Dataset
    Liang, Kailin
    Liu, Bin
    Hu, Yifan
    Liu, Rui
    Bao, Feilong
    Gao, Guanglai
    MAN-MACHINE SPEECH COMMUNICATION, NCMMSC 2022, 2023, 1765 : 318 - 329
  • [45] A Controllable Multi-Lingual Multi-Speaker Multi-Style Text-to-Speech Synthesis With Multivariate Information Minimization
    Cheon, Sung Jun
    Choi, Byoung Jin
    Kim, Minchan
    Lee, Hyeonseung
    Kim, Nam Soo
    IEEE SIGNAL PROCESSING LETTERS, 2022, 29 : 55 - 59
  • [46] Generalizable Zero-Shot Speaker Adaptive Speech Synthesis with Disentangled Representations
    Wang, Wenbin
    Song, Yang
    Jha, Sanjay
    INTERSPEECH 2023, 2023, : 4454 - 4458
  • [47] Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration
    Tang, Chuanxin
    Luo, Chong
    Zhao, Zhiyuan
    Yin, Dacheng
    Zhao, Yucheng
    Zeng, Wenjun
    INTERSPEECH 2021, 2021, : 3600 - 3604
  • [48] SCALING NVIDIA'S MULTI-SPEAKER MULTI-LINGUAL TTS SYSTEMS WITH ZERO-SHOT TTS TO INDIC LANGUAGES
    Arora, Akshit
    Badlani, Rohan
    Kim, Sungwon
    Valle, Rafael
    Catanzaro, Bryan
    2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING WORKSHOPS, ICASSPW 2024, 2024, : 115 - 116
  • [49] Speaker Specific Phrase Break Modeling with Conditional Random Fields for Text-to-Speech
    Louw, Johannes A.
    Moodley, Avashlin
    2016 PATTERN RECOGNITION ASSOCIATION OF SOUTH AFRICA AND ROBOTICS AND MECHATRONICS INTERNATIONAL CONFERENCE (PRASA-ROBMECH), 2016,
  • [50] Zero-Shot Voice Cloning Text-to-Speech for Dysphonia Disorder Speakers
    Azizah, Kurniawati
    IEEE ACCESS, 2024, 12 : 63528 - 63547