NNSPEECH: SPEAKER-GUIDED CONDITIONAL VARIATIONAL AUTOENCODER FOR ZERO-SHOT MULTI-SPEAKER TEXT-TO-SPEECH

Cited by: 7
Authors
Zhao, Botao [1 ,2 ]
Zhang, Xulong [1 ]
Wang, Jianzong [1 ]
Cheng, Ning [1 ]
Xiao, Jing [1 ]
Affiliations
[1] Ping An Technol Shenzhen Co Ltd, Shenzhen, Guangdong, Peoples R China
[2] Fudan Univ, Inst Sci & Technol Brain Inspired Intelligence, Shanghai, Peoples R China
Keywords
zero-shot; multi-speaker text-to-speech; conditional variational autoencoder;
DOI
10.1109/ICASSP43922.2022.9746875
Chinese Library Classification (CLC) number
O42 [Acoustics];
Discipline classification codes
070206 ; 082403 ;
Abstract
Multi-speaker text-to-speech (TTS) using only a small amount of adaptation data is a challenge in practical applications. To address this, we propose a zero-shot multi-speaker TTS model, named nnSpeech, that can synthesize a new speaker's voice without fine-tuning, using only one adaptation utterance. Rather than using a speaker representation module to extract the characteristics of new speakers, our method is based on a speaker-guided conditional variational autoencoder and generates a latent variable Z that contains both speaker characteristics and content information. The distribution of the latent variable Z is approximated by another variable conditioned on the reference mel-spectrogram and phonemes. Experiments on an English corpus, a Mandarin corpus, and a cross-dataset setting show that our model can generate natural speech that is similar to the target speaker with only one adaptation utterance.
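The abstract does not state the training objective, so the following is only a minimal sketch of the closed-form KL term and reparameterized sampling commonly used in conditional variational autoencoders with diagonal Gaussian latents. The function names `kl_diag_gaussians` and `sample_latent` are hypothetical, and the assumption that both the distribution of Z and its approximation are diagonal Gaussians is not confirmed by this record.

```python
import math
import random


def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between two diagonal Gaussians.

    Each argument is a list of per-dimension means or log-variances.
    Assumption: both distributions are diagonal Gaussians (generic CVAE
    setup, not necessarily the exact nnSpeech formulation).
    """
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        total += 0.5 * (
            lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0
        )
    return total


def sample_latent(mu, logvar, rng):
    """Reparameterized sample z = mu + sigma * eps with eps ~ N(0, 1)."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu, logvar)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical 2-dimensional latent statistics, for illustration only.
    z = sample_latent([0.0, 1.0], [0.0, -1.0], rng)
    print(len(z), kl_diag_gaussians([0.0, 1.0], [0.0, -1.0], [0.0, 0.0], [0.0, 0.0]))
```

In a setup like the one the abstract describes, one set of statistics would presumably come from a network conditioned on the reference mel-spectrogram and phonemes, and the KL term would pull the two distributions together during training; that mapping is an interpretation, not something this record specifies.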
Pages: 4293-4297
Page count: 5
Related papers
50 items in total
  • [31] Autoregressive multi-speaker model in Chinese speech synthesis based on variational autoencoder
    Hao, Xiaoyang
    Zhang, Pengyuan
    Shengxue Xuebao/Acta Acustica, 2022, 47 (03): : 405 - 416
  • [32] LIGHT-TTS: LIGHTWEIGHT MULTI-SPEAKER MULTI-LINGUAL TEXT-TO-SPEECH
    Li, Song
    Ouyang, Beibei
    Li, Lin
    Hong, Qingyang
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 8383 - 8387
  • [33] Multi-Task Adversarial Training Algorithm for Multi-Speaker Neural Text-to-Speech
    Nakai, Yusuke
    Saito, Yuki
    Udagawa, Kenta
    Saruwatari, Hiroshi
    PROCEEDINGS OF 2022 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2022, : 743 - 748
  • [34] MultiSpeech: Multi-Speaker Text to Speech with Transformer
    Chen, Mingjian
    Tan, Xu
    Ren, Yi
    Xu, Jin
    Sun, Hao
    Zhao, Sheng
    Qin, Tao
    INTERSPEECH 2020, 2020, : 4024 - 4028
  • [35] Multi-Scale Speaker Vectors for Zero-Shot Speech Synthesis
    Cory, Tristin
    Iqbal, Razib
    2022 IEEE 46TH ANNUAL COMPUTERS, SOFTWARE, AND APPLICATIONS CONFERENCE (COMPSAC 2022), 2022, : 496 - 501
  • [36] An emotional speech synthesis markup language processor for multi-speaker and emotional text-to-speech applications
    Ryu, Se-Hui
    Cho, Hee
    Lee, Ju-Hyun
    Hong, Ki-Hyung
    JOURNAL OF THE ACOUSTICAL SOCIETY OF KOREA, 2021, 40 (05): : 523 - 529
  • [37] Adapter-Based Extension of Multi-Speaker Text-To-Speech Model for New Speakers
    Hsieh, Cheng-Ping
    Ghosh, Subhankar
    Ginsburg, Boris
    INTERSPEECH 2023, 2023, : 3028 - 3032
  • [38] Adapitch: Adaption Multi-Speaker Text-to-Speech Conditioned on Pitch Disentangling with Untranscribed Data
    Zhang, Xulong
    Wang, Jianzong
    Cheng, Ning
    Xiao, Jing
    2022 18TH INTERNATIONAL CONFERENCE ON MOBILITY, SENSING AND NETWORKING, MSN, 2022, : 456 - 460
  • [39] VOICECRAFT: Zero-Shot Speech Editing and Text-to-Speech in the Wild
    Peng, Puyuan
    Huang, Po-Yao
    Le, Shang-Wen
    Mohamed, Abdelrahman
    Harwath, David
    PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 1: LONG PAPERS, 2024, : 12442 - 12462
  • [40] Multi-speaker Multi-style Text-to-speech Synthesis with Single-speaker Single-style Training Data Scenarios
    Xie, Qicong
    Li, Tao
    Wang, Xinsheng
    Wang, Zhichao
    Xie, Lei
    Yu, Guoqiao
    Wan, Guanglu
    2022 13TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), 2022, : 66 - 70