NNSPEECH: SPEAKER-GUIDED CONDITIONAL VARIATIONAL AUTOENCODER FOR ZERO-SHOT MULTI-SPEAKER TEXT-TO-SPEECH

Cited by: 7

Authors:

Zhao, Botao [1,2]

Zhang, Xulong [1]

Wang, Jianzong [1]

Cheng, Ning [1]

Xiao, Jing [1]

Affiliations:

[1] Ping An Technol Shenzhen Co Ltd, Shenzhen, Guangdong, Peoples R China

[2] Fudan Univ, Inst Sci & Technol Brain Inspired Intelligence, Shanghai, Peoples R China

Source:

2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2022

Keywords:

zero-shot; multi-speaker text-to-speech; conditional variational autoencoder

DOI:

10.1109/ICASSP43922.2022.9746875

Chinese Library Classification number:

O42 [Acoustics]

Discipline classification codes:

070206; 082403

Abstract:

Multi-speaker text-to-speech (TTS) with only a small amount of adaptation data is a challenge in practical applications. To address this, we propose a zero-shot multi-speaker TTS model, named nnSpeech, that can synthesize a new speaker's voice without fine-tuning, using only one adaptation utterance. Rather than using a speaker representation module to extract the characteristics of new speakers, our method is based on a speaker-guided conditional variational autoencoder and generates a latent variable Z that contains both speaker characteristics and content information. The distribution of the latent variable Z is approximated by another variable conditioned on the reference mel-spectrogram and phonemes. Experiments on an English corpus, a Mandarin corpus, and a cross-dataset setting show that our model can generate natural speech similar to the target speaker's with only one adaptation utterance.

Pages: 4293-4297

Page count: 5
