Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with Unsupervised Text Pretraining

Cited by: 0

Authors:

Saeki, Takaaki [1]

Maiti, Soumi [2]

Li, Xinjian [2]

Watanabe, Shinji [2]

Takamichi, Shinnosuke [1]

Saruwatari, Hiroshi [1]

Affiliations:

[1] Univ Tokyo, Tokyo, Japan

[2] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA

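Source:

PROCEEDINGS OF THE THIRTY-SECOND INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2023 | 2023

Keywords:

DOI:

Not available

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract: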

While neural text-to-speech (TTS) has achieved human-like natural synthetic speech, multilingual TTS systems are limited to resource-rich languages because they need paired text and studio-quality audio data. This paper proposes a method for zero-shot multilingual TTS that uses only text data for the target language. Using text-only data allows TTS systems to be developed for low-resource languages for which only textual resources are available, making TTS accessible to thousands of languages. Inspired by the strong cross-lingual transferability of multilingual language models, our framework first performs masked language model pretraining on multilingual text-only data. We then train this model on paired data in a supervised manner while freezing a language-aware embedding layer. This allows inference even for languages that are absent from the paired data but present in the text-only data. Evaluation results demonstrate highly intelligible zero-shot TTS, with a character error rate of less than 12% for an unseen language.
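The abstract reports intelligibility as a character error rate (CER) but does not say how it was computed. As a minimal sketch, assuming the common definition (character-level Levenshtein distance between a reference text and a transcript of the synthesized speech, divided by the reference length), the metric could be calculated as follows. The function names `edit_distance` and `character_error_rate` are hypothetical and are not taken from the paper.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance between two strings."""
    # prev[j] holds the distance between the processed prefix of ref
    # and the first j characters of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference character
                curr[j - 1] + 1,     # insertion of a hypothesis character
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """CER under the assumed definition: edit distance / reference length."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return edit_distance(ref, hyp) / len(ref)
```

Under this definition, a CER below 12% corresponds to fewer than 12 character edits per 100 reference characters. Any text normalization applied before scoring (case folding, punctuation removal) would change the value and is not specified in the abstract.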


Pages: 5179-5187

Page count: 9

