Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with Unsupervised Text Pretraining

Cited by: 0
Authors
Saeki, Takaaki [1]
Maiti, Soumi [2]
Li, Xinjian [2]
Watanabe, Shinji [2]
Takamichi, Shinnosuke [1]
Saruwatari, Hiroshi [1]
Affiliations
[1] Univ Tokyo, Tokyo, Japan
[2] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
While neural text-to-speech (TTS) has achieved human-like, natural synthetic speech, multilingual TTS systems are limited to resource-rich languages because they need paired text and studio-quality audio data. This paper proposes a method for zero-shot multilingual TTS that uses only text data for the target language. Using text-only data allows TTS systems to be developed for low-resource languages for which only textual resources are available, making TTS accessible to thousands of languages. Inspired by the strong cross-lingual transferability of multilingual language models, our framework first performs masked language model pretraining on multilingual text-only data. We then train this model on paired data in a supervised manner while keeping a language-aware embedding layer frozen. This allows inference even for languages that are not included in the paired data but are present in the text-only data. Evaluation results demonstrate highly intelligible zero-shot TTS, with a character error rate of less than 12% for an unseen language.
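The intelligibility figure in the abstract is a character error rate (CER). As a minimal sketch, assuming the common definition (character-level edit distance divided by the number of reference characters), it could be computed as below. The function names are hypothetical, and the abstract does not say how the paper transcribes the synthesized speech or normalizes the text before scoring, so this illustrates the metric only, not the authors' evaluation pipeline.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into an empty string takes i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a character of ref
                curr[j - 1] + 1,     # insert a character of hyp
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """CER = edit distance / reference length (assumed definition)."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

In a TTS evaluation, `ref` would typically be the input text and `hyp` the transcript that a speech recognizer produces from the synthesized audio. Because insertions are counted, the value can exceed 1.0 when the hypothesis is much longer than the reference.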
Pages: 5179-5187
Page count: 9