Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with Unsupervised Text Pretraining

Authors
Saeki, Takaaki [1]
Maiti, Soumi [2]
Li, Xinjian [2]
Watanabe, Shinji [2]
Takamichi, Shinnosuke [1]
Saruwatari, Hiroshi [1]
Affiliations
[1] Univ Tokyo, Tokyo, Japan
[2] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
While neural text-to-speech (TTS) has achieved human-like natural synthetic speech, multilingual TTS systems are limited to resource-rich languages due to the need for paired text and studio-quality audio data. This paper proposes a method for zero-shot multilingual TTS using text-only data for the target language. The use of text-only data allows the development of TTS systems for low-resource languages for which only textual resources are available, making TTS accessible to thousands of languages. Inspired by the strong cross-lingual transferability of multilingual language models, our framework first performs masked language model pretraining with multilingual text-only data. Then we train this model with paired data in a supervised manner, while freezing a language-aware embedding layer. This allows inference even for languages not included in the paired data but present in the text-only data. Evaluation results demonstrate highly intelligible zero-shot TTS with a character error rate of less than 12% for an unseen language.
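The abstract reports intelligibility as a character error rate (CER). This record does not say how the paper computes it, so the following is only a minimal sketch of the standard definition: the character-level edit distance between a reference text and a hypothesis transcript, divided by the reference length. The function names `edit_distance` and `character_error_rate` are hypothetical, and the step that produces the hypothesis (for example, transcribing synthesized speech with a recognizer) is assumed and not shown.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty string
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of r
                curr[j - 1] + 1,     # insertion of h
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """CER = edit distance / number of reference characters (assumed definition)."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

Under this definition, a CER below 12% means fewer than 12 character edits per 100 reference characters. Evaluation pipelines may also normalize the text first (case, punctuation, whitespace); that step is left out here.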
Pages: 5179-5187
Number of pages: 9