Text-Inductive Graphone-Based Language Adaptation for Low-Resource Speech Synthesis

被引:4
|
作者
Saeki, Takaaki [1 ]
Maiti, Soumi [2 ]
Li, Xinjian [2 ]
Watanabe, Shinji [2 ]
Takamichi, Shinnosuke [1 ]
Saruwatari, Hiroshi [1 ]
机构
[1] Univ Tokyo, Grad Sch Informat Sci & Technol, Bunkyo Ku, Tokyo 1138656, Japan
[2] Carnegie Mellon Univ, Language Technol Inst, Pittsburgh, PA 15213 USA
关键词
Multilingual text-to-speech; low-resource adaptation; adaptation of masked language model; graphone;
D O I
10.1109/TASLP.2024.3369537
中图分类号
O42 [声学];
学科分类号
070206 ; 082403 ;
摘要
Neural text-to-speech (TTS) systems have made significant progress in generating natural synthetic speech. However, neural TTS requires large amounts of paired training data, which limits its applicability to a small number of resource-rich languages. Previous work on low-resource TTS has addressed the data hungriness based on transfer learning from a multilingual model to low-resource languages, but it still relies heavily on the availability of paired data for the target languages. In this paper, we propose a text-inductive language adaptation framework for low-resource TTS to address the cost of collecting the paired data for low-resource languages. To inject textual knowledge during transfer learning, our framework employs a two-stage adaptation scheme that utilizes both text-only and supervised data for the target language. In the text-based adaptation stage, we update the language-aware embedding layer with a masked language model objective using text-only data for the target language. In the supervised adaptation stage, the entire TTS model is updated using paired data for the target language. We also propose a graphone-based multilingual training method that jointly uses graphemes and International Phonetic Alphabet symbols (referred to as graphones) for resource-rich languages, while using only graphemes for low-resource languages. This approach facilitates the transfer of pronunciation knowledge from resource-rich to low-resource languages. Through extensive evaluations, we demonstrate that 1) our framework with text-based adaptation outperforms the previous supervised transfer learning approach and 2) the proposed graphone-based training method further improves the performance of both multilingual TTS and low-resource language adaptation. With only 5 minutes of paired data for fine-tuning, our method achieved highly intelligible synthetic speech with the character error rates of around 6 % for a target language.
引用
收藏
页码:1829 / 1844
页数:16
相关论文
共 50 条
  • [21] Zero Shot Text to Speech Augmentation for Automatic Speech Recognition on Low-Resource Accented Speech Corpora
    Nespoli, Francesco
    Barreda, Daniel
    Naylor, Patrick A.
    FIFTY-SEVENTH ASILOMAR CONFERENCE ON SIGNALS, SYSTEMS & COMPUTERS, IEEECONF, 2023, : 1080 - 1084
  • [22] Cross-Lingual Language Modeling for Low-Resource Speech Recognition
    Xu, Ping
    Fung, Pascale
    IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2013, 21 (06): : 1134 - 1144
  • [23] LOW-RESOURCE LANGUAGE IDENTIFICATION FROM SPEECH USING TRANSFER LEARNING
    Feng, Kexin
    Chaspari, Theodora
    2019 IEEE 29TH INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING (MLSP), 2019,
  • [24] Language-universal phonetic encoder for low-resource speech recognition
    Feng, Siyuan
    Tu, Ming
    Xia, Rui
    Huang, Chuanzeng
    Wang, Yuxuan
    INTERSPEECH 2023, 2023, : 1429 - 1433
  • [25] A General Procedure for Improving Language Models in Low-Resource Speech Recognition
    Liu, Qian
    Zhang, Wei-Qiang
    Liu, Jia
    Liu, Yao
    PROCEEDINGS OF THE 2019 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING (IALP), 2019, : 428 - 433
  • [26] Language-Adversarial Transfer Learning for Low-Resource Speech Recognition
    Yi, Jiangyan
    Tao, Jianhua
    Wen, Zhengqi
    Bai, Ye
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2019, 27 (03) : 621 - 630
  • [27] TDASS: Target Domain Adaptation Speech Synthesis Framework for Multi-speaker Low-Resource TTS
    Zhang, Xulong
    Wang, Jianzong
    Cheng, Ning
    Xiao, Jing
    2022 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2022,
  • [28] Low-Resource Speech Synthesis with Speaker-Aware Embedding
    Yang, Li-Jen
    Yeh, I-Ping
    Chien, Jen-Tzung
    2022 13TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), 2022, : 235 - 239
  • [29] Efficient Adaptation: Enhancing Multilingual Models for Low-Resource Language Translation
    Sel, Ilhami
    Hanbay, Davut
    MATHEMATICS, 2024, 12 (19)
  • [30] LOW-RESOURCE SYSTEM FOR ALL-DIGITAL SPEECH SYNTHESIS
    HERMAN, G
    DUQUET, RT
    JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA, 1977, 61 : S68 - S68