Text-Inductive Graphone-Based Language Adaptation for Low-Resource Speech Synthesis

Cited by: 4
Authors
Saeki, Takaaki [1]
Maiti, Soumi [2]
Li, Xinjian [2]
Watanabe, Shinji [2]
Takamichi, Shinnosuke [1]
Saruwatari, Hiroshi [1]
Affiliations
[1] Univ Tokyo, Grad Sch Informat Sci & Technol, Bunkyo Ku, Tokyo 1138656, Japan
[2] Carnegie Mellon Univ, Language Technol Inst, Pittsburgh, PA 15213 USA
Keywords
Multilingual text-to-speech; low-resource adaptation; adaptation of masked language model; graphone
DOI
10.1109/TASLP.2024.3369537
Chinese Library Classification number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Neural text-to-speech (TTS) systems have made significant progress in generating natural synthetic speech. However, neural TTS requires large amounts of paired training data, which limits its applicability to a small number of resource-rich languages. Previous work on low-resource TTS has addressed this data hunger through transfer learning from a multilingual model to low-resource languages, but it still relies heavily on the availability of paired data for the target languages. In this paper, we propose a text-inductive language adaptation framework for low-resource TTS to address the cost of collecting paired data for low-resource languages. To inject textual knowledge during transfer learning, our framework employs a two-stage adaptation scheme that utilizes both text-only and supervised data for the target language. In the text-based adaptation stage, we update the language-aware embedding layer with a masked language model objective using text-only data for the target language. In the supervised adaptation stage, the entire TTS model is updated using paired data for the target language. We also propose a graphone-based multilingual training method that jointly uses graphemes and International Phonetic Alphabet symbols (referred to as graphones) for resource-rich languages, while using only graphemes for low-resource languages. This approach facilitates the transfer of pronunciation knowledge from resource-rich to low-resource languages. Through extensive evaluations, we demonstrate that (1) our framework with text-based adaptation outperforms the previous supervised transfer learning approach and (2) the proposed graphone-based training method further improves the performance of both multilingual TTS and low-resource language adaptation. With only 5 minutes of paired data for fine-tuning, our method achieved highly intelligible synthetic speech, with character error rates of around 6% for a target language.
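As a rough illustration of the two ideas in the abstract, the sketch below builds a graphone-style token sequence (graphemes plus IPA symbols when a transcription exists, graphemes only otherwise) and then masks tokens in the way a masked language model objective would need for the text-based adaptation stage. Everything here is an assumption made for illustration: the function names build_input_tokens and mask_tokens, the language-tag and separator tokens, the concatenation layout, and the 15% masking rate are hypothetical and are not taken from the paper.

```python
import random
from typing import List, Optional, Tuple

MASK = "<mask>"  # hypothetical mask symbol
SEP = "<sep>"    # hypothetical separator between graphemes and IPA symbols


def build_input_tokens(
    text: str, lang: str, ipa: Optional[List[str]] = None
) -> List[str]:
    """Return a language tag, the graphemes, and IPA symbols if provided.

    Resource-rich languages would pass an IPA transcription; low-resource
    languages would pass ipa=None and get a grapheme-only sequence.
    """
    tokens = [f"<{lang}>"] + list(text)
    if ipa is not None:
        tokens += [SEP] + list(ipa)
    return tokens


def mask_tokens(
    tokens: List[str],
    mask_prob: float = 0.15,
    rng: Optional[random.Random] = None,
) -> Tuple[List[str], List[Optional[str]]]:
    """Randomly replace tokens with MASK and record the prediction targets.

    The language tag (position 0) and the separator are never masked.
    targets[i] holds the original token where masking happened, else None.
    """
    if rng is None:
        rng = random.Random(0)
    masked: List[str] = []
    targets: List[Optional[str]] = []
    for i, tok in enumerate(tokens):
        if i > 0 and tok != SEP and rng.random() < mask_prob:
            masked.append(MASK)
            targets.append(tok)
        else:
            masked.append(tok)
            targets.append(None)
    return masked, targets


# Resource-rich language: graphemes and IPA symbols together.
rich = build_input_tokens("hi", "en", ipa=["h", "aɪ"])
# Low-resource language: graphemes only (the tag "xx" is a placeholder).
low = build_input_tokens("hi", "xx")
masked_low, targets_low = mask_tokens(low, mask_prob=0.5)
```

In a real system the masked sequence would be fed to the language-aware embedding layer and the model trained to recover the target tokens; only that layer would be updated in the text-based stage, and the full TTS model in the supervised stage, as the abstract describes.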
Pages: 1829-1844
Page count: 16