Text-Inductive Graphone-Based Language Adaptation for Low-Resource Speech Synthesis

Cited by: 4
Authors
Saeki, Takaaki [1]
Maiti, Soumi [2]
Li, Xinjian [2]
Watanabe, Shinji [2]
Takamichi, Shinnosuke [1]
Saruwatari, Hiroshi [1]
Affiliations
[1] Univ Tokyo, Grad Sch Informat Sci & Technol, Bunkyo Ku, Tokyo 1138656, Japan
[2] Carnegie Mellon Univ, Language Technol Inst, Pittsburgh, PA 15213 USA
Keywords
Multilingual text-to-speech; low-resource adaptation; adaptation of masked language model; graphone
DOI
10.1109/TASLP.2024.3369537
Chinese Library Classification (CLC) number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Neural text-to-speech (TTS) systems have made significant progress in generating natural synthetic speech. However, neural TTS requires large amounts of paired training data, which limits its applicability to a small number of resource-rich languages. Previous work on low-resource TTS has addressed this data hungriness through transfer learning from a multilingual model to low-resource languages, but it still relies heavily on the availability of paired data for the target languages. In this paper, we propose a text-inductive language adaptation framework for low-resource TTS to address the cost of collecting paired data for low-resource languages. To inject textual knowledge during transfer learning, our framework employs a two-stage adaptation scheme that utilizes both text-only and supervised data for the target language. In the text-based adaptation stage, we update the language-aware embedding layer with a masked language model objective using text-only data for the target language. In the supervised adaptation stage, the entire TTS model is updated using paired data for the target language. We also propose a graphone-based multilingual training method that jointly uses graphemes and International Phonetic Alphabet symbols (referred to as graphones) for resource-rich languages, while using only graphemes for low-resource languages. This approach facilitates the transfer of pronunciation knowledge from resource-rich to low-resource languages. Through extensive evaluations, we demonstrate that 1) our framework with text-based adaptation outperforms the previous supervised transfer learning approach and 2) the proposed graphone-based training method further improves the performance of both multilingual TTS and low-resource language adaptation. With only 5 minutes of paired data for fine-tuning, our method achieved highly intelligible synthetic speech, with a character error rate of around 6% for a target language.
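The abstract reports intelligibility as a character error rate (CER). As a rough illustration only, the sketch below computes CER in its standard form: the character-level edit distance between a reference transcript and a recognized transcript, divided by the reference length. The function names are hypothetical, and this record does not describe the paper's actual evaluation pipeline (which speech recognizer was used, or how text was normalized), so the code should not be read as the authors' procedure.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """CER = edit distance / number of reference characters (hypothetical helper)."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

Under this definition, a CER of around 6% corresponds to roughly six character edits per hundred reference characters.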
Pages: 1829-1844
Page count: 16