Text-Inductive Graphone-Based Language Adaptation for Low-Resource Speech Synthesis

Cited by: 4
Authors
Saeki, Takaaki [1]
Maiti, Soumi [2]
Li, Xinjian [2]
Watanabe, Shinji [2]
Takamichi, Shinnosuke [1]
Saruwatari, Hiroshi [1]
Affiliations
[1] Univ Tokyo, Grad Sch Informat Sci & Technol, Bunkyo Ku, Tokyo 1138656, Japan
[2] Carnegie Mellon Univ, Language Technol Inst, Pittsburgh, PA 15213 USA
Keywords
Multilingual text-to-speech; low-resource adaptation; adaptation of masked language model; graphone
DOI
10.1109/TASLP.2024.3369537
Chinese Library Classification number
O42 [Acoustics]
Discipline classification code
070206; 082403
Abstract
Neural text-to-speech (TTS) systems have made significant progress in generating natural synthetic speech. However, neural TTS requires large amounts of paired training data, which limits its applicability to a small number of resource-rich languages. Previous work on low-resource TTS has addressed this data hunger through transfer learning from a multilingual model to low-resource languages, but it still relies heavily on the availability of paired data for the target languages. In this paper, we propose a text-inductive language adaptation framework for low-resource TTS to address the cost of collecting paired data for low-resource languages. To inject textual knowledge during transfer learning, our framework employs a two-stage adaptation scheme that utilizes both text-only and supervised data for the target language. In the text-based adaptation stage, we update the language-aware embedding layer with a masked language model objective using text-only data for the target language. In the supervised adaptation stage, the entire TTS model is updated using paired data for the target language. We also propose a graphone-based multilingual training method that jointly uses graphemes and International Phonetic Alphabet symbols (referred to as graphones) for resource-rich languages, while using only graphemes for low-resource languages. This approach facilitates the transfer of pronunciation knowledge from resource-rich to low-resource languages. Through extensive evaluations, we demonstrate that 1) our framework with text-based adaptation outperforms the previous supervised transfer learning approach and 2) the proposed graphone-based training method further improves the performance of both multilingual TTS and low-resource language adaptation. With only 5 minutes of paired data for fine-tuning, our method achieved highly intelligible synthetic speech, with a character error rate of around 6% for a target language.
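The input handling described in the abstract can be illustrated with a minimal sketch. It shows one possible way to build a graphone input for a resource-rich language (graphemes plus IPA symbols) and a grapheme-only input for a low-resource language, and then to mask tokens for a masked language model objective. The function names `build_input` and `mask_tokens`, the `<sep>` and `<mask>` tokens, the concatenation layout, the masking probability, and the example strings are all assumptions made for illustration; the abstract does not specify how the paper combines or masks the symbols.

```python
import random

MASK = "<mask>"  # hypothetical mask token
SEP = "<sep>"    # hypothetical separator between graphemes and IPA symbols


def build_input(graphemes, ipa=None):
    """Return the token sequence for one utterance.

    Resource-rich languages pass both graphemes and IPA symbols (a graphone
    input); low-resource languages pass graphemes only. Concatenating the two
    streams around a separator is an assumption, not the paper's layout.
    """
    tokens = list(graphemes)
    if ipa is not None:
        tokens += [SEP] + list(ipa)
    return tokens


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly replace tokens with MASK and return (masked, targets).

    targets[i] holds the original token wherever masking happened and None
    elsewhere, so a model could be trained to predict only masked positions.
    """
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if tok != SEP and rng.random() < mask_prob:
            masked.append(MASK)
            targets.append(tok)
        else:
            masked.append(tok)
            targets.append(None)
    return masked, targets


# Resource-rich language: graphemes and IPA symbols together.
rich = build_input("hello", ipa="həloʊ")
# Low-resource language: graphemes only, as used with text-only data.
low = build_input("habari")
masked, targets = mask_tokens(low, mask_prob=0.3, seed=1)
print(rich)
print(masked, targets)
```

In a real system the masked sequence would feed an embedding layer trained to recover the targets; that model is omitted here because its architecture is not described in this record.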
Pages: 1829-1844
Number of pages: 16