Text-Inductive Graphone-Based Language Adaptation for Low-Resource Speech Synthesis

Cited by: 4

Authors:

Saeki, Takaaki [1]

Maiti, Soumi [2]

Li, Xinjian [2]

Watanabe, Shinji [2]

Takamichi, Shinnosuke [1]

Saruwatari, Hiroshi [1]

Affiliations:

[1] Univ Tokyo, Grad Sch Informat Sci & Technol, Bunkyo Ku, Tokyo 1138656, Japan

[2] Carnegie Mellon Univ, Language Technol Inst, Pittsburgh, PA 15213 USA

Source:

IEEE/ACM Transactions on Audio, Speech, and Language Processing | 2024 / Vol. 32

Keywords:

Multilingual text-to-speech; low-resource adaptation; adaptation of masked language model; graphone

DOI:

10.1109/TASLP.2024.3369537

Chinese Library Classification (CLC) number:

O42 [Acoustics]

Discipline classification codes:

070206; 082403

Abstract:

Neural text-to-speech (TTS) systems have made significant progress in generating natural synthetic speech. However, neural TTS requires large amounts of paired training data, which limits its applicability to a small number of resource-rich languages. Previous work on low-resource TTS has addressed this data requirement through transfer learning from a multilingual model to low-resource languages, but it still relies heavily on the availability of paired data for the target languages. In this paper, we propose a text-inductive language adaptation framework for low-resource TTS to address the cost of collecting paired data for low-resource languages. To inject textual knowledge during transfer learning, our framework employs a two-stage adaptation scheme that utilizes both text-only and supervised data for the target language. In the text-based adaptation stage, we update the language-aware embedding layer with a masked language model objective using text-only data for the target language. In the supervised adaptation stage, the entire TTS model is updated using paired data for the target language. We also propose a graphone-based multilingual training method that jointly uses graphemes and International Phonetic Alphabet symbols (referred to as graphones) for resource-rich languages, while using only graphemes for low-resource languages. This approach facilitates the transfer of pronunciation knowledge from resource-rich to low-resource languages. Through extensive evaluations, we demonstrate that 1) our framework with text-based adaptation outperforms the previous supervised transfer learning approach and 2) the proposed graphone-based training method further improves the performance of both multilingual TTS and low-resource language adaptation. With only 5 minutes of paired data for fine-tuning, our method achieved highly intelligible synthetic speech, with character error rates of around 6% for a target language.
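The abstract reports intelligibility as a character error rate (CER). As a minimal sketch of how that metric is commonly computed, the Python below divides the Levenshtein edit distance between a reference text and a recognized transcript by the reference length. The function names `edit_distance` and `character_error_rate` are hypothetical, and the record does not describe the authors' actual evaluation pipeline (the recognizer used, text normalization, and so on), so this illustrates only the standard definition.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum number of character insertions,
    deletions, and substitutions needed to turn ref into hyp."""
    # prev[j] holds the distance between the first i-1 chars of ref
    # and the first j chars of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty string
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a character from ref
                curr[j - 1] + 1,     # insert a character into ref
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """CER = edit distance / number of reference characters."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Illustrative strings only, not data from the paper.
    reference = "hello world"
    transcript = "hallo world"
    print(f"CER = {character_error_rate(reference, transcript):.3f}")
```

Under this definition, a CER of around 6% corresponds to roughly six character edits per hundred reference characters.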


Pages: 1829-1844

Number of pages: 16

