Text-Inductive Graphone-Based Language Adaptation for Low-Resource Speech Synthesis

Cited by: 4

Authors:

Saeki, Takaaki [1]

Maiti, Soumi [2]

Li, Xinjian [2]

Watanabe, Shinji [2]

Takamichi, Shinnosuke [1]

Saruwatari, Hiroshi [1]

Affiliations:

[1] Univ Tokyo, Grad Sch Informat Sci & Technol, Bunkyo Ku, Tokyo 1138656, Japan

[2] Carnegie Mellon Univ, Language Technol Inst, Pittsburgh, PA 15213 USA

Source:

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2024 / Vol. 32

Keywords:

Multilingual text-to-speech; low-resource adaptation; adaptation of masked language model; graphone;

DOI:

10.1109/TASLP.2024.3369537

Chinese Library Classification:

O42 [Acoustics];

Discipline classification codes:

070206 ; 082403 ;

Abstract:

Neural text-to-speech (TTS) systems have made significant progress in generating natural synthetic speech. However, neural TTS requires large amounts of paired training data, which limits its applicability to a small number of resource-rich languages. Previous work on low-resource TTS has addressed this data hunger through transfer learning from a multilingual model to low-resource languages, but it still relies heavily on the availability of paired data for the target languages. In this paper, we propose a text-inductive language adaptation framework for low-resource TTS to address the cost of collecting paired data for low-resource languages. To inject textual knowledge during transfer learning, our framework employs a two-stage adaptation scheme that utilizes both text-only and supervised data for the target language. In the text-based adaptation stage, we update the language-aware embedding layer with a masked language model objective using text-only data for the target language. In the supervised adaptation stage, the entire TTS model is updated using paired data for the target language. We also propose a graphone-based multilingual training method that jointly uses graphemes and International Phonetic Alphabet symbols (referred to as graphones) for resource-rich languages, while using only graphemes for low-resource languages. This approach facilitates the transfer of pronunciation knowledge from resource-rich to low-resource languages. Through extensive evaluations, we demonstrate that 1) our framework with text-based adaptation outperforms the previous supervised transfer learning approach and 2) the proposed graphone-based training method further improves the performance of both multilingual TTS and low-resource language adaptation. With only 5 minutes of paired data for fine-tuning, our method achieved highly intelligible synthetic speech, with character error rates of around 6% for a target language.
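The abstract reports intelligibility as a character error rate (CER). As a rough illustration of that metric only, the sketch below computes CER as the Levenshtein edit distance between a reference transcript and a recognized transcript, divided by the reference length. The function names `edit_distance` and `character_error_rate` are hypothetical, and this record does not describe the paper's actual evaluation pipeline (the recognizer used and any text normalization), so this should be read as a generic definition rather than the authors' procedure.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty string
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """CER = edit distance / number of reference characters."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

Under this definition, a CER of around 6% corresponds to roughly 6 character edits per 100 reference characters.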


Pages: 1829-1844

Page count: 16
