Text-Inductive Graphone-Based Language Adaptation for Low-Resource Speech Synthesis

Cited by: 4

Authors:

Saeki, Takaaki [1]

Maiti, Soumi [2]

Li, Xinjian [2]

Watanabe, Shinji [2]

Takamichi, Shinnosuke [1]

Saruwatari, Hiroshi [1]

Affiliations:

[1] Univ Tokyo, Grad Sch Informat Sci & Technol, Bunkyo Ku, Tokyo 1138656, Japan

[2] Carnegie Mellon Univ, Language Technol Inst, Pittsburgh, PA 15213 USA

Source:

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2024 / Vol. 32

Keywords:

Multilingual text-to-speech; low-resource adaptation; adaptation of masked language model; graphone

DOI:

10.1109/TASLP.2024.3369537

Chinese Library Classification (CLC) number:

O42 [Acoustics]

Discipline classification codes:

070206; 082403

Abstract:

Neural text-to-speech (TTS) systems have made significant progress in generating natural synthetic speech. However, neural TTS requires large amounts of paired training data, which limits its applicability to a small number of resource-rich languages. Previous work on low-resource TTS has addressed this data requirement through transfer learning from a multilingual model to low-resource languages, but it still relies heavily on the availability of paired data for the target languages. In this paper, we propose a text-inductive language adaptation framework for low-resource TTS to address the cost of collecting paired data for low-resource languages. To inject textual knowledge during transfer learning, our framework employs a two-stage adaptation scheme that utilizes both text-only and supervised data for the target language. In the text-based adaptation stage, we update the language-aware embedding layer with a masked language model objective using text-only data for the target language. In the supervised adaptation stage, the entire TTS model is updated using paired data for the target language. We also propose a graphone-based multilingual training method that jointly uses graphemes and International Phonetic Alphabet symbols (referred to as graphones) for resource-rich languages, while using only graphemes for low-resource languages. This approach facilitates the transfer of pronunciation knowledge from resource-rich to low-resource languages. Through extensive evaluations, we demonstrate that 1) our framework with text-based adaptation outperforms the previous supervised transfer learning approach and 2) the proposed graphone-based training method further improves the performance of both multilingual TTS and low-resource language adaptation. With only 5 minutes of paired data for fine-tuning, our method achieved highly intelligible synthetic speech, with character error rates of around 6% for a target language.
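
The abstract reports intelligibility as a character error rate (CER) but does not say how it was computed. As a minimal sketch, assuming the standard definition (character-level edit distance between a reference text and a transcript of the synthetic speech, divided by the reference length), the metric could be computed as follows. The function names `edit_distance` and `character_error_rate` are hypothetical and are not taken from the paper.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty string
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """CER under the assumed definition: edit distance / reference length."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Illustrative strings only; these are not data from the paper.
    print(character_error_rate("speech synthesis", "speech sinthesis"))
```

Under this definition, a CER of around 6% corresponds to roughly 6 character edits per 100 reference characters.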

Pages: 1829-1844

Number of pages: 16
