Text-Inductive Graphone-Based Language Adaptation for Low-Resource Speech Synthesis

Cited by: 4

Authors:

Saeki, Takaaki [1]

Maiti, Soumi [2]

Li, Xinjian [2]

Watanabe, Shinji [2]

Takamichi, Shinnosuke [1]

Saruwatari, Hiroshi [1]

Affiliations:

[1] Univ Tokyo, Grad Sch Informat Sci & Technol, Bunkyo Ku, Tokyo 1138656, Japan

[2] Carnegie Mellon Univ, Language Technol Inst, Pittsburgh, PA 15213 USA

Source:

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2024 / Vol. 32

Keywords:

Multilingual text-to-speech; low-resource adaptation; adaptation of masked language model; graphone;

DOI:

10.1109/TASLP.2024.3369537

Chinese Library Classification:

O42 [Acoustics];

Subject classification codes:

070206 ; 082403 ;

Abstract:

Neural text-to-speech (TTS) systems have made significant progress in generating natural synthetic speech. However, neural TTS requires large amounts of paired training data, which limits its applicability to a small number of resource-rich languages. Previous work on low-resource TTS has addressed this data hunger through transfer learning from a multilingual model to low-resource languages, but it still relies heavily on the availability of paired data for the target languages. In this paper, we propose a text-inductive language adaptation framework for low-resource TTS to address the cost of collecting paired data for low-resource languages. To inject textual knowledge during transfer learning, our framework employs a two-stage adaptation scheme that utilizes both text-only and supervised data for the target language. In the text-based adaptation stage, we update the language-aware embedding layer with a masked language model objective using text-only data for the target language. In the supervised adaptation stage, the entire TTS model is updated using paired data for the target language. We also propose a graphone-based multilingual training method that jointly uses graphemes and International Phonetic Alphabet symbols (referred to as graphones) for resource-rich languages, while using only graphemes for low-resource languages. This approach facilitates the transfer of pronunciation knowledge from resource-rich to low-resource languages. Through extensive evaluations, we demonstrate that 1) our framework with text-based adaptation outperforms the previous supervised transfer learning approach and 2) the proposed graphone-based training method further improves the performance of both multilingual TTS and low-resource language adaptation. With only 5 minutes of paired data for fine-tuning, our method achieved highly intelligible synthetic speech, with character error rates of around 6% for a target language.
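The abstract reports intelligibility as a character error rate (CER) but does not say how it was computed. As a rough illustration only, the sketch below uses the common definition: the character-level edit distance between a reference text and a hypothesis transcript, divided by the reference length. The function names `edit_distance` and `character_error_rate` are hypothetical, and the use of an automatic transcript of the synthetic speech as the hypothesis is an assumption, not something stated in this record.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """CER = edit distance / number of reference characters (assumed definition)."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

Under this definition, a CER of around 6% corresponds to roughly 6 character edits per 100 reference characters.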


Pages: 1829-1844

Page count: 16

