Text-to-Speech for Low-Resource Agglutinative Language With Morphology-Aware Language Model Pre-Training

Cited by: 8
Authors
Liu, Rui [1 ]
Hu, Yifan [1 ]
Zuo, Haolin [1 ]
Luo, Zhaojie [2 ]
Wang, Longbiao [3 ]
Gao, Guanglai [1 ]
Affiliations
[1] Inner Mongolia Univ, Dept Comp Sci, Hohhot 010021, Peoples R China
[2] Osaka Univ, SANKEN, Osaka 5670047, Japan
[3] Tianjin Univ, Coll Intelligence & Comp, Tianjin Key Lab Cognit Comp & Applicat, Tianjin 300072, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Text-to-speech (TTS); agglutinative; morphology; language modeling; pre-training; END;
DOI
10.1109/TASLP.2023.3348762
CLC Number
O42 [Acoustics];
Subject Classification Code
070206; 082403;
Abstract
Text-to-Speech (TTS) aims to convert input text into a human-like voice. With the development of deep learning, encoder-decoder based TTS models achieve superior naturalness in mainstream languages such as Chinese and English, and the linguistic-information learning capability of the text encoder is key to this performance. However, for TTS of low-resource agglutinative languages, the scale of the <text, speech> paired data is limited. How to extract rich linguistic information from small-scale text data to enhance the naturalness of the synthesized speech is therefore an urgent issue. In this paper, we first collect a large-scale unsupervised text corpus for BERT-like language model pre-training, and then use the trained language model to extract deep linguistic information from the input text of the TTS model to improve the naturalness of the final synthesized speech. To fully exploit the prosody-related linguistic information in agglutinative languages, we incorporate morphological information into the language model training and construct a morphology-aware masking based BERT model (MAM-BERT). Experimental results with various advanced TTS models validate the effectiveness of our approach, and a further comparison across data scales also validates its effectiveness in low-resource scenarios.
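The abstract describes morphology-aware masking only at a high level. Below is a minimal, purely illustrative Python sketch of what such masking could look like for masked-language-model pre-training, assuming the input text has already been segmented into morphemes (stems and suffixes). The function name, masking rate, and toy segmentation are hypothetical and are not taken from the authors' implementation.

```python
# Illustrative sketch: morphology-aware masking for BERT-style MLM pre-training.
# Instead of masking random subword pieces, whole morphemes (stem or suffix units)
# are masked as single units, so the model must reconstruct linguistically
# meaningful spans. All names and values below are assumptions for illustration.
import random

MASK = "[MASK]"

def morphology_aware_mask(words, mask_prob=0.15, seed=0):
    """words: morpheme-segmented words, e.g. [["stem", "+suf1", "+suf2"], ...].
    Returns (input_tokens, labels); masked positions carry the original morpheme
    in labels, unmasked positions carry None (in a real pipeline these would be
    vocabulary ids with an ignore index for the MLM loss)."""
    rng = random.Random(seed)
    input_tokens, labels = [], []
    for morphemes in words:
        for m in morphemes:
            if rng.random() < mask_prob:
                input_tokens.append(MASK)   # mask the whole morpheme at once
                labels.append(m)            # target: reconstruct the morpheme
            else:
                input_tokens.append(m)
                labels.append(None)         # position ignored by the MLM loss
    return input_tokens, labels

if __name__ == "__main__":
    # Toy agglutinative-style sentence: stems followed by suffix morphemes.
    sentence = [["stemA", "+case", "+plur"], ["stemB"], ["stemC", "+tense"]]
    tokens, labels = morphology_aware_mask(sentence, mask_prob=0.3)
    print(tokens)
    print(labels)
```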
Pages: 1075 - 1087
Number of Pages: 13
Related Papers
50 records in total
  • [31] Improved low-resource Somali speech recognition by semi-supervised acoustic and language model training
    Biswas, Astik
    Menon, Raghav
    van der Westhuizen, Ewald
    Niesler, Thomas
    INTERSPEECH 2019, 2019, : 3008 - 3012
  • [32] Unified Language Model Pre-training for Natural Language Understanding and Generation
    Dong, Li
    Yang, Nan
    Wang, Wenhui
    Wei, Furu
    Liu, Xiaodong
    Wang, Yu
    Gao, Jianfeng
    Zhou, Ming
    Hon, Hsiao-Wuen
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 32 (NIPS 2019), 2019, 32
  • [33] SPEECH-LANGUAGE PRE-TRAINING FOR END-TO-END SPOKEN LANGUAGE UNDERSTANDING
    Qian, Yao
    Bian, Ximo
    Shi, Yu
    Kanda, Naoyuki
    Shen, Leo
    Xiao, Zhen
    Zeng, Michael
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 7458 - 7462
  • [34] Optimization for Low-Resource Speaker Adaptation in End-to-End Text-to-Speech
    Hong, Changi
    Lee, Jung Hyuk
    Jeon, Moongu
    Kim, Hong Kook
    2024 IEEE 21ST CONSUMER COMMUNICATIONS & NETWORKING CONFERENCE, CCNC, 2024, : 1060 - 1061
  • [35] MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model
    Ji, Yatai
    Wang, Junjie
    Gong, Yuan
    Zhang, Lin
    Zhu, Yanru
    Wang, Hongfa
    Zhang, Jiaxing
    Sakai, Tetsuya
    Yang, Yujiu
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 23262 - 23271
  • [36] Unsupervised Learning For Sequence-to-sequence Text-to-speech For Low-resource Languages
    Zhang, Haitong
    Lin, Yue
    INTERSPEECH 2020, 2020, : 3161 - 3165
  • [37] Code-Mixed Text-to-Speech Synthesis Under Low-Resource Constraints
    Joshi, Raviraj
    Garera, Nikesh
    SPEECH AND COMPUTER, SPECOM 2023, PT II, 2023, 14339 : 151 - 163
  • [38] Self-Supervised Audio-and-Text Pre-training with Extremely Low-Resource Parallel Data
    Kang, Yu
    Liu, Tianqiao
    Li, Hang
    Hao, Yang
    Ding, Wenbiao
    THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / TWELFTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 10875 - 10883
  • [39] SignBERT: Pre-Training of Hand-Model-Aware Representation for Sign Language Recognition
    Hu, Hezhen
    Zhao, Weichao
    Zhou, Wengang
    Wang, Yuechen
    Li, Houqiang
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 11067 - 11076
  • [40] Low-data? No problem: low-resource, language-agnostic conversational text-to-speech via F0-conditioned data augmentation
    Comini, Giulia
    Huybrechts, Goeric
    Ribeiro, Manuel Sam
    Gabrys, Adam
    Lorenzo-Trueba, Jaime
    INTERSPEECH 2022, 2022, : 1946 - 1950