Text-to-Speech for Low-Resource Agglutinative Language With Morphology-Aware Language Model Pre-Training

Cited by: 8
Authors
Liu, Rui [1 ]
Hu, Yifan [1 ]
Zuo, Haolin [1 ]
Luo, Zhaojie [2 ]
Wang, Longbiao [3 ]
Gao, Guanglai [1 ]
Affiliations
[1] Inner Mongolia Univ, Dept Comp Sci, Hohhot 010021, Peoples R China
[2] Osaka Univ, SANKEN, Osaka 5670047, Japan
[3] Tianjin Univ, Coll Intelligence & Comp, Tianjin Key Lab Cognit Comp & Applicat, Tianjin 300072, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Text-to-speech (TTS); agglutinative; morphology; language modeling; pre-training; END;
DOI
10.1109/TASLP.2023.3348762
Chinese Library Classification
O42 [Acoustics];
Subject Classification Codes
070206; 082403;
Abstract
Text-to-Speech (TTS) aims to convert input text into a human-like voice. With the development of deep learning, encoder-decoder-based TTS models achieve superior naturalness in mainstream languages such as Chinese and English, and the linguistic-information learning capability of the text encoder is key to this. For TTS of low-resource agglutinative languages, however, the amount of paired <text, speech> data is limited. How to extract rich linguistic information from small-scale text data to enhance the naturalness of the synthesized speech is therefore an urgent issue. In this paper, we first collect a large unsupervised text corpus for BERT-like language model pre-training, and then use the trained language model to extract deep linguistic information from the input text of the TTS model to improve the naturalness of the synthesized speech. To fully exploit the prosody-related linguistic information in agglutinative languages, we incorporate morphological information into the language model training and construct a morphology-aware-masking-based BERT model (MAM-BERT). Experimental results with various advanced TTS models validate the effectiveness of our approach, and a further comparison across data scales confirms its effectiveness in low-resource scenarios.
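The record does not reproduce the paper's implementation details, so the exact masking procedure of MAM-BERT is not shown here. As a rough, hedged illustration only, the Python sketch below assumes morphology-aware masking behaves like whole-word masking applied at the morpheme level: a selected stem or suffix is hidden in its entirety rather than as independent subword pieces. The tokenizer output, morpheme spans, and 15% mask rate are assumptions for the sketch, not details taken from the paper.

import random
from typing import List

MASK_TOKEN = "[MASK]"
MASK_RATE = 0.15  # assumed rate, following common BERT practice


def morphology_aware_mask(
    subword_tokens: List[str],
    morpheme_spans: List[List[int]],
    mask_rate: float = MASK_RATE,
    seed: int = 0,
) -> List[str]:
    """Mask whole morphemes instead of independent subword pieces.

    subword_tokens: BERT-style subword tokens for one sentence.
    morpheme_spans: groups of token indices, one group per morpheme
                    (stem or suffix) produced by a morphological analyzer
                    (hypothetical here).
    """
    rng = random.Random(seed)
    masked = list(subword_tokens)
    for span in morpheme_spans:
        # Decide per morpheme, so a selected stem or suffix is hidden
        # entirely and the model must recover it from sentence context.
        if rng.random() < mask_rate:
            for idx in span:
                masked[idx] = MASK_TOKEN
    return masked


if __name__ == "__main__":
    # Toy agglutinative word split into subword pieces by a
    # hypothetical tokenizer; spans group the pieces by morpheme.
    tokens = ["ger", "##tee", "##gen", "irek", "##sen"]
    spans = [[0], [1, 2], [3, 4]]
    print(morphology_aware_mask(tokens, spans, mask_rate=0.5, seed=3))

In this reading, the design choice is that masking whole morphemes forces the model to predict prosodically meaningful units (stems and suffixes) from context, which is what the abstract refers to as exploiting prosody-related linguistic information.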
Pages: 1075 - 1087
Number of pages: 13
Related Papers
50 records in total
  • [1] Text-to-speech for low-resource systems
    Schnell, M
    Küstner, M
    Jokisch, O
    Hoffmann, R
    PROCEEDINGS OF THE 2002 IEEE WORKSHOP ON MULTIMEDIA SIGNAL PROCESSING, 2002, : 259 - 262
  • [2] Pre-training on High-Resource Speech Recognition Improves Low-Resource Speech-to-Text Translation
    Bansal, Sameer
    Kamper, Herman
    Livescu, Karen
    Lopez, Adam
    Goldwater, Sharon
    2019 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL HLT 2019), VOL. 1, 2019, : 58 - 68
  • [3] KinyaBERT: a Morphology-aware Kinyarwanda Language Model
    Nzeyimana, Antoine
    Rubungo, Andre Niyongabo
    PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), VOL 1: (LONG PAPERS), 2022, : 5347 - 5363
  • [4] Joint Learning Model for Low-Resource Agglutinative Language Morphological Tagging
    Abudouwaili, Gulinigeer
    Abiderexiti, Kahaerjiang
    Yi, Nian
    Wumaier, Aishan
    PROCEEDINGS OF THE ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, 2023, : 27 - 37
  • [5] Language-Agnostic Meta-Learning for Low-Resource Text-to-Speech with Articulatory Features
    Lux, Florian
    Vu, Ngoc Thang
    PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), VOL 1: (LONG PAPERS), 2022, : 6858 - 6868
  • [6] ASR-Generated Text for Language Model Pre-training Applied to Speech Tasks
    Pelloin, Valentin
    Dary, Franck
    Herve, Nicolas
    Favre, Benoit
    Camelin, Nathalie
    Laurent, Antoine
    Besacier, Laurent
    INTERSPEECH 2022, 2022, : 3453 - 3457
  • [7] Does Masked Language Model Pre-training with Artificial Data Improve Low-resource Neural Machine Translation?
    Tamura, Hiroto
    Hirasawa, Tosho
    Kim, Hwichan
    Komachi, Mamoru
    17TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EACL 2023, 2023, : 2216 - 2225
  • [8] AgglutiFiT: Efficient Low-Resource Agglutinative Language Model Fine-Tuning
    Li, Zhe
    Li, Xiuhong
    Sheng, Jiabao
    Slamu, Wushour
    IEEE ACCESS, 2020, 8 : 148489 - 148499
  • [9] Continual Mixed-Language Pre-Training for Extremely Low-Resource Neural Machine Translation
    Liu, Zihan
    Winata, Genta Indra
    Fung, Pascale
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL-IJCNLP 2021, 2021, : 2706 - 2718
  • [10] DISTRIBUTION AUGMENTATION FOR LOW-RESOURCE EXPRESSIVE TEXT-TO-SPEECH
    Lajszczak, Mateusz
    Prasad, Animesh
    van Korlaar, Arent
    Bollepalli, Bajibabu
    Bonafonte, Antonio
    Joly, Arnaud
    Nicolis, Marco
    Moinet, Alexis
    Drugman, Thomas
    Wood, Trevor
    Sokolova, Elena
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 8307 - 8311