Text-to-Speech for Low-Resource Agglutinative Language With Morphology-Aware Language Model Pre-Training

Cited by: 8
Authors
Liu, Rui [1 ]
Hu, Yifan [1 ]
Zuo, Haolin [1 ]
Luo, Zhaojie [2 ]
Wang, Longbiao [3 ]
Gao, Guanglai [1 ]
Affiliations
[1] Inner Mongolia Univ, Dept Comp Sci, Hohhot 010021, Peoples R China
[2] Osaka Univ, SANKEN, Osaka 5670047, Japan
[3] Tianjin Univ, Coll Intelligence & Comp, Tianjin Key Lab Cognit Comp & Applicat, Tianjin 300072, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Text-to-speech (TTS); agglutinative; morphology; language modeling; pre-training; END;
DOI
10.1109/TASLP.2023.3348762
CLC Number
O42 [Acoustics];
Subject Classification Code
070206; 082403;
Abstract
Text-to-Speech (TTS) aims to convert input text into a human-like voice. With the development of deep learning, encoder-decoder based TTS models achieve superior naturalness in mainstream languages such as Chinese and English, and the linguistic-information learning capability of the text encoder is key to this performance. However, for TTS of low-resource agglutinative languages, the scale of the <text, speech> paired data is limited. How to extract rich linguistic information from small-scale text data to enhance the naturalness of the synthesized speech is therefore an urgent issue. In this paper, we first collect a large-scale unsupervised text corpus for BERT-like language model pre-training, and then use the trained language model to extract deep linguistic information from the input text of the TTS model to improve the naturalness of the final synthesized speech. To fully exploit the prosody-related linguistic information in agglutinative languages, we incorporate morphological information into the language model training and construct a morphology-aware masking based BERT model (MAM-BERT). Experimental results with various advanced TTS models validate the effectiveness of our approach, and a further comparison across data scales also validates its effectiveness in low-resource scenarios.
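The abstract describes morphology-aware masking only at a high level. Below is a minimal, purely illustrative Python sketch of what such masking could look like for masked-language-model pre-training, assuming the input text has already been segmented into morphemes (stems and suffixes). The function name, masking rate, and toy segmentation are hypothetical and are not taken from the authors' implementation.

```python
# Illustrative sketch: morphology-aware masking for BERT-style MLM pre-training.
# Instead of masking random subword pieces, whole morphemes (stem or suffix units)
# are masked as single units, so the model must reconstruct linguistically
# meaningful spans. All names and values below are assumptions for illustration.
import random

MASK = "[MASK]"

def morphology_aware_mask(words, mask_prob=0.15, seed=0):
    """words: morpheme-segmented words, e.g. [["stem", "+suf1", "+suf2"], ...].
    Returns (input_tokens, labels); masked positions carry the original morpheme
    in labels, unmasked positions carry None (in a real pipeline these would be
    vocabulary ids with an ignore index for the MLM loss)."""
    rng = random.Random(seed)
    input_tokens, labels = [], []
    for morphemes in words:
        for m in morphemes:
            if rng.random() < mask_prob:
                input_tokens.append(MASK)   # mask the whole morpheme at once
                labels.append(m)            # target: reconstruct the morpheme
            else:
                input_tokens.append(m)
                labels.append(None)         # position ignored by the MLM loss
    return input_tokens, labels

if __name__ == "__main__":
    # Toy agglutinative-style sentence: stems followed by suffix morphemes.
    sentence = [["stemA", "+case", "+plur"], ["stemB"], ["stemC", "+tense"]]
    tokens, labels = morphology_aware_mask(sentence, mask_prob=0.3)
    print(tokens)
    print(labels)
```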
Pages: 1075 - 1087
Number of Pages: 13
Related Papers
50 records in total
  • [31] Improved low-resource Somali speech recognition by semi-supervised acoustic and language model training
    Biswas, Astik
    Menon, Raghav
    van der Westhuizen, Ewald
    Niesler, Thomas
    INTERSPEECH 2019, 2019, : 3008 - 3012
  • [32] Unified Language Model Pre-training for Natural Language Understanding and Generation
    Dong, Li
    Yang, Nan
    Wang, Wenhui
    Wei, Furu
    Liu, Xiaodong
    Wang, Yu
    Gao, Jianfeng
    Zhou, Ming
    Hon, Hsiao-Wuen
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 32 (NIPS 2019), 2019, 32
  • [33] SPEECH-LANGUAGE PRE-TRAINING FOR END-TO-END SPOKEN LANGUAGE UNDERSTANDING
    Qian, Yao
    Bian, Ximo
    Shi, Yu
    Kanda, Naoyuki
    Shen, Leo
    Xiao, Zhen
    Zeng, Michael
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 7458 - 7462
  • [34] Optimization for Low-Resource Speaker Adaptation in End-to-End Text-to-Speech
    Hong, Changi
    Lee, Jung Hyuk
    Jeon, Moongu
    Kim, Hong Kook
    2024 IEEE 21ST CONSUMER COMMUNICATIONS & NETWORKING CONFERENCE, CCNC, 2024, : 1060 - 1061
  • [35] MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model
    Ji, Yatai
    Wang, Junjie
    Gong, Yuan
    Zhang, Lin
    Zhu, Yanru
    Wang, Hongfa
    Zhang, Jiaxing
    Sakai, Tetsuya
    Yang, Yujiu
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 23262 - 23271
  • [36] Unsupervised Learning For Sequence-to-sequence Text-to-speech For Low-resource Languages
    Zhang, Haitong
    Lin, Yue
    INTERSPEECH 2020, 2020, : 3161 - 3165
  • [37] Code-Mixed Text-to-Speech Synthesis Under Low-Resource Constraints
    Joshi, Raviraj
    Garera, Nikesh
    SPEECH AND COMPUTER, SPECOM 2023, PT II, 2023, 14339 : 151 - 163
  • [38] Self-Supervised Audio-and-Text Pre-training with Extremely Low-Resource Parallel Data
    Kang, Yu
    Liu, Tianqiao
    Li, Hang
    Hao, Yang
    Ding, Wenbiao
    THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / TWELFTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 10875 - 10883
  • [39] SignBERT: Pre-Training of Hand-Model-Aware Representation for Sign Language Recognition
    Hu, Hezhen
    Zhao, Weichao
    Zhou, Wengang
    Wang, Yuechen
    Li, Houqiang
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 11067 - 11076
  • [40] Low-data? No problem: low-resource, language-agnostic conversational text-to-speech via F0-conditioned data augmentation
    Comini, Giulia
    Huybrechts, Goeric
    Ribeiro, Manuel Sam
    Gabrys, Adam
    Lorenzo-Trueba, Jaime
    INTERSPEECH 2022, 2022, : 1946 - 1950