Text-to-Speech for Low-Resource Agglutinative Language With Morphology-Aware Language Model Pre-Training

Citations: 8
Authors
Liu, Rui [1 ]
Hu, Yifan [1 ]
Zuo, Haolin [1 ]
Luo, Zhaojie [2 ]
Wang, Longbiao [3 ]
Gao, Guanglai [1 ]
Affiliations
[1] Inner Mongolia Univ, Dept Comp Sci, Hohhot 010021, Peoples R China
[2] Osaka Univ, SANKEN, Osaka 5670047, Japan
[3] Tianjin Univ, Coll Intelligence & Comp, Tianjin Key Lab Cognit Comp & Applicat, Tianjin 300072, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Text-to-speech (TTS); agglutinative; morphology; language modeling; pre-training;
DOI
10.1109/TASLP.2023.3348762
Chinese Library Classification (CLC)
O42 [Acoustics];
Subject Classification Codes
070206; 082403;
Abstract
Text-to-Speech (TTS) aims to convert input text into a human-like voice. With the development of deep learning, encoder-decoder based TTS models achieve superior naturalness in mainstream languages such as Chinese and English, and the linguistic-information learning capability of the text encoder is key to this success. However, for TTS in low-resource agglutinative languages, the scale of paired <text, speech> data is limited. How to extract rich linguistic information from small-scale text data to enhance the naturalness of the synthesized speech is therefore an urgent issue. In this paper, we first collect a large unsupervised text corpus for BERT-like language model pre-training, and then use the pre-trained language model to extract deep linguistic information from the input text of the TTS model, improving the naturalness of the final synthesized speech. To fully exploit the prosody-related linguistic information in agglutinative languages, we incorporate morphological information into language model training and construct a morphology-aware masking based BERT model (MAM-BERT). Experimental results with various advanced TTS models validate the effectiveness of our approach, and comparisons across data scales further confirm its effectiveness in low-resource scenarios.
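The core idea behind MAM-BERT is to mask whole morphological units (stems and suffixes) during masked-language-model pre-training, so the model learns morpheme-level context that correlates with prosody. Below is a minimal sketch of such whole-morpheme masking, not the authors' released implementation: the morpheme segmentation, mask symbol, and masking rate shown here are illustrative assumptions.

import random

MASK_TOKEN = "[MASK]"   # BERT-style mask symbol (assumed)
MASK_PROB = 0.15        # masking rate; illustrative, may differ from the paper

def morphology_aware_mask(morphemes, mask_prob=MASK_PROB, seed=None):
    """Whole-morpheme masking for masked-language-model pre-training.

    `morphemes` is a list of lists: each inner list holds the subword
    tokens of one morphological unit (a stem or a suffix). Instead of
    masking random subword tokens, a selected unit is masked in full,
    so the model must reconstruct the whole morpheme from context.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for unit in morphemes:
        if rng.random() < mask_prob:
            # Mask the whole morpheme; its tokens become prediction targets.
            inputs.extend([MASK_TOKEN] * len(unit))
            labels.extend(unit)
        else:
            # Keep the morpheme visible; no loss is computed on its tokens.
            inputs.extend(unit)
            labels.extend([None] * len(unit))
    return inputs, labels

# Hypothetical word split as stem + two suffixes (the segmentation is made up).
tokens, targets = morphology_aware_mask(
    [["sur", "##gag"], ["##ulcha"], ["##gsad"]], seed=1)
print(tokens)   # ['[MASK]', '[MASK]', '##ulcha', '##gsad']
print(targets)  # ['sur', '##gag', None, None]

In the paper's pipeline, the language model pre-trained with this kind of objective is then used as a feature extractor whose representations of the input text condition the TTS model.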
Pages: 1075 - 1087
Number of pages: 13
Related Papers
50 records in total
  • [41] Pre-training Techniques for Improving Text-to-Speech Synthesis by Automatic Speech Recognition Based Data Enhancement
    Liu, Yazhu
    Xue, Shaofei
    Tang, Jian
    MAN-MACHINE SPEECH COMMUNICATION, NCMMSC 2022, 2023, 1765 : 162 - 172
  • [42] Text-Inductive Graphone-Based Language Adaptation for Low-Resource Speech Synthesis
    Saeki, Takaaki
    Maiti, Soumi
    Li, Xinjian
    Watanabe, Shinji
    Takamichi, Shinnosuke
    Saruwatari, Hiroshi
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2024, 32 : 1829 - 1844
  • [43] Multi-Stage Pre-training for Low-Resource Domain Adaptation
    Zhang, Rong
    Reddy, Revanth Gangi
    Sultan, Md Arafat
    Castelli, Vittorio
    Ferritto, Anthony
    Florian, Radu
    Kayi, Efsun Sarioglu
    Roukos, Salim
    Sil, Avirup
    Ward, Todd
    PROCEEDINGS OF THE 2020 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP), 2020, : 5461 - 5468
  • [44] FlauBERT: Unsupervised Language Model Pre-training for French
    Le, Hang
    Vial, Loic
    Frej, Jibril
    Segonne, Vincent
    Coavoux, Maximin
    Lecouteux, Benjamin
    Allauzen, Alexandre
    Crabbe, Benoit
    Besacier, Laurent
    Schwab, Didier
    PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2020), 2020, : 2479 - 2490
  • [45] Soft Language Clustering for Multilingual Model Pre-training
    Zeng, Jiali
    Jiang, Yufan
    Yin, Yongjing
    Jing, Yi
    Meng, Fandong
    Lin, Binghuai
    Cao, Yunbo
    Zhou, Jie
    PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, VOL 1, 2023, : 7021 - 7035
  • [46] Pre-Training on Mixed Data for Low-Resource Neural Machine Translation
    Zhang, Wenbo
    Li, Xiao
    Yang, Yating
    Dong, Rui
    INFORMATION, 2021, 12 (03)
  • [47] Named-Entity Recognition for a Low-resource Language using Pre-Trained Language Model
    Yohannes, Hailemariam Mehari
    Amagasa, Toshiyuki
    37TH ANNUAL ACM SYMPOSIUM ON APPLIED COMPUTING, 2022, : 837 - 844
  • [48] Object-aware Video-language Pre-training for Retrieval
    Wang, Alex Jinpeng
    Ge, Yixiao
    Cai, Guanyu
    Yan, Rui
    Lin, Xudong
    Shan, Ying
    Qie, Xiaohu
    Shou, Mike Zheng
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 3303 - 3312
  • [49] Improving Medical Speech-to-Text Accuracy using Vision-Language Pre-training Models
    Huh, Jaeyoung
    Park, Sangjoon
    Lee, Jeong Eun
    Ye, Jong Chul
    IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, 2024, 28 (03) : 1692 - 1703
  • [50] Vision-Language Pre-Training for Boosting Scene Text Detectors
    Song, Sibo
    Wan, Jianqiang
    Yang, Zhibo
    Tang, Jun
    Cheng, Wenqing
    Bai, Xiang
    Yao, Cong
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 15660 - 15670