Low-Resource Speech Synthesis with Speaker-Aware Embedding

被引:4
|
作者
Yang, Li-Jen [1 ]
Yeh, I-Ping [2 ]
Chien, Jen-Tzung [1 ]
机构
[1] Natl Yang Ming Chiao Tung Univ, Inst Elect & Comp Engn, Hsinchu, Taiwan
[2] Natl Yang Ming Chiao Tung Univ, Grad Degree Program Cybersecur, Hsinchu, Taiwan
关键词
low-resource speech synthesis; speaker-aware embedding; encoder-decoder model; transformer; NETWORKS;
D O I
10.1109/ISCSLP57327.2022.10038221
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Speech synthesis has been successfully exploited for mapping from text sequence to speech waveform where high-resource languages have been well studied and learned from a large amount of text-speech paired data in public-domain corpora. However, developing speech synthesis under low-resource languages is challenging for speech communication in local regions since the collection of training data is expensive. In particular, the speaker-aware speech generation under low-resource settings is crucial in real world. Such a problem is increasingly difficult in case of very limited speaker-specific data. This paper presents a speaker-aware speech synthesis under low-resource settings based on an encoder-decoder framework by using transformer. Knowledge transfer is performed by incorporating a speaker-aware embedding through first learning a pretrained transformer from multi-speaker data of a low-populated spoken language and then fine-tuning the transformer to a target speaker with very limited speaker-specific embeddings. Experiments on low-resource Taiwanese speech synthesis are evaluated to show the merit of speaker-aware transformer in terms of Mel cepstral distortion and mean opinion score.
引用
收藏
页码:235 / 239
页数:5
相关论文
共 50 条
  • [31] SA-WavLM: Speaker-Aware Self-Supervised Pre-training for Mixture Speech
    Lin, Jingru
    Ge, Meng
    Ao, Junyi
    Deng, Liqun
    Li, Haizhou
    INTERSPEECH 2024, 2024, : 597 - 601
  • [32] Speaker-Aware Long Short-Term Memory Multi-Task Learning for Speech Recognition
    Pironkov, Gueorgui
    Dupont, Stephane
    Dutoit, Thierry
    2016 24TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO), 2016, : 1911 - 1915
  • [33] CURRICULUM OPTIMIZATION FOR LOW-RESOURCE SPEECH RECOGNITION
    Kuznetsova, Anastasia
    Kumar, Anurag
    Fox, Jennifer Drexler
    Tyers, Francis M.
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 8187 - 8191
  • [34] SPEAKER AUGMENTATION FOR LOW RESOURCE SPEECH RECOGNITION
    Du, Chenpeng
    Yu, Kai
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 7719 - 7723
  • [35] LOW-RESOURCE CONTEXTUAL TOPIC IDENTIFICATION ON SPEECH
    Liu, Chunxi
    Wiesner, Matthew
    Watanabe, Shinji
    Harman, Craig
    Trmal, Jan
    Dehak, Najim
    Khudanpur, Sanjeev
    2018 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2018), 2018, : 656 - 663
  • [36] Semi-supervised Multichannel Speech Separation Based on a Phone- and Speaker-Aware Deep Generative Model of Speech Spectrograms
    Du, Yicheng
    Sekiguchi, Kouhei
    Bando, Yoshiaki
    Nugraha, Aditya Arie
    Fontaine, Mathieu
    Yoshii, Kazuyoshi
    Kawahara, Tatsuya
    28TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO 2020), 2021, : 870 - 874
  • [37] Low-Resource Speech-to-Text Translation
    Bansal, Sameer
    Kamper, Herman
    Livescu, Karen
    Lopez, Adam
    Goldwater, Sharon
    19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES, 2018, : 1298 - 1302
  • [38] Text-to-speech for low-resource systems
    Schnell, M
    Küstner, M
    Jokisch, O
    Hoffmann, R
    PROCEEDINGS OF THE 2002 IEEE WORKSHOP ON MULTIMEDIA SIGNAL PROCESSING, 2002, : 259 - 262
  • [39] OntoED: Low-resource Event Detection with Ontology Embedding
    Deng, Shumin
    Zhang, Ningyu
    Li, Luoqiu
    Chen, Hui
    Tou, Huaixiao
    Chen, Mosha
    Huang, Fei
    Chen, Huajun
    59TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS AND THE 11TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING (ACL-IJCNLP 2021), VOL 1, 2021, : 2828 - 2839
  • [40] Enrollment in low-resource speech recognition systems
    Deligne, S
    Dharanipragada, S
    2004 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL I, PROCEEDINGS: SPEECH PROCESSING, 2004, : 341 - 344