Low-Resource Speech Synthesis with Speaker-Aware Embedding

Cited by: 4

Authors:

Yang, Li-Jen ^{[1]}

Yeh, I-Ping ^{[2]}

Chien, Jen-Tzung ^{[1]}

Affiliations:

[1] Natl Yang Ming Chiao Tung Univ, Inst Elect & Comp Engn, Hsinchu, Taiwan

[2] Natl Yang Ming Chiao Tung Univ, Grad Degree Program Cybersecur, Hsinchu, Taiwan

Source:

2022 13TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP) | 2022

Keywords:

low-resource speech synthesis; speaker-aware embedding; encoder-decoder model; transformer; NETWORKS;

DOI:

10.1109/ISCSLP57327.2022.10038221

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Speech synthesis maps a text sequence to a speech waveform. High-resource languages have been well studied, with models learned from large amounts of paired text-speech data in public-domain corpora. Developing speech synthesis for low-resource languages, however, is challenging because collecting training data is expensive, which limits speech communication in local regions. In particular, speaker-aware speech generation under low-resource settings is crucial in the real world, and the problem becomes harder still when speaker-specific data are very limited. This paper presents a speaker-aware speech synthesis method for low-resource settings based on a transformer encoder-decoder framework. Knowledge transfer is performed by incorporating a speaker-aware embedding: a transformer is first pretrained on multi-speaker data of a low-populated spoken language and then fine-tuned to a target speaker with very limited speaker-specific embeddings. Experiments on low-resource Taiwanese speech synthesis show the merit of the speaker-aware transformer in terms of Mel cepstral distortion and mean opinion score.
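The abstract names Mel cepstral distortion (MCD) as one of its two evaluation metrics but does not say which variant was used. As a rough illustration only, the sketch below computes the commonly used form, MCD = (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2), averaged over frames. The function name `mel_cepstral_distortion` is hypothetical, and the code assumes the reference and synthesized frames are already time-aligned (for example by dynamic time warping) and that the caller has dropped the 0th (energy) coefficient.

```python
import math


def mel_cepstral_distortion(ref_frames, syn_frames):
    """Mean MCD in dB over time-aligned mel-cepstral frames.

    Assumptions (not taken from the paper): both inputs are lists of
    equal-length coefficient lists, already aligned frame by frame,
    with the 0th (energy) coefficient removed by the caller.
    """
    if not ref_frames or len(ref_frames) != len(syn_frames):
        raise ValueError("inputs must be non-empty and have the same number of frames")

    scale = 10.0 / math.log(10.0)  # converts the natural-log scale to dB
    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        if len(ref) != len(syn):
            raise ValueError("each frame pair must have the same dimension")
        squared_diff = sum((r - s) ** 2 for r, s in zip(ref, syn))
        total += scale * math.sqrt(2.0 * squared_diff)
    return total / len(ref_frames)


if __name__ == "__main__":
    reference = [[1.0, 0.5], [0.2, -0.1]]
    synthesized = [[0.9, 0.4], [0.2, -0.1]]
    print(f"MCD: {mel_cepstral_distortion(reference, synthesized):.3f} dB")
```

Lower values indicate that the synthesized spectra sit closer to the reference. The metric is objective and is usually reported alongside a subjective score such as the mean opinion score mentioned in the abstract.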


Pages: 235-239

Page count: 5
