THE THU-HCSI MULTI-SPEAKER MULTI-LINGUAL FEW-SHOT VOICE CLONING SYSTEM FOR LIMMITS'24 CHALLENGE

Times Cited: 0

Authors:

Zhou, Yixuan [1]

Zhou, Shuoyi [1]

Lei, Shun [1]

Wu, Zhiyong [1]

Wu, Menglin [2]

Affiliations:

[1] Tsinghua Univ, Shenzhen Int Grad Sch, Shenzhen, Peoples R China

[2] ByteDance, Shanghai, Peoples R China

Source:

2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING WORKSHOPS, ICASSPW 2024 | 2024

Keywords:

text-to-speech; voice cloning; few-shot; multi-speaker; multi-lingual

DOI:

10.1109/ICASSPW62465.2024.10626429

Chinese Library Classification (CLC) Number:

O42 [Acoustics]

Discipline Classification Codes:

070206; 082403

Abstract:

This paper presents the multi-speaker, multi-lingual, few-shot voice cloning system developed by the THU-HCSI team for the LIMMITS'24 Challenge. To achieve high speaker similarity and naturalness in both mono-lingual and cross-lingual scenarios, we build the system upon YourTTS and add several enhancements. To further improve speaker similarity and speech quality, we introduce a speaker-aware text encoder and a flow-based decoder with Transformer blocks. In addition, we denoise the few-shot data, mix it with the pre-training data, and adopt a speaker-balanced sampling strategy to ensure effective fine-tuning for the target speakers. The official evaluation results for Track 1 show that our system achieves the best speaker similarity MOS of 4.25 and a naturalness MOS of 3.97.
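The abstract does not specify how the speaker-balanced sampling is implemented. One common way to realize such a strategy, shown here only as a sketch under our own assumptions, is to pool the few-shot utterances with the pre-training utterances and weight each utterance by the inverse of its speaker's utterance count, so that a target speaker with only a handful of clips is drawn as often as a pre-training speaker with hundreds. The speaker IDs, utterance counts, and the helper names `speaker_balanced_weights` and `sample_batch` are hypothetical and are not taken from the THU-HCSI system.

```python
import random
from collections import Counter
from typing import List, Tuple

# An utterance is a (speaker_id, utterance_id) pair; all data here is hypothetical.
Utterance = Tuple[str, str]


def speaker_balanced_weights(utts: List[Utterance]) -> List[float]:
    """Weight each utterance by 1 / (utterance count of its speaker),
    so every speaker receives the same total sampling mass."""
    counts = Counter(spk for spk, _ in utts)
    return [1.0 / counts[spk] for spk, _ in utts]


def sample_batch(utts: List[Utterance], batch_size: int,
                 rng: random.Random) -> List[Utterance]:
    """Draw utterances with replacement using speaker-balanced weights."""
    weights = speaker_balanced_weights(utts)
    return rng.choices(utts, weights=weights, k=batch_size)


# Hypothetical mix: one large pre-training speaker plus one few-shot target speaker.
pretrain = [("pretrain_spk", f"p{i}") for i in range(1000)]
few_shot = [("target_spk", f"t{i}") for i in range(5)]
mixed = pretrain + few_shot  # few-shot data pooled with pre-training data

rng = random.Random(0)
batch = sample_batch(mixed, 10000, rng)
share = sum(1 for spk, _ in batch if spk == "target_spk") / len(batch)
print(f"target speaker share: {share:.2f}")  # expected to be close to 0.50
```

Without the weights, the target speaker would make up only about 0.5% of the draws (5 of 1005 utterances); with inverse-frequency weighting each speaker contributes roughly half. Sampling with replacement is one design choice among several; a per-batch round-robin over speakers would yield a similar balance.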

Pages: 71-72

Page Count: 2