THE THU-HCSI MULTI-SPEAKER MULTI-LINGUAL FEW-SHOT VOICE CLONING SYSTEM FOR LIMMITS'24 CHALLENGE

Cited by: 0
Authors
Zhou, Yixuan [1]
Zhou, Shuoyi [1]
Lei, Shun [1]
Wu, Zhiyong [1]
Wu, Menglin [2]
Affiliations
[1] Tsinghua Univ, Shenzhen Int Grad Sch, Shenzhen, Peoples R China
[2] ByteDance, Shanghai, Peoples R China
Keywords
text-to-speech; voice cloning; few-shot; multi-speaker; multi-lingual
DOI
10.1109/ICASSPW62465.2024.10626429
Chinese Library Classification (CLC)
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
This paper presents the multi-speaker multi-lingual few-shot voice cloning system developed by the THU-HCSI team for the LIMMITS'24 Challenge. To achieve high speaker similarity and naturalness in both mono-lingual and cross-lingual scenarios, we build the system upon YourTTS and add several enhancements. To further improve speaker similarity and speech quality, we introduce a speaker-aware text encoder and a flow-based decoder with Transformer blocks. In addition, we denoise the few-shot data, mix it with the pre-training data, and adopt a speaker-balanced sampling strategy to guarantee effective fine-tuning for the target speakers. The official evaluations in track 1 show that our system achieves the best speaker similarity MOS of 4.25 and obtains a considerable naturalness MOS of 3.97.
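The abstract names a speaker-balanced sampling strategy but does not say how it is implemented. As a minimal sketch only, one common way to realize such a strategy is inverse-frequency weighting, so that each speaker contributes the same total sampling probability regardless of how many utterances they have. The function names, the speaker labels, and the weighting scheme below are illustrative assumptions, not the authors' method.

```python
import random
from collections import Counter


def speaker_balanced_weights(speaker_ids):
    """Return one sampling weight per utterance.

    Assumption (not from the paper): weights are inversely proportional to
    the number of utterances of the utterance's speaker, normalized so that
    every speaker has the same total probability mass.
    """
    counts = Counter(speaker_ids)
    n_speakers = len(counts)
    return [1.0 / (n_speakers * counts[s]) for s in speaker_ids]


def sample_batch(utterances, speaker_ids, batch_size, rng=None):
    """Draw a batch (with replacement) using speaker-balanced weights."""
    rng = rng or random.Random()
    weights = speaker_balanced_weights(speaker_ids)
    return rng.choices(utterances, weights=weights, k=batch_size)


# Hypothetical mixed corpus: many pre-training utterances from one speaker,
# only a few few-shot utterances from a target speaker.
speaker_ids = ["pretrain_a"] * 8 + ["target"] * 2
utterances = [f"utt_{i}" for i in range(len(speaker_ids))]

weights = speaker_balanced_weights(speaker_ids)
batch = sample_batch(utterances, speaker_ids, batch_size=4, rng=random.Random(0))
```

With this weighting, the two target-speaker utterances are drawn as often in aggregate as the eight pre-training utterances, which is one way to keep few-shot speakers from being swamped when their data is mixed with a much larger corpus.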
Pages: 71-72
Page count: 2
Related papers
3 in total
  • [1] LIMMITS'24: MULTI-SPEAKER, MULTI-LINGUAL INDIC TTS WITH VOICE CLONING
    Singh, Abhayjeet
    Nagireddi, Amala
    Deekshitha, G.
    Bandekar, Jesuraja
    Roopa, R.
    Badiger, Sandhya
    Udupa, Sathvik
    Ghosh, Prasanta Kumar
    Murthy, Hema A.
    Kumar, Pranaw
    Tokuda, Keiichi
    Hasegawa-Johnson, Mark
    Olbrich, Philipp
    2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING WORKSHOPS, ICASSPW 2024, 2024, : 61 - 62
  • [2] LIMMITS'24: Multi-Speaker, Multi-Lingual INDIC TTS With Voice Cloning
    Udupa, Sathvik
    Bandekar, Jesuraja
    Singh, Abhayjeet
    Deekshitha, G.
    Kumar, Saurabh
    Badiger, Sandhya
    Nagireddi, Amala
    Roopa, R.
    Ghosh, Prasanta Kumar
    Murthy, Hema A.
    Kumar, Pranaw
    Tokuda, Keiichi
    Hasegawa-Johnson, Mark
    Olbrich, Philipp
    IEEE OPEN JOURNAL OF SIGNAL PROCESSING, 2025, 6 : 293 - 302
  • [3] Multi-Lingual Multi-Speaker Text-to-Speech Synthesis for Voice Cloning with Online Speaker Enrollment
    Liu, Zhaoyu
    Mak, Brian
    INTERSPEECH 2020, 2020, : 2932 - 2936