ZERO-SHOT TEXT-TO-SPEECH SYNTHESIS CONDITIONED USING SELF-SUPERVISED SPEECH REPRESENTATION MODEL

Cited by: 1
Authors
Fujita, Kenichi [1 ]
Ashihara, Takanori [1 ]
Kanagawa, Hiroki [1 ]
Moriya, Takafumi [1 ]
Ijima, Yusuke [1 ]
Affiliations
[1] NTT Corp, Tokyo, Japan
Keywords
Speech synthesis; self-supervised learning model; speaker embeddings; zero-shot TTS
DOI
10.1109/ICASSPW59220.2023.10193459
Chinese Library Classification
O42 [Acoustics]
Discipline codes
070206; 082403
Abstract
This paper proposes a zero-shot text-to-speech (TTS) method conditioned on a speech-representation model acquired through self-supervised learning (SSL). Conventional methods that use embedding vectors from x-vectors or global style tokens still fall short of reproducing the speaker characteristics of unseen speakers. The novelty of the proposed method is the direct use of the SSL model to obtain embedding vectors from speech representations trained on a large amount of data. We also introduce separate conditioning of the acoustic features and the phoneme duration predictor to obtain embeddings that disentangle rhythm-based speaker characteristics from acoustic-feature-based ones. These disentangled embeddings enable better reproduction of unseen speakers and rhythm transfer conditioned on different utterances. Objective and subjective evaluations showed that the proposed method can synthesize speech with improved similarity and achieve speech-rhythm transfer.
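The separate conditioning described in the abstract can be illustrated with a minimal sketch: an utterance-level speaker embedding is pooled from frame-level SSL features, then projected through two independent heads, one conditioning the acoustic-feature decoder and one conditioning the phoneme duration predictor. This is an illustrative assumption of the mechanism, not the paper's implementation; the pooling strategy, dimensions (768 matches common SSL models such as HuBERT-base), and linear heads are all hypothetical choices.

```python
import numpy as np

def pool_ssl_features(frames: np.ndarray) -> np.ndarray:
    """Average-pool frame-level SSL features (T, D) into one utterance-level embedding (D,)."""
    return frames.mean(axis=0)

class SpeakerConditioner:
    """Two independent projections of the pooled SSL embedding: one for the
    acoustic-feature decoder, one for the phoneme duration predictor, so that
    rhythm-related and acoustic-feature-related speaker characteristics are
    carried by separate (disentangled) conditioning vectors."""

    def __init__(self, ssl_dim: int, cond_dim: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        # Randomly initialized linear heads stand in for trained projections.
        self.w_acoustic = rng.standard_normal((ssl_dim, cond_dim)) / np.sqrt(ssl_dim)
        self.w_duration = rng.standard_normal((ssl_dim, cond_dim)) / np.sqrt(ssl_dim)

    def __call__(self, ssl_frames: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
        emb = pool_ssl_features(ssl_frames)
        return emb @ self.w_acoustic, emb @ self.w_duration

# Toy input: 120 frames of 768-dim SSL features for one utterance.
rng = np.random.default_rng(1)
frames = rng.standard_normal((120, 768))
cond = SpeakerConditioner(ssl_dim=768, cond_dim=256)
acoustic_emb, rhythm_emb = cond(frames)
print(acoustic_emb.shape, rhythm_emb.shape)  # (256,) (256,)
```

Because the two heads are separate, a rhythm embedding extracted from one reference utterance could in principle be combined with the acoustic embedding of another, which is the rhythm-transfer setting the abstract evaluates.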
Pages: 5
Related papers
50 items in total
  • [1] ZMM-TTS: Zero-Shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-Supervised Discrete Speech Representations
    Gong, Cheng
    Wang, Xin
    Cooper, Erica
    Wells, Dan
    Wang, Longbiao
    Dang, Jianwu
    Richmond, Korin
    Yamagishi, Junichi
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2024, 32 : 4036 - 4051
  • [2] VOICECRAFT: Zero-Shot Speech Editing and Text-to-Speech in the Wild
    Peng, Puyuan
    Huang, Po-Yao
    Li, Shang-Wen
    Mohamed, Abdelrahman
    Harwath, David
    PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 1: LONG PAPERS, 2024, : 12442 - 12462
  • [3] XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model
    Casanova, Edresson
    Davis, Kelly
    Goelge, Eren
    Goeknar, Gorkem
    Gulea, Iulian
    Hart, Logan
    Aljafari, Aya
    Meyer, Joshua
    Morais, Reuben
    Olayemi, Samuel
    Weber, Julian
    INTERSPEECH 2024, 2024, : 4978 - 4982
  • [4] SALTTS: Leveraging Self-Supervised Speech Representations for improved Text-to-Speech Synthesis
    Sivaguru, Ramanan
    Lodagala, Vasista Sai
    Umesh, S.
    INTERSPEECH 2023, 2023, : 3033 - 3037
  • [5] Optimizing feature fusion for improved zero-shot adaptation in text-to-speech synthesis
    Chen, Zhiyong
    Ai, Zhiqi
    Ma, Youxuan
    Li, Xinnuo
    Xu, Shugong
    EURASIP JOURNAL ON AUDIO SPEECH AND MUSIC PROCESSING, 2024, 2024 (01):
  • [6] EXACT PROSODY CLONING IN ZERO-SHOT MULTISPEAKER TEXT-TO-SPEECH
    Lux, Florian
    Koch, Julia
    Vu, Ngoc Thang
    2022 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT, 2022, : 962 - 969
  • [7] Pruning Self-Attention for Zero-Shot Multi-Speaker Text-to-Speech
    Yoon, Hyungchan
    Kim, Changhwan
    Song, Eunwoo
    Yoon, Hyun-Wook
    Kang, Hong-Goo
    INTERSPEECH 2023, 2023, : 4299 - 4303
  • [8] Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration
    Tang, Chuanxin
    Luo, Chong
    Zhao, Zhiyuan
    Yin, Dacheng
    Zhao, Yucheng
    Zeng, Wenjun
    INTERSPEECH 2021, 2021, : 3600 - 3604
  • [9] Zero-Shot Voice Cloning Text-to-Speech for Dysphonia Disorder Speakers
    Azizah, Kurniawati
    IEEE ACCESS, 2024, 12 : 63528 - 63547
  • [10] Zero-Shot Text Classification via Self-Supervised Tuning
    Liu, Chaoqun
    Zhang, Wenxuan
    Chen, Guizhen
    Wu, Xiaobao
    Luu, Anh Tuan
    Chang, Chip Hong
    Bing, Lidong
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, 2023, : 1743 - 1761