ZERO-SHOT TEXT-TO-SPEECH SYNTHESIS CONDITIONED USING SELF-SUPERVISED SPEECH REPRESENTATION MODEL

Cited by: 1
Authors
Fujita, Kenichi [1 ]
Ashihara, Takanori [1 ]
Kanagawa, Hiroki [1 ]
Moriya, Takafumi [1 ]
Ijima, Yusuke [1 ]
Affiliations
[1] NTT Corp, Tokyo, Japan
Source
2023 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING WORKSHOPS, ICASSPW | 2023
Keywords
Speech synthesis; self-supervised learning model; speaker embeddings; zero-shot TTS
DOI
10.1109/ICASSPW59220.2023.10193459
Chinese Library Classification (CLC)
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
This paper proposes a zero-shot text-to-speech (TTS) method conditioned on a speech-representation model acquired through self-supervised learning (SSL). Conventional methods that use embedding vectors from x-vectors or global style tokens still leave a gap in reproducing the speaker characteristics of unseen speakers. A novel point of the proposed method is the direct use of the SSL model to obtain embedding vectors from speech representations learned from a large amount of data. We also introduce separate conditioning of the acoustic features and the phoneme duration predictor to obtain embeddings that disentangle rhythm-based speaker characteristics from acoustic-feature-based ones. These disentangled embeddings enable better reproduction of unseen speakers and rhythm transfer conditioned on a different reference utterance. Objective and subjective evaluations showed that the proposed method can synthesize speech with improved similarity and achieve speech-rhythm transfer.
Pages: 5
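
The abstract describes two design points: speaker embeddings taken directly from an SSL speech-representation model, and separate conditioning of the acoustic features and the phoneme duration predictor so that rhythm-based and acoustic-feature-based speaker characteristics stay disentangled. Below is a minimal, illustrative PyTorch sketch of that conditioning scheme, not the authors' implementation; the SSL encoder choice, module names (SSLSpeakerEncoder, ZeroShotTTSSketch), dimensions, and network architectures are assumptions made here for clarity, and length regulation plus the vocoder are omitted.

import torch
import torch.nn as nn


class SSLSpeakerEncoder(nn.Module):
    """Pools frame-level SSL features of a reference utterance into one
    fixed-size embedding (used here in place of an x-vector or style tokens)."""

    def __init__(self, ssl_dim: int = 768, emb_dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(ssl_dim, emb_dim)

    def forward(self, ssl_feats: torch.Tensor) -> torch.Tensor:
        # ssl_feats: (batch, frames, ssl_dim) from a frozen SSL model
        return self.proj(ssl_feats.mean(dim=1))  # (batch, emb_dim)


class ZeroShotTTSSketch(nn.Module):
    """Two separate reference embeddings: one conditions the phoneme duration
    predictor (rhythm), the other the acoustic decoder (voice quality)."""

    def __init__(self, n_phonemes: int = 100, emb_dim: int = 256, n_mels: int = 80):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, emb_dim)
        self.rhythm_encoder = SSLSpeakerEncoder(emb_dim=emb_dim)
        self.voice_encoder = SSLSpeakerEncoder(emb_dim=emb_dim)
        self.duration_rnn = nn.GRU(2 * emb_dim, emb_dim, batch_first=True)
        self.duration_out = nn.Linear(emb_dim, 1)
        self.decoder_rnn = nn.GRU(2 * emb_dim, emb_dim, batch_first=True)
        self.mel_out = nn.Linear(emb_dim, n_mels)

    def forward(self, phonemes, ssl_feats_rhythm, ssl_feats_voice):
        # phonemes: (batch, T) phoneme IDs
        # ssl_feats_*: (batch, frames, 768) SSL features of reference utterances
        h = self.phoneme_emb(phonemes)                              # (B, T, E)
        rhythm = self.rhythm_encoder(ssl_feats_rhythm)              # (B, E)
        voice = self.voice_encoder(ssl_feats_voice)                 # (B, E)

        # The duration predictor is conditioned only on the rhythm embedding.
        d_in = torch.cat([h, rhythm.unsqueeze(1).expand_as(h)], dim=-1)
        d_h, _ = self.duration_rnn(d_in)
        log_durations = self.duration_out(d_h).squeeze(-1)          # (B, T)

        # The acoustic decoder is conditioned only on the voice embedding.
        # (Length regulation to frame level and vocoding are omitted here.)
        a_in = torch.cat([h, voice.unsqueeze(1).expand_as(h)], dim=-1)
        a_h, _ = self.decoder_rnn(a_in)
        mel = self.mel_out(a_h)                                     # (B, T, n_mels)
        return log_durations, mel

Because the rhythm and voice embeddings are computed from reference utterances passed in separately, feeding a different speaker's utterance as ssl_feats_rhythm corresponds to the speech-rhythm-transfer use case mentioned in the abstract.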