SpeechTripleNet: End-to-End Disentangled Speech Representation Learning for Content, Timbre and Prosody

Cited by: 0
Authors
Lu, Hui [1 ]
Wu, Xixin [1 ]
Wu, Zhiyong [2 ]
Meng, Helen [1 ]
Affiliations
[1] Chinese Univ Hong Kong, Hong Kong, Peoples R China
[2] Tsinghua Univ, Shenzhen, Peoples R China
Source
PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023 | 2023
Funding
National Natural Science Foundation of China
Keywords
Speech disentanglement; unsupervised representation learning; prosody modeling; VAE
DOI
10.1145/3581783.3612485
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Disentangled speech representation learning aims to separate the different factors of variation in speech into disjoint representations. This paper focuses on disentangling speech into representations of three factors: spoken content, speaker timbre, and speech prosody. Many previous speech disentanglement methods have focused on separating spoken content from speaker timbre; without explicit modeling of prosodic information, however, they suffer degraded speech generation performance and uncontrolled prosody leakage into the content and/or speaker representations. While some recent methods have used explicit speaker labels or pre-trained models to facilitate triple-factor disentanglement, no end-to-end method simultaneously disentangles all three factors using only unsupervised or self-supervised learning objectives. This paper introduces SpeechTripleNet, an end-to-end method that disentangles speech into representations of content, timbre, and prosody. Built on a VAE, SpeechTripleNet restricts the structures of the latent variables and the amount of information captured in them to induce disentanglement. It is a purely unsupervised/self-supervised method that requires only speech data and no additional labels. Our qualitative and quantitative results demonstrate that SpeechTripleNet effectively achieves triple-factor speech disentanglement and enables controllable speech editing with respect to each factor.
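To make the abstract's information-restriction idea concrete: a standard mechanism for limiting how much information a VAE latent can carry is a capacity-constrained KL penalty, as in Burgess et al.'s analysis of beta-VAE. The sketch below is illustrative only, not the paper's exact objective; the three latents z_c (content), z_t (timbre), and z_p (prosody), the per-factor weights \gamma_i, and the target capacities C_i are assumptions for exposition:

\mathcal{L} = \mathbb{E}_{q_\phi(z_c, z_t, z_p \mid x)}\big[\log p_\theta(x \mid z_c, z_t, z_p)\big] - \sum_{i \in \{c,\,t,\,p\}} \gamma_i \,\Big|\, D_{\mathrm{KL}}\big(q_\phi(z_i \mid x) \,\|\, p(z_i)\big) - C_i \,\Big|

Under such an objective, annealing each C_i upward caps the nats of information the corresponding latent can encode; combined with structural restrictions (e.g., a single per-utterance timbre vector versus per-frame content codes), this is one plausible route to the factor separation the abstract describes.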
Pages: 2829-2837
Number of pages: 9
Related Papers
50 in total
  • [1] CCSRD: Content-Centric Speech Representation Disentanglement Learning for End-to-End Speech Translation
    Zhao, Xiaohu
    Sun, Haoran
    Lei, Yikun
    Zhu, Shaolin
    Xiong, Deyi
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS - EMNLP 2023, 2023, : 5920 - 5932
  • [2] Prosody-TTS: An End-to-End Speech Synthesis System with Prosody Control
    Pamisetty, Giridhar
    Murty, K. Sri Rama
    CIRCUITS SYSTEMS AND SIGNAL PROCESSING, 2023, 42 (01) : 361 - 384
  • [3] Information Sieve: Content Leakage Reduction in End-to-End Prosody Transfer for Expressive Speech Synthesis
    Dai, Xudong
    Gong, Cheng
    Wang, Longbiao
    Zhang, Kaili
    INTERSPEECH 2021, 2021, : 131 - 135
  • [4] Deep End-to-End Representation Learning for Food Type Recognition from Speech
    Sertolli, Benjamin
    Cummins, Nicholas
    Sengur, Abdulkadir
    Schuller, Bjorn W.
    ICMI'18: PROCEEDINGS OF THE 20TH ACM INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION, 2018, : 574 - 578
  • [5] Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron
    Skerry-Ryan, R. J.
    Battenberg, Eric
    Xiao, Ying
    Wang, Yuxuan
    Stanton, Daisy
    Shor, Joel
    Weiss, Ron J.
    Clark, Rob
    Saurous, Rif A.
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 80, 2018, 80
  • [6] End-to-End Integration of Speech Recognition, Speech Enhancement, and Self-Supervised Learning Representation
    Chang, Xuankai
    Maekaku, Takashi
    Fujita, Yuya
    Watanabe, Shinji
    INTERSPEECH 2022, 2022, : 3819 - 3823
  • [7] Speech Representation Learning for Emotion Recognition Using End-to-End ASR with Factorized Adaptation
    Yeh, Sung-Lin
    Lin, Yun-Shao
    Lee, Chi-Chun
    INTERSPEECH 2020, 2020, : 536 - 540
  • [8] Robust and Fine-Grained Prosody Control of End-to-End Speech Synthesis
    Lee, Younggun
    Kim, Taesu
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 5911 - 5915
  • [9] Unsupervised Learning of Disentangled Speech Content and Style Representation
    Tjandra, Andros
    Pang, Ruoming
    Zhang, Yu
    Karita, Shigeki
    INTERSPEECH 2021, 2021, : 4089 - 4093