SpeechTripleNet: End-to-End Disentangled Speech Representation Learning for Content, Timbre and Prosody

Cited by: 0
Authors
Lu, Hui [1 ]
Wu, Xixin [1 ]
Wu, Zhiyong [2 ]
Meng, Helen [1 ]
Affiliations
[1] Chinese Univ Hong Kong, Hong Kong, Peoples R China
[2] Tsinghua Univ, Shenzhen, Peoples R China
Source
PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023 | 2023
Funding
National Natural Science Foundation of China
Keywords
Speech disentanglement; unsupervised representation learning; prosody modeling; VAE
DOI
10.1145/3581783.3612485
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Disentangled speech representation learning aims to separate the different factors of variation in speech into disjoint representations. This paper focuses on disentangling speech into representations of three factors: spoken content, speaker timbre, and speech prosody. Many previous speech disentanglement methods have focused on separating spoken content from speaker timbre; without explicit modeling of prosodic information, however, they suffer degraded speech generation performance and uncontrolled prosody leakage into the content and/or speaker representations. While some recent methods have used explicit speaker labels or pre-trained models to facilitate triple-factor disentanglement, no end-to-end method simultaneously disentangles all three factors using only unsupervised or self-supervised learning objectives. This paper introduces SpeechTripleNet, an end-to-end method that disentangles speech into representations of content, timbre, and prosody. Built on a VAE, SpeechTripleNet restricts the structures of the latent variables and the amount of information captured in them to induce disentanglement. It is a purely unsupervised/self-supervised method that requires only speech data and no additional labels. Our qualitative and quantitative results demonstrate that SpeechTripleNet effectively achieves triple-factor speech disentanglement and enables controllable speech editing with respect to each factor.
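To make the abstract's information-restriction idea concrete: a standard mechanism for limiting how much information a VAE latent can carry is a capacity-constrained KL penalty, as in Burgess et al.'s analysis of beta-VAE. The sketch below is illustrative only, not the paper's exact objective; the three latents z_c (content), z_t (timbre), and z_p (prosody), the per-factor weights \gamma_i, and the target capacities C_i are assumptions for exposition:

\mathcal{L} = \mathbb{E}_{q_\phi(z_c, z_t, z_p \mid x)}\big[\log p_\theta(x \mid z_c, z_t, z_p)\big] - \sum_{i \in \{c,\,t,\,p\}} \gamma_i \,\Big|\, D_{\mathrm{KL}}\big(q_\phi(z_i \mid x) \,\|\, p(z_i)\big) - C_i \,\Big|

Under such an objective, annealing each C_i upward caps the nats of information the corresponding latent can encode; combined with structural restrictions (e.g., a single per-utterance timbre vector versus per-frame content codes), this is one plausible route to the factor separation the abstract describes.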
Pages: 2829-2837
Number of pages: 9
Related Papers
50 in total
  • [1] CCSRD: Content-Centric Speech Representation Disentanglement Learning for End-to-End Speech Translation
    Zhao, Xiaohu
    Sun, Haoran
    Lei, Yikun
    Zhu, Shaolin
    Xiong, Deyi
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS - EMNLP 2023, 2023, : 5920 - 5932
  • [2] Prosody-TTS: An End-to-End Speech Synthesis System with Prosody Control
    Pamisetty, Giridhar
    Murty, K. Sri Rama
    CIRCUITS SYSTEMS AND SIGNAL PROCESSING, 2023, 42 (01) : 361 - 384
  • [3] Information Sieve: Content Leakage Reduction in End-to-End Prosody Transfer for Expressive Speech Synthesis
    Dai, Xudong
    Gong, Cheng
    Wang, Longbiao
    Zhang, Kaili
    INTERSPEECH 2021, 2021, : 131 - 135
  • [4] Deep End-to-End Representation Learning for Food Type Recognition from Speech
    Sertolli, Benjamin
    Cummins, Nicholas
    Sengur, Abdulkadir
    Schuller, Bjorn W.
    ICMI'18: PROCEEDINGS OF THE 20TH ACM INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION, 2018, : 574 - 578
  • [5] Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron
    Skerry-Ryan, R. J.
    Battenberg, Eric
    Xiao, Ying
    Wang, Yuxuan
    Stanton, Daisy
    Shor, Joel
    Weiss, Ron J.
    Clark, Rob
    Saurous, Rif A.
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 80, 2018, 80
  • [6] End-to-End Integration of Speech Recognition, Speech Enhancement, and Self-Supervised Learning Representation
    Chang, Xuankai
    Maekaku, Takashi
    Fujita, Yuya
    Watanabe, Shinji
    INTERSPEECH 2022, 2022, : 3819 - 3823
  • [7] Speech Representation Learning for Emotion Recognition Using End-to-End ASR with Factorized Adaptation
    Yeh, Sung-Lin
    Lin, Yun-Shao
    Lee, Chi-Chun
    INTERSPEECH 2020, 2020, : 536 - 540
  • [8] Robust and Fine-Grained Prosody Control of End-to-End Speech Synthesis
    Lee, Younggun
    Kim, Taesu
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 5911 - 5915
  • [9] Unsupervised Learning of Disentangled Speech Content and Style Representation
    Tjandra, Andros
    Pang, Ruoming
    Zhang, Yu
    Karita, Shigeki
    INTERSPEECH 2021, 2021, : 4089 - 4093