Zero-shot voice conversion based on feature disentanglement

Cited: 0
Authors
Guo, Na [1 ]
Wei, Jianguo [1 ]
Li, Yongwei [2 ]
Lu, Wenhuan [1 ]
Tao, Jianhua [3 ]
Affiliations
[1] Tianjin Univ, Coll Intelligence & Comp, Tianjin, Peoples R China
[2] Chinese Acad Sci, Inst Psychol, CAS Key Lab Behav Sci, Beijing, Peoples R China
[3] Tsinghua Univ, Dept Automat, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Zero-shot voice conversion; Mixed speaker layer normalization; Adaptive attention weight normalization; Dynamic convolution; SPARSE REPRESENTATION; ADAPTATION; SPEAKER;
DOI
10.1016/j.specom.2024.103143
CLC Number
O42 [Acoustics];
Subject Classification Codes
070206; 082403;
Abstract
Voice conversion (VC) aims to convert the voice of a source speaker to that of a target speaker without modifying the linguistic content. Zero-shot voice conversion has attracted significant attention within VC because it can achieve conversion for speakers who did not appear during the training stage. Despite the significant progress made by previous methods in zero-shot VC, there is still room for improvement in separating speaker information from content information. In this paper, we propose a zero-shot VC method based on feature disentanglement. The proposed model uses a speaker encoder to extract speaker embeddings, introduces mixed speaker layer normalization to eliminate residual speaker information in the content encoding, and employs adaptive attention weight normalization for conversion. Furthermore, dynamic convolution is introduced to improve speech content modeling while requiring only a small number of parameters. The experiments demonstrate that the performance of the proposed model is superior to several state-of-the-art models, achieving both high similarity to the target speaker and high intelligibility. In addition, the decoding speed of our model is much higher than that of existing state-of-the-art models.
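The disentanglement idea in the abstract — normalize speaker statistics out of the content features, then re-inject target-speaker information via embedding-derived scale and bias — can be illustrated with a minimal NumPy sketch of speaker-conditioned layer normalization. This is not the paper's implementation; all function names, shapes, and the projection matrices `W_gamma`/`W_beta` are assumptions for illustration only.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each frame's feature vector to zero mean and unit
    variance, stripping per-utterance (speaker-dependent) statistics."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def speaker_conditioned_norm(content, spk_emb, W_gamma, W_beta):
    """Re-inject target-speaker information: project the speaker
    embedding to a per-channel scale (gamma) and bias (beta), then
    modulate the normalized, speaker-agnostic content features."""
    normed = layer_norm(content)      # (T, C) speaker-agnostic content
    gamma = spk_emb @ W_gamma         # (C,) scale from speaker embedding
    beta = spk_emb @ W_beta           # (C,) bias from speaker embedding
    return gamma * normed + beta

rng = np.random.default_rng(0)
T, C, D = 50, 80, 16                  # frames, channels, speaker-emb dim
content = rng.normal(size=(T, C))     # stand-in for content-encoder output
spk = rng.normal(size=D)              # stand-in for a speaker embedding
W_g = rng.normal(size=(D, C))
W_b = rng.normal(size=(D, C))
out = speaker_conditioned_norm(content, spk, W_g, W_b)
print(out.shape)  # (50, 80)
```

In a trained model the projections would be learned jointly with the encoders; the sketch only shows the mechanism by which normalization removes source-speaker statistics and conditioning restores target-speaker ones.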
Pages: 10
Related Papers
50 items in total
  • [21] ControlVC: Zero-Shot Voice Conversion with Time-Varying Controls on Pitch and Speed
    Chen, Meiying
    Duan, Zhiyao
    INTERSPEECH 2023, 2023, : 2098 - 2102
  • [22] TRAINING ROBUST ZERO-SHOT VOICE CONVERSION MODELS WITH SELF-SUPERVISED FEATURES
    Trung Dang
    Dung Tran
    Chin, Peter
    Koishida, Kazuhito
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 6557 - 6561
  • [23] StreamVoice+: Evolving Into End-to-End Streaming Zero-Shot Voice Conversion
    Wang, Zhichao
    Chen, Yuanzhe
    Wang, Xinsheng
    Xie, Lei
    Wang, Yuping
    IEEE SIGNAL PROCESSING LETTERS, 2024, 31 : 3000 - 3004
  • [24] CA-VC: A Novel Zero-Shot Voice Conversion Method With Channel Attention
    Xiao, Ruitong
    Xing, Xiaofen
    Yang, Jichen
    Xu, Xiangmin
    2021 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2021, : 800 - 807
  • [25] End-to-End Zero-Shot Voice Conversion with Location-Variable Convolutions
    Kang, Wonjune
    Hasegawa-Johnson, Mark
    Roy, Deb
    INTERSPEECH 2023, 2023, : 2303 - 2307
  • [26] Attribute disentanglement and re-entanglement for generalized zero-shot learning
    Zhou, Quan
    Liang, Yucuan
    Zhang, Zhenqi
    Cao, Wenming
    PATTERN RECOGNITION LETTERS, 2024, 186 : 1 - 7
  • [27] Zero-shot Voice Conversion via Self-supervised Prosody Representation Learning
    Wang, Shijun
    Borth, Damian
    2022 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2022,
  • [28] PMGNet: Disentanglement and entanglement benefit mutually for compositional zero-shot learning
    Liu, Yu
    Li, Jianghao
    Zhang, Yanyi
    Jia, Qi
    Wang, Weimin
    Pu, Nan
    Sebe, Nicu
    COMPUTER VISION AND IMAGE UNDERSTANDING, 2024, 249
  • [29] Contrastive semantic disentanglement in latent space for generalized zero-shot learning
    Fan, Wentao
    Liang, Chen
    Wang, Tian
    KNOWLEDGE-BASED SYSTEMS, 2022, 257