Zero-shot voice conversion based on feature disentanglement

Cited: 0
Authors
Guo, Na [1 ]
Wei, Jianguo [1 ]
Li, Yongwei [2 ]
Lu, Wenhuan [1 ]
Tao, Jianhua [3 ]
Affiliations
[1] Tianjin Univ, Coll Intelligence & Comp, Tianjin, Peoples R China
[2] Chinese Acad Sci, Inst Psychol, CAS Key Lab Behav Sci, Beijing, Peoples R China
[3] Tsinghua Univ, Dept Automat, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Zero-shot voice conversion; Mixed speaker layer normalization; Adaptive attention weight normalization; Dynamic convolution; SPARSE REPRESENTATION; ADAPTATION; SPEAKER;
DOI
10.1016/j.specom.2024.103143
CLC Number
O42 [Acoustics];
Subject Classification Codes
070206; 082403;
Abstract
Voice conversion (VC) aims to convert the voice of a source speaker to that of a target speaker without modifying the linguistic content. Zero-shot voice conversion has attracted significant attention within VC because it can achieve conversion for speakers who did not appear during the training stage. Despite the significant progress made by previous methods in zero-shot VC, there is still room for improvement in separating speaker information from content information. In this paper, we propose a zero-shot VC method based on feature disentanglement. The proposed model uses a speaker encoder to extract speaker embeddings, introduces mixed speaker layer normalization to eliminate residual speaker information in the content encoding, and employs adaptive attention weight normalization for conversion. Furthermore, dynamic convolution is introduced to improve speech content modeling while requiring only a small number of parameters. The experiments demonstrate that the performance of the proposed model is superior to several state-of-the-art models, achieving both high similarity to the target speaker and high intelligibility. In addition, the decoding speed of our model is much higher than that of existing state-of-the-art models.
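The disentanglement idea in the abstract — normalize speaker statistics out of the content features, then re-inject target-speaker information via embedding-derived scale and bias — can be illustrated with a minimal NumPy sketch of speaker-conditioned layer normalization. This is not the paper's implementation; all function names, shapes, and the projection matrices `W_gamma`/`W_beta` are assumptions for illustration only.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each frame's feature vector to zero mean and unit
    variance, stripping per-utterance (speaker-dependent) statistics."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def speaker_conditioned_norm(content, spk_emb, W_gamma, W_beta):
    """Re-inject target-speaker information: project the speaker
    embedding to a per-channel scale (gamma) and bias (beta), then
    modulate the normalized, speaker-agnostic content features."""
    normed = layer_norm(content)      # (T, C) speaker-agnostic content
    gamma = spk_emb @ W_gamma         # (C,) scale from speaker embedding
    beta = spk_emb @ W_beta           # (C,) bias from speaker embedding
    return gamma * normed + beta

rng = np.random.default_rng(0)
T, C, D = 50, 80, 16                  # frames, channels, speaker-emb dim
content = rng.normal(size=(T, C))     # stand-in for content-encoder output
spk = rng.normal(size=D)              # stand-in for a speaker embedding
W_g = rng.normal(size=(D, C))
W_b = rng.normal(size=(D, C))
out = speaker_conditioned_norm(content, spk, W_g, W_b)
print(out.shape)  # (50, 80)
```

In a trained model the projections would be learned jointly with the encoders; the sketch only shows the mechanism by which normalization removes source-speaker statistics and conditioning restores target-speaker ones.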
Pages: 10
Related Papers
50 items in total
  • [21] ControlVC: Zero-Shot Voice Conversion with Time-Varying Controls on Pitch and Speed
    Chen, Meiying
    Duan, Zhiyao
    INTERSPEECH 2023, 2023, : 2098 - 2102
  • [22] TRAINING ROBUST ZERO-SHOT VOICE CONVERSION MODELS WITH SELF-SUPERVISED FEATURES
    Trung Dang
    Dung Tran
    Chin, Peter
    Koishida, Kazuhito
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 6557 - 6561
  • [23] StreamVoice+: Evolving Into End-to-End Streaming Zero-Shot Voice Conversion
    Wang, Zhichao
    Chen, Yuanzhe
    Wang, Xinsheng
    Xie, Lei
    Wang, Yuping
    IEEE SIGNAL PROCESSING LETTERS, 2024, 31 : 3000 - 3004
  • [24] CA-VC: A Novel Zero-Shot Voice Conversion Method With Channel Attention
    Xiao, Ruitong
    Xing, Xiaofen
    Yang, Jichen
    Xu, Xiangmin
    2021 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2021, : 800 - 807
  • [25] End-to-End Zero-Shot Voice Conversion with Location-Variable Convolutions
    Kang, Wonjune
    Hasegawa-Johnson, Mark
    Roy, Deb
    INTERSPEECH 2023, 2023, : 2303 - 2307
  • [26] Attribute disentanglement and re-entanglement for generalized zero-shot learning
    Zhou, Quan
    Liang, Yucuan
    Zhang, Zhenqi
    Cao, Wenming
    PATTERN RECOGNITION LETTERS, 2024, 186 : 1 - 7
  • [27] Zero-shot Voice Conversion via Self-supervised Prosody Representation Learning
    Wang, Shijun
    Borth, Damian
    2022 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2022,
  • [28] PMGNet: Disentanglement and entanglement benefit mutually for compositional zero-shot learning
    Liu, Yu
    Li, Jianghao
    Zhang, Yanyi
    Jia, Qi
    Wang, Weimin
    Pu, Nan
    Sebe, Nicu
    COMPUTER VISION AND IMAGE UNDERSTANDING, 2024, 249
  • [29] Contrastive semantic disentanglement in latent space for generalized zero-shot learning
    Fan, Wentao
    Liang, Chen
    Wang, Tian
    KNOWLEDGE-BASED SYSTEMS, 2022, 257