Connectionist temporal classification loss for vector quantized variational autoencoder in zero-shot voice conversion

Cited: 6
Authors
Kang, Xiao [1 ]
Huang, Hao [1 ,2 ]
Hu, Ying [1 ]
Huang, Zhihua [1 ]
Affiliations
[1] Xinjiang Univ, Sch Informat Sci & Engn, Urumqi, Peoples R China
[2] Xinjiang Prov Key Lab Multilingual Informat Techn, Urumqi, Peoples R China
Funding
National Key Research and Development Program of China;
Keywords
Voice conversion; Zero-shot; VQ-VAE; Connectionist temporal classification; NEURAL-NETWORKS;
DOI
10.1016/j.dsp.2021.103110
Chinese Library Classification
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology];
Discipline Classification Code
0808; 0809;
Abstract
Vector quantized variational autoencoder (VQ-VAE) has recently become an increasingly popular method for non-parallel zero-shot voice conversion (VC). The reason is that VQ-VAE can disentangle the content and speaker representations of speech with a content encoder and a speaker encoder, which suits the VC task of making a source speaker's speech sound like the target speaker's speech without changing the linguistic content. However, the converted speech is unsatisfying because pure content representations are difficult to disentangle from the acoustic features, owing to the lack of linguistic supervision on the content encoder. To address this issue, under the VQ-VAE framework, a connectionist temporal classification (CTC) loss is proposed to guide the content encoder toward pure content representations through an auxiliary network. Because the CTC loss is unaffected by the sequence length of the content encoder's output, adding this linguistic supervision to the content encoder becomes much easier. The resulting non-parallel many-to-many voice conversion model is named CTC-VQ-VAE. VC experiments on the CMU ARCTIC and VCTK corpora are carried out to evaluate the proposed method. Both objective and subjective results show that the proposed approach significantly improves the speech quality and speaker similarity of the converted speech compared with the traditional VQ-VAE method. (C) 2021 Elsevier Inc. All rights reserved.
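To make the idea in the abstract concrete, below is a minimal PyTorch sketch of attaching a CTC term to a VQ-VAE content encoder through an auxiliary network. This is not the authors' implementation: the toy encoder, the auxiliary projection head, the phoneme vocabulary size, and the loss weighting are all assumptions for illustration; only torch.nn.CTCLoss is a real API.

```python
# Hedged sketch: CTC supervision on a content encoder's output (assumed shapes/names).
import torch
import torch.nn as nn

class ContentEncoder(nn.Module):
    """Toy stand-in for the VQ-VAE content encoder (assumed architecture)."""
    def __init__(self, n_mels=80, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, batch_first=True, bidirectional=True)

    def forward(self, mels):            # mels: (batch, frames, n_mels)
        out, _ = self.rnn(mels)         # out: (batch, frames, 2*hidden)
        return out

class CTCAuxiliaryHead(nn.Module):
    """Auxiliary network mapping content features to phoneme logits for CTC."""
    def __init__(self, in_dim=512, n_phonemes=70):      # vocab size incl. blank is assumed
        super().__init__()
        self.proj = nn.Linear(in_dim, n_phonemes)
        self.ctc = nn.CTCLoss(blank=0, zero_infinity=True)

    def forward(self, content, content_lens, phonemes, phoneme_lens):
        log_probs = self.proj(content).log_softmax(dim=-1)   # (B, T, V)
        # nn.CTCLoss expects (T, B, V); it marginalizes over alignments, so the
        # phoneme targets need no frame-level alignment to the encoder output.
        return self.ctc(log_probs.transpose(0, 1), phonemes, content_lens, phoneme_lens)

# Usage: combine with the usual VQ-VAE reconstruction/commitment terms.
encoder, aux = ContentEncoder(), CTCAuxiliaryHead()
mels = torch.randn(2, 120, 80)                    # 2 utterances, 120 frames each
phonemes = torch.randint(1, 70, (2, 30))          # padded phoneme targets (labels 1..69)
ctc_loss = aux(encoder(mels), torch.tensor([120, 120]), phonemes, torch.tensor([30, 25]))
# total_loss = recon_loss + commit_loss + lambda_ctc * ctc_loss   (weighting assumed)
print(ctc_loss.item())
```

The sketch illustrates the point made in the abstract: because CTC sums over all valid alignments, the phoneme targets do not need to match the frame rate of the content encoder's output, so the linguistic supervision can be added without any forced alignment.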
Pages: 10
Related Papers
Total: 50 records
  • [31] Zero-shot Voice Conversion via Self-supervised Prosody Representation Learning
    Wang, Shijun
    Borth, Damian
    2022 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2022,
  • [32] Zero-Shot Classification Based on Word Vector Enhancement and Distance Metric Learning
    Zhang, Ji
    Chen, Yu
    Zhai, Yongjie
    IEEE ACCESS, 2020, 8 (08): 102292 - 102302
  • [33] Face-Driven Zero-Shot Voice Conversion with Memory-based Face-Voice Alignment
    Sheng, Zheng-Yan
    Ai, Yang
    Chen, Yan-Nian
    Ling, Zhen-Hua
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 8443 - 8452
  • [34] SLMGAN: EXPLOITING SPEECH LANGUAGE MODEL REPRESENTATIONS FOR UNSUPERVISED ZERO-SHOT VOICE CONVERSION IN GANS
    Li, Yinghao Aaron
    Han, Cong
    Mesgarani, Nima
    2023 IEEE WORKSHOP ON APPLICATIONS OF SIGNAL PROCESSING TO AUDIO AND ACOUSTICS, WASPAA, 2023,
  • [35] Hybrid attribute conditional adversarial denoising autoencoder for zero-shot classification of mechanical intelligent fault diagnosis
    Lv, Haixin
    Chen, Jinglong
    Pan, Tongyang
    Zhou, Zitong
    APPLIED SOFT COMPUTING, 2020, 95 (95)
  • [36] Embracing Diversity: Interpretable Zero-shot classification beyond one vector per class
    Moayeri, Mazda
    Rabbat, Michael
    Ibrahim, Mark
    Bouchacourt, Diane
    PROCEEDINGS OF THE 2024 ACM CONFERENCE ON FAIRNESS, ACCOUNTABILITY, AND TRANSPARENCY, ACM FACCT 2024, 2024, : 2302 - 2321
  • [37] ZeroAE: Pre-trained Language Model based Autoencoder for Transductive Zero-shot Text Classification
    Guo, Kaihao
    Yu, Hang
    Liao, Cong
    Li, Jianguo
    Zhang, Haipeng
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, 2023, : 3202 - 3219
  • [38] Zero-shot Micro-video Classification with Neural Variational Inference in Graph Prototype Network
    Chen, Junyang
    Wang, Jialong
    Dai, Zhijiang
    Wu, Huisi
    Wang, Mengzhu
    Zhang, Qin
    Wang, Huan
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 966 - 974
  • [39] NNSPEECH: SPEAKER-GUIDED CONDITIONAL VARIATIONAL AUTOENCODER FOR ZERO-SHOT MULTI-SPEAKER TEXT-TO-SPEECH
    Zhao, Botao
    Zhang, Xulong
    Wang, Jianzong
    Cheng, Ning
    Xiao, Jing
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 4293 - 4297
  • [40] Two-stage and Self-supervised Voice Conversion for Zero-Shot Dysarthric Speech Reconstruction
    Liu, Dong
    Lin, Yueqian
    Bu, Hui
    Li, Ming
    2024 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING, IALP 2024, 2024, : 423 - 427