Speaker-Independent Emotional Voice Conversion via Disentangled Representations

Cited by: 2
Authors
Chen, Xunquan [1 ]
Xu, Xuexin [2 ]
Chen, Jinhui [3 ]
Zhang, Zhizhong [2 ]
Takiguchi, Tetsuya [1 ]
Hancock, Edwin R. [4 ]
Affiliations
[1] Kobe Univ, Grad Sch Syst Informat, Kobe 6578501, Japan
[2] Xiamen Univ, Xiamen 361005, Peoples R China
[3] Prefectural Univ Hiroshima, Hiroshima 7348558, Japan
[4] Univ York, Dept Comp Sci, York YO10 5GH, England
Keywords
Emotional voice conversion; disentangled representation learning; adversarial learning; mutual information; speaker-independent; AUGMENTATION; NETWORKS; STARGAN; TIME;
DOI
10.1109/TMM.2022.3222646
CLC Number
TP [Automation Technology, Computer Technology];
Subject Classification Code
0812;
Abstract
Emotional Voice Conversion (EVC) technology aims to transfer the emotional state in speech while keeping the linguistic information and speaker identity unchanged. Prior studies on EVC have been limited to performing the conversion for a specific speaker or a predefined set of speakers seen during training. When encountering arbitrary speakers that may be unseen during training, existing EVC methods have limited conversion capabilities. Converting the emotion of arbitrary speakers, even those unseen during training, within a single model is far more challenging and far more attractive in real-world scenarios. To address this problem, we propose SIEVC, a novel speaker-independent emotional voice conversion framework for arbitrary speakers based on disentangled representation learning. The proposed method employs an autoencoder framework to disentangle the emotion information and emotion-independent information of each input speech signal into separate representation spaces. To achieve better disentanglement, we incorporate mutual information minimization into the training process. In addition, adversarial training is applied to enhance the quality of the generated audio signals. Finally, speaker-independent EVC for arbitrary speakers is achieved simply by replacing the emotion representations of the source speech with the target ones. The experimental results demonstrate that the proposed EVC model outperforms the baseline models in both objective and subjective evaluations for seen and unseen speakers.
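To make the swap-based conversion idea concrete, the following PyTorch-style sketch shows how an emotion encoder, an emotion-independent (content) encoder, and a decoder could be wired so that conversion amounts to replacing the source emotion embedding with one extracted from a reference utterance carrying the target emotion. This is a minimal illustration under stated assumptions: the module names, GRU layers, and dimensions are not the authors' SIEVC architecture, and the mutual-information minimization and adversarial losses described in the abstract are omitted for brevity.

```python
# Illustrative sketch only: module names, dimensions, and GRU layers are assumptions,
# not the SIEVC architecture; MI-minimization and adversarial losses are omitted.
import torch
import torch.nn as nn


class EmotionEncoder(nn.Module):
    """Maps a feature sequence (e.g., mel-spectrogram frames) to one emotion embedding."""
    def __init__(self, feat_dim=80, emo_dim=64):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, emo_dim, batch_first=True)

    def forward(self, x):          # x: (batch, frames, feat_dim)
        _, h = self.rnn(x)
        return h[-1]               # (batch, emo_dim)


class ContentEncoder(nn.Module):
    """Maps the same input to an emotion-independent frame-level representation."""
    def __init__(self, feat_dim=80, content_dim=128):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, content_dim, batch_first=True)

    def forward(self, x):
        out, _ = self.rnn(x)
        return out                 # (batch, frames, content_dim)


class Decoder(nn.Module):
    """Reconstructs features from the content sequence plus a broadcast emotion embedding."""
    def __init__(self, content_dim=128, emo_dim=64, feat_dim=80):
        super().__init__()
        self.rnn = nn.GRU(content_dim + emo_dim, feat_dim, batch_first=True)

    def forward(self, content, emotion):
        emotion = emotion.unsqueeze(1).expand(-1, content.size(1), -1)
        out, _ = self.rnn(torch.cat([content, emotion], dim=-1))
        return out                 # (batch, frames, feat_dim)


def convert(src_feats, ref_feats, content_enc, emotion_enc, decoder):
    """Conversion step: keep the source content, swap in the reference emotion embedding."""
    with torch.no_grad():
        content = content_enc(src_feats)
        target_emotion = emotion_enc(ref_feats)
        return decoder(content, target_emotion)


if __name__ == "__main__":
    # Toy usage with random tensors standing in for real acoustic features.
    src = torch.randn(1, 120, 80)   # source utterance: 120 frames of 80-dim features
    ref = torch.randn(1, 90, 80)    # reference utterance carrying the target emotion
    out = convert(src, ref, ContentEncoder(), EmotionEncoder(), Decoder())
    print(out.shape)                # torch.Size([1, 120, 80])
```

Because the emotion embedding is the only component swapped at conversion time, such a design is speaker-independent by construction as long as the two encoders are well disentangled, which is precisely what the mutual-information and adversarial objectives in the paper are meant to enforce.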
Pages: 7480-7493
Number of pages: 14
Related Papers
50 records in total
[1] Zhou, Kun; Sisman, Berrak; Zhang, Mingyang; Li, Haizhou. Converting Anyone's Emotion: Towards Speaker-Independent Emotional Voice Conversion. INTERSPEECH 2020, 2020: 3416-3420.
[2] Zhang, Qun; Wang, Shilin; Chen, Gongliang. Speaker-Independent Lipreading by Disentangled Representation Learning. 2021 IEEE International Conference on Image Processing (ICIP), 2021: 2493-2497.
[3] Nakashika, Toru; Takiguchi, Tetsuya; Ariki, Yasuo. Voice Conversion in Time-Invariant Speaker-Independent Space. 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014.
[4] Luo, Zhaojie; Lin, Shoufeng; Liu, Rui; Baba, Jun; Yoshikawa, Yuichiro; Ishiguro, Hiroshi. Decoupling Speaker-Independent Emotions for Voice Conversion via Source-Filter Networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023, 31: 11-24.
[5] Cheng, YM; Ma, CX; Melnar, L. Voice-to-Phoneme Conversion Algorithms for Speaker-Independent Voice-Tag Applications in Embedded Platforms. 2005 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2005: 403-408.
[6] Mohammadi, Seyed Hamidreza; Kain, Alexander. Voice Conversion Using Deep Neural Networks with Speaker-Independent Pre-Training. 2014 IEEE Workshop on Spoken Language Technology (SLT 2014), 2014: 19-23.
[7] Nose, Takashi; Kobayashi, Takao. Speaker-Independent HMM-Based Voice Conversion Using Quantized Fundamental Frequency. 11th Annual Conference of the International Speech Communication Association (INTERSPEECH 2010), Vols 3 and 4, 2010: 1724-1727.
[8] Zhang, Jing-Xuan; Ling, Zhen-Hua; Dai, Li-Rong. Non-Parallel Sequence-to-Sequence Voice Conversion With Disentangled Linguistic and Speaker Representations. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2020, 28: 540-552.
[9] Nose, Takashi; Kobayashi, Takao. Speaker-Independent HMM-Based Voice Conversion Using Adaptive Quantization of the Fundamental Frequency. Speech Communication, 2011, 53(7): 973-985.
[10] Kimura, T; Ashida, A; Niyada, K. Practical Speaker-Independent Voice Recognition Using Segmental Features. Electronics and Communications in Japan Part II: Electronics, 2004, 87(2): 73-81.