Speaker-Independent Emotional Voice Conversion via Disentangled Representations

Cited by: 2
Authors
Chen, Xunquan [1 ]
Xu, Xuexin [2 ]
Chen, Jinhui [3 ]
Zhang, Zhizhong [2 ]
Takiguchi, Tetsuya [1 ]
Hancock, Edwin R. [4 ]
Affiliations
[1] Kobe Univ, Grad Sch Syst Informat, Kobe 6578501, Japan
[2] Xiamen Univ, Xiamen 361005, Peoples R China
[3] Prefectural Univ Hiroshima, Hiroshima 7348558, Japan
[4] Univ York, Dept Comp Sci, York YO10 5GH, England
Keywords
Emotional voice conversion; disentangled representation learning; adversarial learning; mutual information; speaker-independent; AUGMENTATION; NETWORKS; STARGAN; TIME
DOI
10.1109/TMM.2022.3222646
Chinese Library Classification (CLC): TP [Automation Technology, Computer Technology]
Discipline code: 0812
Abstract
Emotional Voice Conversion (EVC) technology aims to transfer the emotional state of speech while keeping the linguistic information and speaker identity unchanged. Prior studies on EVC have been limited to performing conversion for a specific speaker or a predefined set of speakers seen during training. When encountering arbitrary speakers that were unseen during training, existing EVC methods have limited conversion capability. However, converting the emotion of arbitrary speakers, including those unseen during training, with a single model is both more challenging and more attractive in real-world scenarios. To address this problem, we propose SIEVC, a novel speaker-independent emotional voice conversion framework for arbitrary speakers based on disentangled representation learning. The proposed method employs an autoencoder framework to disentangle the emotion information and emotion-independent information of each input utterance into separate representation spaces. To achieve better disentanglement, we incorporate mutual information minimization into the training process. In addition, adversarial training is applied to enhance the quality of the generated audio signals. Finally, speaker-independent EVC for arbitrary speakers can be achieved simply by replacing the emotion representations of the source speech with those of the target. Experimental results demonstrate that the proposed EVC model outperforms the baseline models in both objective and subjective evaluations for seen and unseen speakers.
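The conversion recipe described in the abstract (encode emotion and emotion-independent content separately, then swap in the target emotion representation at inference time) can be illustrated with a minimal sketch. The sketch below is a hypothetical PyTorch-style illustration, not the authors' SIEVC implementation; all module names, feature dimensions, and network choices are assumptions, and the mutual-information and adversarial losses mentioned in the abstract are omitted for brevity.

```python
# Minimal sketch of the disentangle-and-swap idea from the abstract.
# Module names, dimensions, and architectures are illustrative assumptions,
# not the authors' actual SIEVC design.
import torch
import torch.nn as nn

class EmotionEncoder(nn.Module):
    """Maps an acoustic feature sequence to a single utterance-level emotion embedding."""
    def __init__(self, feat_dim=80, emo_dim=16):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, emo_dim, batch_first=True)

    def forward(self, x):                  # x: (B, T, feat_dim)
        _, h = self.rnn(x)                 # h: (1, B, emo_dim)
        return h.squeeze(0)                # (B, emo_dim)

class ContentEncoder(nn.Module):
    """Maps an acoustic feature sequence to frame-level emotion-independent representations."""
    def __init__(self, feat_dim=80, content_dim=64):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, content_dim, batch_first=True)

    def forward(self, x):                  # x: (B, T, feat_dim)
        out, _ = self.rnn(x)               # (B, T, content_dim)
        return out

class Decoder(nn.Module):
    """Reconstructs acoustic features from content frames plus a broadcast emotion embedding."""
    def __init__(self, content_dim=64, emo_dim=16, feat_dim=80):
        super().__init__()
        self.rnn = nn.GRU(content_dim + emo_dim, feat_dim, batch_first=True)

    def forward(self, content, emo):       # content: (B, T, C), emo: (B, E)
        emo = emo.unsqueeze(1).expand(-1, content.size(1), -1)
        out, _ = self.rnn(torch.cat([content, emo], dim=-1))
        return out                         # (B, T, feat_dim)

def convert(content_enc, emotion_enc, decoder, src_feats, ref_feats):
    """Emotion conversion: keep the source content, swap in the reference emotion embedding."""
    content = content_enc(src_feats)
    target_emotion = emotion_enc(ref_feats)
    return decoder(content, target_emotion)

if __name__ == "__main__":
    enc_c, enc_e, dec = ContentEncoder(), EmotionEncoder(), Decoder()
    src = torch.randn(1, 100, 80)          # source utterance features (e.g. mel frames)
    ref = torch.randn(1, 120, 80)          # reference utterance carrying the target emotion
    converted = convert(enc_c, enc_e, dec, src, ref)
    print(converted.shape)                 # torch.Size([1, 100, 80])
```

Under this reading, speaker independence follows from the swap itself: only the emotion representation is replaced, so the emotion-independent content of an arbitrary, possibly unseen speaker passes through the decoder unchanged.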
Pages: 7480-7493 (14 pages)
Related Papers (50 items in total)
  • [31] Disentanglement of Emotional Style and Speaker Identity for Expressive Voice Conversion
    Du, Zongyang
    Sisman, Berrak
    Zhou, Kun
    Li, Haizhou
    INTERSPEECH 2022, 2022, : 2603 - 2607
  • [32] Investigating the contribution of speaker attributes to speaker separability using disentangled speaker representations
    Luu, Chau
    Renals, Steve
    Bell, Peter
    INTERSPEECH 2022, 2022, : 610 - 614
  • [33] Wavelet Analysis of Speaker Dependent and Independent Prosody for Voice Conversion
    Sisman, Berrak
    Li, Haizhou
    19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES, 2018, : 52 - 56
  • [34] Speaker-independent expressive voice synthesis using learning-based hybrid network model
    Vekkot, Susmitha
    Gupta, Deepa
    INTERNATIONAL JOURNAL OF SPEECH TECHNOLOGY, 2020, 23 (03) : 597 - 613
  • [36] Tone correctness improvement in speaker-independent average-voice-based Thai speech synthesis
    Chomphan, Suphattharachal
    Kobayashi, Takao
    SPEECH COMMUNICATION, 2009, 51 (04) : 330 - 343
  • [37] Predictor codebook for speaker-independent speech recognition
    Kawabata, Takeshi
    Systems and Computers in Japan, 1994, 25 (01): 37 - 46
  • [38] Speaker-independent expressive voice synthesis using learning-based hybrid network model
    Susmitha Vekkot
    Deepa Gupta
    International Journal of Speech Technology, 2020, 23 : 597 - 613
  • [39] SPEAKER-INDEPENDENT DIGIT-RECOGNITION SYSTEM
    SAMBUR, MR
    RABINER, LR
    BELL SYSTEM TECHNICAL JOURNAL, 1975, 54 (01): 81 - 102
  • [40] Speaker-independent Speech Inversion for Estimation of Nasalance
    Siriwardena, Yashish M.
    Espy-Wilson, Carol
    Boyce, Suzanne
    Tiede, Mark K.
    Oren, Liran
    INTERSPEECH 2023, 2023, : 4743 - 4747