Decoupling Speaker-Independent Emotions for Voice Conversion via Source-Filter Networks

Cited by: 5
Authors
Luo, Zhaojie [1 ]
Lin, Shoufeng [2 ]
Liu, Rui [3 ]
Baba, Jun [4 ]
Yoshikawa, Yuichiro [5 ]
Ishiguro, Hiroshi [5 ]
Affiliations
[1] Osaka Univ, Inst Sci & Ind Res, Osaka 5670047, Japan
[2] Curtin Univ, Sch Elect Engn Comp & Math Sci, Bentley, WA 6102, Australia
[3] Inner Mongolia Univ, Sch Comp Sci, Hohhot 010021, Inner Mongolia, Peoples R China
[4] CyberAgent Inc, Shibuya Ku, Tokyo 1506121, Japan
[5] Osaka Univ, Grad Sch Engineer Sci, Toyonaka, Osaka 5608531, Japan
Keywords
Timbre; Speech processing; Acoustics; Training; Codes; Rhythm; Larynx; Auto-encoder; emotional voice conversion; prosody; source-filter networks; valence arousal; SPEECH; PITCH; VALENCE; AROUSAL
DOI
10.1109/TASLP.2022.3190715
Chinese Library Classification (CLC)
O42 [Acoustics];
Subject Classification Codes
070206; 082403
Abstract
Emotional voice conversion (VC) aims to convert a neutral voice to an emotional one while retaining the linguistic information and speaker identity. We note that decoupling emotional features from other speech information (such as content and speaker identity) is the key to achieving promising performance. Recent attempts at speech representation decoupling that work on neutral speech do not perform well on emotional speech, because acoustic properties are more deeply entangled in the latter. To address this problem, we propose a novel Source-Filter-based Emotional VC model (SFEVC) that properly filters speaker-independent emotion cues from both the timbre and pitch features. Our SFEVC model consists of multi-channel encoders, emotion separate encoders, pre-trained speaker-dependent encoders, and the corresponding decoder; all encoder modules adopt a designed information-bottleneck auto-encoder. Additionally, to further improve conversion quality across emotions, we propose a novel training strategy based on the 2D Valence-Arousal (VA) space. Experimental results show that the proposed SFEVC with the VA training strategy outperforms all baselines and achieves state-of-the-art performance in speaker-independent emotional VC with nonparallel data.
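To make the architecture described in the abstract more concrete, the following is a minimal PyTorch sketch of the source-filter decoupling idea: separate information-bottleneck encoders for the pitch (source) and timbre (filter) channels, a per-channel emotion encoder, a pre-trained speaker embedding, and a decoder that reassembles the frame. All module names, dimensions, and the concatenation scheme here are illustrative assumptions, not the authors' implementation.

```python
# A minimal, illustrative sketch of the information-bottleneck
# auto-encoder idea from the abstract. NOT the authors' code:
# module names, dimensions, and the way the source (pitch) and
# filter (timbre) channels are combined are assumptions.
import torch
import torch.nn as nn

class BottleneckEncoder(nn.Module):
    """Squeezes the input through a narrow bottleneck so that only
    the information the decoder needs can survive."""
    def __init__(self, in_dim: int, bottleneck_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, bottleneck_dim),  # narrow bottleneck
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class SourceFilterEVC(nn.Module):
    """Toy source-filter emotional VC skeleton: bottleneck encoders
    for pitch (source) and timbre (filter), an emotion encoder per
    channel, a pre-trained speaker embedding passed in as input, and
    a decoder that reassembles the spectral frame."""
    def __init__(self, pitch_dim=1, timbre_dim=80, spk_dim=64, z=8):
        super().__init__()
        self.pitch_enc = BottleneckEncoder(pitch_dim, z)
        self.timbre_enc = BottleneckEncoder(timbre_dim, z)
        self.emo_pitch_enc = BottleneckEncoder(pitch_dim, z)
        self.emo_timbre_enc = BottleneckEncoder(timbre_dim, z)
        self.decoder = nn.Sequential(
            nn.Linear(4 * z + spk_dim, 256), nn.ReLU(),
            nn.Linear(256, timbre_dim),
        )

    def forward(self, pitch, timbre, spk_emb):
        # Concatenate content codes, emotion codes, and the frozen
        # speaker-dependent embedding, then decode a frame.
        codes = torch.cat([
            self.pitch_enc(pitch), self.timbre_enc(timbre),
            self.emo_pitch_enc(pitch), self.emo_timbre_enc(timbre),
            spk_emb,
        ], dim=-1)
        return self.decoder(codes)

# Per-frame usage on random data (batch of 4 frames):
model = SourceFilterEVC()
out = model(torch.randn(4, 1), torch.randn(4, 80), torch.randn(4, 64))
print(out.shape)  # torch.Size([4, 80])
```

The narrow bottleneck is the decoupling mechanism: with too few dimensions to copy everything, each encoder is pressured to keep only its designated factor (content, emotion, or speaker), which is the intuition the abstract attributes to the information-bottleneck auto-encoder design.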
Pages: 11-24
Page count: 14
Related Papers (50 in total)
  • [1] Speaker-Independent Emotional Voice Conversion via Disentangled Representations
    Chen, Xunquan
    Xu, Xuexin
    Chen, Jinhui
    Zhang, Zhizhong
    Takiguchi, Tetsuya
    Hancock, Edwin R.
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25: 7480-7493
  • [2] VOICE CONVERSION USING DEEP NEURAL NETWORKS WITH SPEAKER-INDEPENDENT PRE-TRAINING
    Mohammadi, Seyed Hamidreza
    Kain, Alexander
    2014 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2014), 2014: 19-23
  • [3] VOICE CONVERSION IN TIME-INVARIANT SPEAKER-INDEPENDENT SPACE
    Nakashika, Toru
    Takiguchi, Tetsuya
    Ariki, Yasuo
    2014 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2014
  • [4] Siamese decoupling network for speaker-independent lipreading
    Lu, Longbin
    Xu, Xuebin
    Fu, Jun
    JOURNAL OF ELECTRONIC IMAGING, 2022, 31 (03)
  • [5] Converting Anyone's Emotion: Towards Speaker-Independent Emotional Voice Conversion
    Zhou, Kun
    Sisman, Berrak
    Zhang, Mingyang
    Li, Haizhou
    INTERSPEECH 2020, 2020: 3416-3420
  • [6] Voice-to-phoneme conversion algorithms for speaker-independent voice-tag applications in embedded platforms
    Cheng, YM
    Ma, CX
    Melnar, L
    2005 IEEE WORKSHOP ON AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING (ASRU), 2005: 403-408
  • [7] Speaker-independent HMM-based Voice Conversion Using Quantized Fundamental Frequency
    Nose, Takashi
    Kobayashi, Takao
    11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 3 AND 4, 2010: 1724-1727
  • [8] Speaker-independent HMM-based voice conversion using adaptive quantization of the fundamental frequency
    Nose, Takashi
    Kobayashi, Takao
    SPEECH COMMUNICATION, 2011, 53 (07): 973-985
  • [9] A pitch synchronous approach to design voice conversion system using source-filter correlation
    Laskar, Rabul
    Banerjee, Kalyan
    Talukdar, Fazal
    Rao, K.
    INTERNATIONAL JOURNAL OF SPEECH TECHNOLOGY, 2012, 15 (03): 419-431
  • [10] A Novel Source-Filter Stochastic Model for Voice Production
    Cataldo, E.
    Monteiro, L.
    Soize, C.
    JOURNAL OF VOICE, 2023, 37 (01): 1-8