Speaker-Independent Emotional Voice Conversion via Disentangled Representations

Times Cited: 2
Authors
Chen, Xunquan [1 ]
Xu, Xuexin [2 ]
Chen, Jinhui [3 ]
Zhang, Zhizhong [2 ]
Takiguchi, Tetsuya [1 ]
Hancock, Edwin R. [4 ]
Affiliations
[1] Kobe Univ, Grad Sch Syst Informat, Kobe 6578501, Japan
[2] Xiamen Univ, Xiamen 361005, Peoples R China
[3] Prefectural Univ Hiroshima, Hiroshima 7348558, Japan
[4] Univ York, Dept Comp Sci, York YO10 5GH, England
Keywords
Emotional voice conversion; disentangled representation learning; adversarial learning; mutual information; speaker-independent; AUGMENTATION; NETWORKS; STARGAN; TIME
DOI
10.1109/TMM.2022.3222646
Chinese Library Classification (CLC)
TP [automation technology; computer technology]
Discipline Code
0812
Abstract
Emotional Voice Conversion (EVC) technology aims to transfer the emotional state of speech while keeping the linguistic information and speaker identity unchanged. Prior studies on EVC have been limited to performing the conversion for a specific speaker or a predefined set of speakers seen during training. When encountering arbitrary speakers unseen during training, existing EVC methods have limited conversion capabilities. Converting the emotion of arbitrary speakers, even those unseen during training, within a single model is far more challenging, and far more attractive in real-world scenarios. To address this problem, we propose SIEVC, a novel speaker-independent emotional voice conversion framework for arbitrary speakers based on disentangled representation learning. The proposed method employs an autoencoder framework to disentangle the emotion information and the emotion-independent information of each input speech signal into separate representation spaces. To achieve better disentanglement, we incorporate mutual information minimization into the training process. In addition, adversarial training is applied to enhance the quality of the generated audio signals. Finally, speaker-independent EVC for arbitrary speakers is achieved simply by replacing the emotion representations of the source speech with the target ones. Experimental results demonstrate that the proposed EVC model outperforms the baseline models in both objective and subjective evaluations for seen and unseen speakers.
Pages: 7480-7493
Number of Pages: 14
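
A minimal sketch of the mechanism the abstract describes, for illustration only: an autoencoder that splits speech features into an utterance-level emotion code and an emotion-independent code, with a CLUB-style variational bound standing in for the mutual-information minimization term. Everything here is an assumption; the module names, feature dimensions, and the choice of MI estimator are not specified by the abstract, and this is not the authors' implementation.

```python
# Hypothetical sketch of SIEVC-style disentanglement -- NOT the authors' code.
# Dimensions, architectures, and the CLUB MI bound are illustrative assumptions.
import torch
import torch.nn as nn

FEAT_DIM, EMO_DIM, CONT_DIM = 80, 16, 128   # assumed feature/code sizes


class EmotionEncoder(nn.Module):
    """Maps a feature sequence to a single utterance-level emotion code."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(FEAT_DIM, EMO_DIM, batch_first=True)

    def forward(self, x):                     # x: (B, T, FEAT_DIM)
        _, h = self.rnn(x)
        return h[-1]                          # (B, EMO_DIM)


class ContentEncoder(nn.Module):
    """Extracts emotion-independent (linguistic + speaker) information."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(FEAT_DIM, CONT_DIM, batch_first=True)

    def forward(self, x):
        out, _ = self.rnn(x)
        return out                            # (B, T, CONT_DIM)


class Decoder(nn.Module):
    """Rebuilds features from content plus a (possibly swapped) emotion code."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(CONT_DIM + EMO_DIM, 256, batch_first=True)
        self.proj = nn.Linear(256, FEAT_DIM)

    def forward(self, content, emo):
        emo = emo.unsqueeze(1).expand(-1, content.size(1), -1)
        out, _ = self.rnn(torch.cat([content, emo], dim=-1))
        return self.proj(out)


class CLUBEstimator(nn.Module):
    """Simplified CLUB-style upper bound on I(content; emotion); minimizing it
    during training pushes the two codes apart. This is one common choice of
    MI estimator, assumed here -- the paper may use a different one."""
    def __init__(self):
        super().__init__()
        self.mu = nn.Linear(CONT_DIM, EMO_DIM)  # mean of q(emotion | content)

    def forward(self, c, e):                  # c: (B, CONT_DIM), e: (B, EMO_DIM)
        mu = self.mu(c)
        pos = -((e - mu) ** 2).mean()                              # matched pairs
        neg = -((e.unsqueeze(0) - mu.unsqueeze(1)) ** 2).mean()    # all pairs
        return pos - neg                      # scalar MI upper-bound estimate


# Conversion at inference time, as the abstract describes: keep the content
# code of the source and swap in the emotion code of a target-emotion utterance.
enc_e, enc_c, dec = EmotionEncoder(), ContentEncoder(), Decoder()
src = torch.randn(1, 100, FEAT_DIM)           # source-speaker utterance features
ref = torch.randn(1, 120, FEAT_DIM)           # utterance carrying target emotion
converted = dec(enc_c(src), enc_e(ref))       # (1, 100, FEAT_DIM)
```

During training, the reconstruction objective would be combined with the CLUB term (computed on time-pooled content codes) and a GAN discriminator loss on the decoder output, matching the mutual-information and adversarial components the abstract mentions; a vocoder would then turn the converted features back into a waveform.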