Adversarial Speaker Disentanglement Using Unannotated External Data for Self-supervised Representation-based Voice Conversion

Cited by: 1
Authors
Zhao, Xintao [1]
Wang, Shuai [2]
Chao, Yang [2]
Wu, Zhiyong [3]
Meng, Helen [4]
Affiliations
[1] Tsinghua Univ, Shenzhen Int Grad Sch, Shenzhen, Peoples R China
[2] Tencent Inc, Lightspeed & Quantum Studios, Shenzhen, Peoples R China
[3] Tsinghua Univ, Shenzhen Int Grad Sch, Shenzhen, Peoples R China
[4] Chinese Univ Hong Kong, Dept Syst Engn & Engn Management, Hong Kong, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
voice conversion; self-supervised learning; adversarial training
DOI
10.1109/ICME55011.2023.00291
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Recognition-synthesis-based methods have become popular for voice conversion (VC). By introducing linguistic features with good disentanglement properties, extracted from an automatic speech recognition (ASR) model, VC performance has improved considerably. Recently, self-supervised learning (SSL) methods trained on large-scale unannotated speech corpora have been applied to downstream tasks that focus on content information, which makes them suitable for VC. However, the large amount of speaker information in SSL representations significantly degrades the timbre similarity and quality of the converted speech. To address this problem, we propose a high-similarity any-to-one voice conversion method that takes SSL representations as input. We incorporate adversarial training mechanisms into the synthesis module using external unannotated corpora. Two auxiliary discriminators are trained: one to distinguish whether a sequence of mel-spectrograms has been converted by the acoustic model, and one to distinguish whether a sequence of content embeddings contains speaker information from the external corpora. Experimental results show that the proposed method achieves comparable similarity and higher naturalness than the supervised method, which requires a large amount of annotated corpora for training, and that it can also be applied to improve similarity for VC methods that take other SSL representations as input.
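The two-discriminator setup described in the abstract can be illustrated with a minimal sketch. The abstract does not give the loss formulation, so everything here is an assumption: the binary cross-entropy objective, the label convention (1 for target-speaker data, 0 for converted or external data), the function names, and the weights `w_mel` and `w_content` are all hypothetical and are not taken from the paper.

```python
import math

def bce(prob, label):
    """Binary cross-entropy for a single predicted probability."""
    eps = 1e-7
    p = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

def mean(values):
    return sum(values) / len(values)

def discriminator_loss(probs_target, probs_other):
    """Loss for one discriminator (assumed form).

    probs_target: outputs on target-speaker inputs (assumed label 1).
    probs_other:  outputs on converted mels or external-corpus
                  content embeddings (assumed label 0).
    """
    return (mean([bce(p, 1) for p in probs_target])
            + mean([bce(p, 0) for p in probs_other]))

def fool_loss(probs_other):
    """Adversarial term for the synthesis side: push the discriminator
    to score converted/external inputs as if they were target data."""
    return mean([bce(p, 1) for p in probs_other])

def synthesis_loss(recon, probs_converted_mel, probs_external_content,
                   w_mel=1.0, w_content=1.0):
    """Reconstruction loss plus two weighted adversarial terms.
    The weights are hypothetical hyperparameters."""
    return (recon
            + w_mel * fool_loss(probs_converted_mel)
            + w_content * fool_loss(probs_external_content))

# Example: a discriminator that separates the two classes well has a
# lower loss than one that outputs 0.5 everywhere.
confident = discriminator_loss([0.9, 0.8], [0.1, 0.2])
uncertain = discriminator_loss([0.5, 0.5], [0.5, 0.5])
print(confident < uncertain)
```

In a real system the probabilities would come from neural discriminators over mel-spectrogram sequences and content-embedding sequences, and the discriminators and the synthesis module would be updated in alternation.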
Pages: 1691-1696
Page count: 6