Manipulating Voice Attributes by Adversarial Learning of Structured Disentangled Representations

Cited by: 2
Authors
Benaroya, Laurent [1]
Obin, Nicolas [1]
Roebel, Axel [1]
Affiliations
[1] Sorbonne Univ, Anal Synth Team, STMS, IRCAM, CNRS, French Minist Culture, F-75004 Paris, France
Keywords
voice conversion; attribute manipulation; representation learning; information disentanglement; adversarial learning; cross-entropy; conversion
DOI
10.3390/e25020375
Chinese Library Classification
O4 [Physics]
Discipline classification code
0702
Abstract
Voice conversion (VC) consists of digitally altering the voice of an individual to manipulate part of its content, primarily its identity, while leaving the rest unchanged. Research in neural VC has achieved considerable breakthroughs, with the capacity to falsify a voice identity from a small amount of data with highly realistic rendering. This paper goes beyond voice identity manipulation and presents an original neural architecture that allows the manipulation of voice attributes (e.g., gender and age). The proposed architecture is inspired by the fader network and transfers the same ideas to voice manipulation. The information conveyed by the speech signal is disentangled into interpretative voice attributes by minimizing an adversarial loss that makes the encoded information mutually independent, while preserving the capacity to generate a speech signal from the disentangled codes. During inference for voice conversion, the disentangled voice attributes can be manipulated and the speech signal generated accordingly. For experimental evaluation, the proposed method is applied to the task of voice gender conversion using the freely available VCTK dataset. Quantitative measurements of mutual information between the variables of speaker identity and speaker gender show that the proposed architecture can learn a gender-independent representation of speakers. Additional measurements of speaker recognition indicate that speaker identity can be recognized accurately from the gender-independent representation. Finally, a subjective experiment conducted on the task of voice gender manipulation shows that the proposed architecture can convert voice gender with very high efficiency and good naturalness.
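The adversarial idea the abstract borrows from the fader network can be illustrated with a toy loss computation. This is a minimal sketch of the generic fader-network objective for a single binary attribute, not the authors' implementation: the function names, the mean-squared reconstruction term, and the weight `lam` are assumptions made for illustration. A discriminator is trained with cross-entropy to predict the attribute from the latent code, while the encoder-decoder is trained to reconstruct the input and to push the discriminator toward the opposite label.

```python
import math

def binary_cross_entropy(p, y, eps=1e-12):
    """Cross-entropy between a predicted probability p and a binary label y."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def discriminator_loss(p_attr, y):
    """Discriminator objective: predict the true attribute y from the latent code.

    p_attr is the discriminator's predicted probability that the attribute is 1.
    """
    return binary_cross_entropy(p_attr, y)

def autoencoder_loss(x, x_hat, p_attr, y, lam=0.5):
    """Encoder-decoder objective (hypothetical weighting lam).

    Reconstruction error plus an adversarial term that is small when the
    discriminator assigns high probability to the wrong label 1 - y, i.e.
    when the latent code no longer reveals the attribute.
    """
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    adversarial = binary_cross_entropy(p_attr, 1 - y)
    return recon + lam * adversarial

# Toy usage: a perfectly reconstructed frame whose true attribute label is 1.
frame = [0.2, -0.1, 0.4]
fooled = autoencoder_loss(frame, frame, p_attr=0.1, y=1)      # discriminator is wrong
not_fooled = autoencoder_loss(frame, frame, p_attr=0.9, y=1)  # discriminator is right
```

In this toy setting the encoder-decoder loss is lower when the discriminator is fooled, which is the pressure that drives attribute information out of the latent code. In a real system the two losses would be minimized alternately over network parameters.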
Pages: 16