Improving Performance of Seen and Unseen Speech Style Transfer in End-to-end Neural TTS

Cited by: 2
Authors
An, Xiaochun [1]
Soong, Frank K. [2]
Xie, Lei [1]
Affiliations
[1] Northwestern Polytech Univ, Sch Comp Sci, Audio Speech & Language Proc Grp ASLP NPU, Xian, Peoples R China
[2] Microsoft China, Beijing, Peoples R China
Source
INTERSPEECH 2021
Keywords
neural TTS; style transfer; style distortion; cycle consistency; disjoint datasets
DOI
10.21437/Interspeech.2021-1407
Chinese Library Classification number
R36 [Pathology]; R76 [Otorhinolaryngology]
Subject classification code
100104; 100213
Abstract
End-to-end neural TTS training has shown improved performance in speech style transfer. However, the improvement is still limited by the training data available for both target styles and speakers. Style transfer performance is inadequate when the trained TTS tries to transfer speech to a target style from a new speaker with an unknown, arbitrary style. In this paper, we propose a new approach to style transfer for both seen and unseen styles using disjoint, multi-style datasets, i.e., datasets of different styles in which each style is recorded by one speaker in multiple utterances. To encode the style information, we adopt an inverse autoregressive flow (IAF) structure to improve the variational inference. The whole system is optimized to minimize a weighted sum of four different loss functions: 1) a reconstruction loss to measure the distortions in both source and target reconstructions; 2) an adversarial loss to "fool" a well-trained discriminator; 3) a style distortion loss to measure the expected style loss after the transfer; 4) a cycle consistency loss to preserve the speaker identity of the source after the transfer. Experiments demonstrate, both objectively and subjectively, the effectiveness of the proposed approach for seen and unseen style transfer tasks. The performance of the new approach is better and more robust than that of four baseline systems of the prior art.
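As a rough illustration of the "weighted sum of four different loss functions" mentioned in the abstract, the overall objective can be written as below. The symbols and the weights $\lambda_i$ are notation assumed here for illustration only; the abstract gives neither the weight values nor the exact form of each term.

```latex
% Sketch only: symbol names and the weights \lambda_i are assumptions,
% not taken from the abstract.
\mathcal{L}_{\text{total}}
  = \lambda_{1}\,\mathcal{L}_{\text{rec}}
  + \lambda_{2}\,\mathcal{L}_{\text{adv}}
  + \lambda_{3}\,\mathcal{L}_{\text{style}}
  + \lambda_{4}\,\mathcal{L}_{\text{cyc}},
\qquad \lambda_{i} \ge 0
```

Here $\mathcal{L}_{\text{rec}}$ stands for the reconstruction loss, $\mathcal{L}_{\text{adv}}$ for the adversarial loss, $\mathcal{L}_{\text{style}}$ for the style distortion loss, and $\mathcal{L}_{\text{cyc}}$ for the cycle consistency loss.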
Pages: 4688-4692
Number of pages: 5
Related papers
50 in total
  • [1] IMPROVING UNSUPERVISED STYLE TRANSFER IN END-TO-END SPEECH SYNTHESIS WITH END-TO-END SPEECH RECOGNITION
    Liu, Da-Rong
    Yang, Chi-Yu
    Wu, Szu-Lin
    Lee, Hung-Yi
    2018 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2018), 2018, : 640 - 647
  • [2] Cycle consistent network for end-to-end style transfer TTS training
    Xue, Liumeng
    Pan, Shifeng
    He, Lei
    Xie, Lei
    Soong, Frank K.
    NEURAL NETWORKS, 2021, 140 : 223 - 236
  • [3] Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis
    Wang, Yuxuan
    Stanton, Daisy
    Zhang, Yu
    Skerry-Ryan, R. J.
    Battenberg, Eric
    Shor, Joel
    Xiao, Ying
    Ren, Fei
    Jia, Ye
    Saurous, Rif A.
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 80, 2018, 80
  • [4] LEARNING LATENT REPRESENTATIONS FOR STYLE CONTROL AND TRANSFER IN END-TO-END SPEECH SYNTHESIS
    Zhang, Ya-Jie
    Pan, Shifeng
    He, Lei
    Ling, Zhen-Hua
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 6945 - 6949
  • [5] Improving transfer of expressivity for end-to-end multispeaker text-to-speech synthesis
    Kulkarni, Ajinkya
    Colotte, Vincent
    Jouvet, Denis
    29TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO 2021), 2021, : 31 - 35
  • [6] Improving Cross-Lingual Transfer Learning for End-to-End Speech Recognition with Speech Translation
    Wang, Changhan
    Pino, Juan
    Gu, Jiatao
    INTERSPEECH 2020, 2020, : 4731 - 4735
  • [7] Prosody-TTS: An End-to-End Speech Synthesis System with Prosody Control
    Pamisetty, Giridhar
    Murty, K. Sri Rama
    CIRCUITS SYSTEMS AND SIGNAL PROCESSING, 2023, 42 (01) : 361 - 384
  • [8] IMPROVING END-TO-END SPEECH SYNTHESIS WITH LOCAL RECURRENT NEURAL NETWORK ENHANCED TRANSFORMER
    Zheng, Yibin
    Li, Xinhui
    Xie, Fenglong
    Lu, Li
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 6734 - 6738
  • [9] IMPROVING END-TO-END SPEECH RECOGNITION WITH POLICY LEARNING
    Zhou, Yingbo
    Xiong, Caiming
    Socher, Richard
    2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 5819 - 5823