On the localness modeling for the self-attention based end-to-end speech synthesis

Cited: 0
Authors
Yang, Shan [1 ]
Lu, Heng [2 ]
Kang, Shiyin [2 ]
Xue, Liumeng [1 ]
Xiao, Jinba [1 ]
Su, Dan [2 ]
Xie, Lei [1 ]
Yu, Dong [2 ]
Affiliations
[1] Audio, Speech and Language Processing Group (ASLP@NPU), National Engineering Laboratory for Integrated Aero-Space-Ground-Ocean Big Data Application Technology, School of Computer Science, Northwestern Polytechnical University, Xi'an, China
[2] Tencent AI Lab, China
Keywords
Gaussian distribution; Recurrent neural networks
DOI
Not available
Abstract
Attention-based end-to-end speech synthesis achieves better performance in both prosody and quality than the conventional front-end–back-end structure. However, training such an end-to-end framework is usually time-consuming because of the use of recurrent neural networks. To enable parallel computation and long-range dependency modeling, a solely self-attention-based framework named Transformer was recently proposed within the end-to-end family. However, it lacks position information in sequential modeling, so extra position representations are crucial for good performance. Moreover, the weighted-sum form of self-attention is computed over the whole input sequence when building latent representations, which may disperse attention across the entire sequence rather than focusing on the more important neighboring input states, resulting in generation errors. In this paper, we introduce two localness modeling methods that enhance the self-attention-based representation for speech synthesis; they retain self-attention's abilities of parallel computation and global-range dependency modeling while improving generation stability. We systematically analyze the solely self-attention-based end-to-end speech synthesis framework and unveil the importance of local context. We then add the proposed relative-position-aware method to enhance local edges and experiment with different architectures to examine the effectiveness of localness modeling. To achieve a query-specific window and discard the hyper-parameter of the relative-position-aware approach, we further introduce a Gaussian-based bias to enhance localness. Experimental results indicate that both proposed localness-enhancing methods improve the performance of the self-attention model, especially when applied to the encoder. Furthermore, the query-specific window of the Gaussian-bias approach is more robust than the fixed relative edges. © 2020 Elsevier Ltd
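To illustrate the Gaussian-bias idea described in the abstract, the sketch below adds a Gaussian penalty to the attention logits so that each query concentrates on its neighbors while still attending over the whole sequence. This is a minimal NumPy sketch, not the authors' implementation: the paper predicts a query-specific center and window, whereas here the center is fixed at the query position and `sigma` is a shared scalar chosen for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gaussian_localness_attention(Q, K, V, sigma=2.0):
    """Scaled dot-product self-attention with a Gaussian localness bias.

    Each query position i receives a bias of -(j - i)^2 / (2 * sigma^2)
    on its logit toward key position j, so attention mass concentrates
    on neighboring states without masking out distant ones.
    """
    T, d = Q.shape
    logits = Q @ K.T / np.sqrt(d)              # (T, T) scaled dot products
    pos = np.arange(T)
    dist = pos[None, :] - pos[:, None]         # signed distance j - i
    bias = -(dist ** 2) / (2.0 * sigma ** 2)   # Gaussian-shaped log penalty
    weights = softmax(logits + bias)           # rows still sum to 1
    return weights @ V, weights

# Toy usage: 6 frames, 4-dimensional states.
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))
out, w = gaussian_localness_attention(X, X, X, sigma=1.5)
```

A smaller `sigma` yields a narrower effective window; as `sigma` grows, the bias vanishes and the layer reduces to plain self-attention, which is why the paper's query-specific window can trade off local focus against global context per position.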
Pages: 121-130
Related papers
50 records
  • [41] ATTENTION-BASED END-TO-END SPEECH RECOGNITION ON VOICE SEARCH
    Shan, Changhao
    Zhang, Junbo
    Wang, Yujun
    Xie, Lei
    2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 4764 - 4768
  • [42] End-to-End Binaural Speech Synthesis
    Huang, Wen-Chin
    Markovic, Dejan
    Gebru, Israel D.
    Menon, Anjali
    Richard, Alexander
    INTERSPEECH 2022, 2022, : 1218 - 1222
  • [43] Hash Self-Attention End-to-End Network for Sketch-Based 3D Shape Retrieval
    Zhao X.
    Pan X.
    Liu F.
    Zhang S.
    Jisuanji Fuzhu Sheji Yu Tuxingxue Xuebao/Journal of Computer-Aided Design and Computer Graphics, 2021, 33 (05): : 798 - 805
  • [44] IMPROVING UNSUPERVISED STYLE TRANSFER IN END-TO-END SPEECH SYNTHESIS WITH END-TO-END SPEECH RECOGNITION
    Liu, Da-Rong
    Yang, Chi-Yu
    Wu, Szu-Lin
    Lee, Hung-Yi
    2018 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2018), 2018, : 640 - 647
  • [45] An End-to-End Blind Image Quality Assessment Method Using a Recurrent Network and Self-Attention
    Zhou, Mingliang
    Lan, Xuting
    Wei, Xuekai
    Liao, Xingran
    Mao, Qin
    Li, Yutong
    Wu, Chao
    Xiang, Tao
    Fang, Bin
    IEEE TRANSACTIONS ON BROADCASTING, 2023, 69 (02) : 369 - 377
  • [46] Emphatic Speech Synthesis and Control Based on Characteristic Transferring in End-to-End Speech Synthesis
    Wang, Mu
    Wu, Zhiyong
    Wu, Xixin
    Meng, Helen
    Kang, Shiyin
    Jia, Jia
    Cai, Lianhong
    2018 FIRST ASIAN CONFERENCE ON AFFECTIVE COMPUTING AND INTELLIGENT INTERACTION (ACII ASIA), 2018,
  • [47] DensSiam: End-to-End Densely-Siamese Network with Self-Attention Model for Object Tracking
    Abdelpakey, Mohamed H.
    Shehata, Mohamed S.
    Mohamed, Mostafa M.
    ADVANCES IN VISUAL COMPUTING, ISVC 2018, 2018, 11241 : 463 - 473
  • [48] End-to-End Speech Synthesis for the Serbian Language Based on Tacotron
    Nosek, Tijana
    Suzic, Sinisa
    Secujski, Milan
    Stanojevic, Vuk
    Pekar, Darko
    Delic, Vlado
    SPEECH AND COMPUTER, SPECOM 2024, PT I, 2025, 15299 : 219 - 229
  • [49] Attention-Based End-to-End Named Entity Recognition from Speech
    Porjazovski, Dejan
    Leinonen, Juho
    Kurimo, Mikko
    TEXT, SPEECH, AND DIALOGUE, TSD 2021, 2021, 12848 : 469 - 480
  • [50] CHARACTER-AWARE ATTENTION-BASED END-TO-END SPEECH RECOGNITION
    Meng, Zhong
    Gaur, Yashesh
    Li, Jinyu
    Gong, Yifan
    2019 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU 2019), 2019, : 949 - 955