Semantic dependency and local convolution for enhancing naturalness and tone in text-to-speech synthesis

Cited by: 0
Authors
Jiang, Chenglong [1]
Gao, Ying [1]
Ng, Wing W. Y. [1]
Zhou, Jiyong [1]
Zhong, Jinghui [1]
Zhen, Hongzhong [1]
Hu, Xiping [2]
Affiliations
[1] South China Univ Technol, Guangzhou 511442, Peoples R China
[2] Shenzhen MSU BIT Univ, Shenzhen 518172, Peoples R China
Keywords
Semantic dependency; Local convolution; Tone; Naturalness; Text-to-speech synthesis
DOI
10.1016/j.neucom.2024.128430
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Self-attention-based networks have become increasingly popular due to their exceptional performance in parallel training and global context modeling. However, they may fall short of capturing local dependencies, particularly in datasets with strong local correlations. To address this challenge, we propose a novel method that utilizes semantic dependency to extract linguistic information from the original text. The semantic relationship between nodes serves as prior knowledge to refine the self-attention distribution. Additionally, to better fuse local contextual information, we introduce a one-dimensional convolutional neural network to generate the query and value matrices in the self-attention mechanism, taking advantage of the strong correlation between input characters. We apply this variant of the self-attention network to text-to-speech tasks and propose a non-autoregressive neural text-to-speech model. To enhance pronunciation accuracy, we separate tones from phonemes as independent features in model training. Experimental results show that our model yields good performance in speech synthesis. Specifically, the proposed method significantly improves the handling of pauses, stress, and intonation in speech.
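The two attention changes described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not say how the semantic prior refines the attention distribution, so the additive bias `alpha * prior[i][j]` on the logits is an assumption, as are the fixed smoothing kernel shared by the query and value paths (a real model would learn separate multi-channel convolutions) and all function and parameter names.

```python
import math


def conv1d_same(x, kernel):
    """Depthwise 1-D convolution over time with zero padding (output length == input length).

    x: list of T feature vectors, each of length D.
    kernel: list of K scalar weights (K odd), shared by every feature dimension.
    """
    T, D, half = len(x), len(x[0]), len(kernel) // 2
    out = []
    for t in range(T):
        vec = [0.0] * D
        for k, w in enumerate(kernel):
            s = t + k - half  # neighbouring time step covered by this tap
            if 0 <= s < T:
                for d in range(D):
                    vec[d] += w * x[s][d]
        out.append(vec)
    return out


def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_with_prior(x, kernel, prior, alpha=1.0):
    """Single-head self-attention with convolutional Q/V and a dependency prior.

    Q and V are produced by a local convolution over the input; K is the raw input.
    prior[i][j] is 1.0 when tokens i and j are linked in a (hypothetical)
    semantic dependency graph, else 0.0. Assumption: the prior enters as an
    additive bias on the attention logits, scaled by alpha.
    """
    T, D = len(x), len(x[0])
    q = conv1d_same(x, kernel)
    v = conv1d_same(x, kernel)
    weights = []
    for i in range(T):
        logits = [
            sum(q[i][d] * x[j][d] for d in range(D)) / math.sqrt(D)
            + alpha * prior[i][j]
            for j in range(T)
        ]
        weights.append(softmax(logits))
    out = [
        [sum(weights[i][j] * v[j][d] for j in range(T)) for d in range(D)]
        for i in range(T)
    ]
    return out, weights


# Toy input: 4 tokens with 2-dimensional embeddings.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
kernel = [0.25, 0.5, 0.25]
# Hypothetical dependency links (0-2, 1-2, 2-3) plus self-links.
prior = [
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [1.0, 1.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
]
out, weights = attention_with_prior(x, kernel, prior, alpha=2.0)
```

Raising `alpha` shifts each row of `weights` toward dependency-linked positions, while `alpha=0.0` reduces the sketch to plain attention with convolutional queries and values.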
Pages: 11