Semantic dependency and local convolution for enhancing naturalness and tone in text-to-speech synthesis

Cited by: 0

Authors:

Jiang, Chenglong [1]

Gao, Ying [1]

Ng, Wing W. Y. [1]

Zhou, Jiyong [1]

Zhong, Jinghui [1]

Zhen, Hongzhong [1]

Hu, Xiping [2]

Affiliations:

[1] South China Univ Technol, Guangzhou 511442, Peoples R China

[2] Shenzhen MSU BIT Univ, Shenzhen 518172, Peoples R China

Source:

NEUROCOMPUTING | 2024 / Volume 608

Keywords:

Semantic dependency; Local convolution; Tone; Naturalness; Text-to-speech synthesis;

DOI:

10.1016/j.neucom.2024.128430

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Self-attention-based networks have become increasingly popular due to their exceptional performance in parallel training and global context modeling. However, they may fall short of capturing local dependencies, particularly in datasets with strong local correlations. To address this challenge, we propose a novel method that utilizes semantic dependency to extract linguistic information from the original text. The semantic relationship between nodes serves as prior knowledge to refine the self-attention distribution. Additionally, to better fuse local contextual information, we introduce a one-dimensional convolutional neural network to generate the query and value matrices in the self-attention mechanism, taking advantage of the strong correlation between input characters. We apply this variant of the self-attention network to text-to-speech tasks and propose a non-autoregressive neural text-to-speech model. To enhance pronunciation accuracy, we separate tones from phonemes as independent features in model training. Experimental results show that our model yields good performance in speech synthesis. Specifically, the proposed method significantly improves the handling of pauses, stress, and intonation in speech.
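The two ideas in the abstract, a dependency prior that refines the attention distribution and a one-dimensional convolution that produces the query and value matrices, can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how the prior enters the attention computation, so the additive bias on the attention scores, the fixed smoothing kernel shared across channels, the use of the raw input as the key, and all names (`conv1d_same`, `attention_with_prior`, `strength`) are assumptions made for illustration only.

```python
import math

def conv1d_same(x, kernel):
    # x: T x d list of lists. kernel: odd-length sequence of scalars applied
    # along the time axis and shared by all d channels (a simplification of a
    # learned Conv1d layer; an assumption, not the paper's layer).
    T, d, half = len(x), len(x[0]), len(kernel) // 2
    out = []
    for t in range(T):
        row = [0.0] * d
        for j, w in enumerate(kernel):
            s = t + j - half          # neighbouring time step
            if 0 <= s < T:            # zero padding outside the sequence
                for c in range(d):
                    row[c] += w * x[s][c]
        out.append(row)
    return out

def softmax(v):
    m = max(v)                        # subtract the max for numerical stability
    e = [math.exp(a - m) for a in v]
    z = sum(e)
    return [a / z for a in e]

def attention_with_prior(x, prior, kernel=(0.25, 0.5, 0.25), strength=1.0):
    # prior[i][j] > 0 marks a (hypothetical) semantic dependency between
    # tokens i and j; it is added to the scaled dot-product score (assumption).
    T, d = len(x), len(x[0])
    q = conv1d_same(x, kernel)        # query built from local context
    v = conv1d_same(x, kernel)        # value built from local context
    k = x                             # key left as the raw input (assumption)
    weights, out = [], []
    for i in range(T):
        scores = [
            sum(q[i][c] * k[j][c] for c in range(d)) / math.sqrt(d)
            + strength * prior[i][j]
            for j in range(T)
        ]
        a = softmax(scores)
        weights.append(a)
        out.append([sum(a[j] * v[j][c] for j in range(T)) for c in range(d)])
    return out, weights

# Toy input: 4 tokens with 2-dimensional embeddings.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
no_prior = [[0.0] * 4 for _ in range(4)]
prior = [[0.0] * 4 for _ in range(4)]
prior[0][2] = prior[2][0] = 1.0       # hypothetical dependency edge 0 <-> 2

out, weights = attention_with_prior(x, prior)
_, base_weights = attention_with_prior(x, no_prior)
```

In this sketch, adding the dependency edge raises the attention weight that token 0 places on token 2 relative to the run without a prior, while each row of `weights` still sums to one.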


Pages: 11
