Enhancing Word-Level Semantic Representation via Dependency Structure for Expressive Text-to-Speech Synthesis

Cited by: 2

Authors:

Zhou, Yixuan [1,4]

Song, Changhe [1]

Li, Jingbei [1]

Wu, Zhiyong [1,2]

Bian, Yanyao [3]

Su, Dan [3]

Meng, Helen [2]

Affiliations:

[1] Tsinghua Univ, Shenzhen Int Grad Sch, Shenzhen, Peoples R China

[2] Chinese Univ Hong Kong, Hong Kong, Peoples R China

[3] Tencent, Tencent AI Lab, Shenzhen, Peoples R China

[4] Tencent, Shenzhen, Peoples R China

Source:

INTERSPEECH 2022 | 2022

Funding:

National Natural Science Foundation of China;

Keywords:

expressive speech synthesis; semantic representation enhancing; dependency parsing; graph neural network;

DOI:

10.21437/Interspeech.2022-10061

Chinese Library Classification:

O42 [Acoustics];

Discipline classification codes:

070206 ; 082403 ;

Abstract:

Exploiting rich linguistic information in raw text is crucial for expressive text-to-speech (TTS). With the development of large-scale pre-trained text representations, bidirectional encoder representations from Transformers (BERT) has been shown to embody semantic information and has recently been applied to TTS. However, original or simply fine-tuned BERT embeddings still cannot provide the semantic knowledge that expressive TTS models should take into account. In this paper, we propose a word-level semantic representation enhancing method based on dependency structure and pre-trained BERT embeddings. The BERT embedding of each word is reprocessed considering its specific dependencies and related words in the sentence, to generate a more effective semantic representation for TTS. To better utilize the dependency structure, a relational gated graph network (RGGN) is introduced to make semantic information flow and aggregate through the dependency structure. The experimental results show that the proposed method can further improve the naturalness and expressiveness of synthesized speech on both Mandarin and English datasets.
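
The idea of letting word embeddings exchange information along dependency arcs can be illustrated with a small sketch. This is not the authors' RGGN: the function name `propagate`, the scalar per-relation weights, the parameter-free gate, and the toy sentence are all assumptions made for illustration, standing in for learned per-relation transforms and gates over real BERT embeddings.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def propagate(embeddings, edges, rel_weight, steps=2):
    """Gated, relation-aware message passing over a dependency graph.

    embeddings: list of equal-length float lists, one per word
    edges: list of (head_index, dependent_index, relation_label)
    rel_weight: dict mapping relation label -> scalar weight
        (hypothetical stand-in for a learned per-relation transform)
    """
    h = [list(v) for v in embeddings]  # copy so the input is not mutated
    dim = len(h[0]) if h else 0
    for _ in range(steps):
        # Collect messages in both directions along each dependency arc.
        msgs = [[0.0] * dim for _ in h]
        for head, dep, rel in edges:
            w = rel_weight[rel]
            for k in range(dim):
                msgs[dep][k] += w * h[head][k]
                msgs[head][k] += w * h[dep][k]
        # Gated update: the gate mixes the old state with the squashed message.
        new_h = []
        for state, msg in zip(h, msgs):
            row = []
            for s, m in zip(state, msg):
                z = sigmoid(s + m)  # elementwise gate, no learned parameters (assumption)
                row.append((1.0 - z) * s + z * math.tanh(m))
            new_h.append(row)
        h = new_h
    return h


# Toy sentence "she sings well" with the verb as root (made-up 2-d vectors).
emb = [[0.1, 0.3], [0.5, -0.2], [-0.4, 0.2]]
deps = [(1, 0, "nsubj"), (1, 2, "advmod")]
weights = {"nsubj": 0.8, "advmod": 0.5}
enhanced = propagate(emb, deps, weights)
```

Each word's output vector now depends on its syntactic neighbours and on the relation type linking them, which is the general effect the abstract attributes to propagating information through the dependency structure.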


Pages: 5518-5522

Page count: 5

50 items in total

[31] Phonemic transcription by analogy in text-to-speech synthesis: Novel word pronunciation and lexicon compression
Bagshaw, PC
COMPUTER SPEECH AND LANGUAGE, 1998, 12 (02): 119-142
[32] Statistical Text-to-Speech Synthesis Based on Segment-Wise Representation With a Norm Constraint
Tiomkin, Stas
Malah, David
Shechtman, Slava
IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2010, 18 (05): 1077-1082
[33] Relation Extraction in Vietnamese Text via Piecewise Convolution Neural Network with Word-Level Attention
Van-Nhat Nguyen
Ha-Thanh Nguyen
Dinh-Hieu Vo
Le-Minh Nguyen
PROCEEDINGS OF 2018 5TH NAFOSTED CONFERENCE ON INFORMATION AND COMPUTER SCIENCE (NICS 2018), 2018: 99-103
[34] Enhancing Local Dependencies for Transformer-Based Text-to-Speech via Hybrid Lightweight Convolution
Zhao, Wei
He, Ting
Xu, Li
IEEE ACCESS, 2021, 9: 42762-42770
[35] ZERO-SHOT TEXT-TO-SPEECH SYNTHESIS CONDITIONED USING SELF-SUPERVISED SPEECH REPRESENTATION MODEL
Fujita, Kenichi
Ashihara, Takanori
Kanagawa, Hiroki
Moriya, Takafumi
Ijima, Yusuke
2023 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING WORKSHOPS, ICASSPW, 2023
[36] Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation
Tu, Tao
Chen, Yuan-Jui
Liu, Alexander H.
Lee, Hung-yi
INTERSPEECH 2020, 2020: 3191-3195
[37] MM-TTS: Multi-Modal Prompt Based Style Transfer for Expressive Text-to-Speech Synthesis
Guan, Wenhao
Li, Yishuang
Li, Tao
Huang, Hukai
Wang, Feng
Lin, Jiayan
Huang, Lingyan
Li, Lin
Hong, Qingyang
THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 16, 2024: 18117-18125
[38] NaturalSpeech: End-to-End Text-to-Speech Synthesis With Human-Level Quality
Tan, Xu
Chen, Jiawei
Liu, Haohe
Cong, Jian
Zhang, Chen
Liu, Yanqing
Wang, Xi
Leng, Yichong
Yi, Yuanhao
He, Lei
Zhao, Sheng
Qin, Tao
Soong, Frank
Liu, Tie-Yan
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2024, 46 (06): 4234-4245
[39] Low-level articulatory synthesis: A working text-to-speech solution and a linguistic tool
Hill, David R.
Taube-Schock, Craig R.
Manzara, Leonard
CANADIAN JOURNAL OF LINGUISTICS-REVUE CANADIENNE DE LINGUISTIQUE, 2017, 62 (03): 371-410
[40] Learning emotions latent representation with CVAE for text-driven expressive audiovisual speech synthesis
Dahmani, Sara
Colotte, Vincent
Girard, Valerian
Ouni, Slim
NEURAL NETWORKS, 2021, 141 (141): 315-329
