Enhancing Word-Level Semantic Representation via Dependency Structure for Expressive Text-to-Speech Synthesis

Cited by: 2

Authors:

Zhou, Yixuan ^{[1,4]}

Song, Changhe ^{[1]}

Li, Jingbei ^{[1]}

Wu, Zhiyong ^{[1,2]}

Bian, Yanyao ^{[3]}

Su, Dan ^{[3]}

Meng, Helen ^{[2]}

Affiliations:

[1] Tsinghua Univ, Shenzhen Int Grad Sch, Shenzhen, Peoples R China

[2] Chinese Univ Hong Kong, Hong Kong, Peoples R China

[3] Tencent, Tencent AI Lab, Shenzhen, Peoples R China

[4] Tencent, Shenzhen, Peoples R China

Source:

INTERSPEECH 2022 | 2022

Funding:

National Natural Science Foundation of China;

Keywords:

expressive speech synthesis; semantic representation enhancing; dependency parsing; graph neural network;

DOI:

10.21437/Interspeech.2022-10061

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification codes:

070206 ; 082403 ;

Abstract:

Exploiting the rich linguistic information in raw text is crucial for expressive text-to-speech (TTS) synthesis. With the development of large-scale pre-trained text representations, bidirectional encoder representations from Transformers (BERT) has been shown to embody semantic information and has recently been applied to TTS. However, original or simply fine-tuned BERT embeddings still cannot provide the semantic knowledge that expressive TTS models need. In this paper, we propose a word-level semantic representation enhancement method based on dependency structure and pre-trained BERT embeddings. The BERT embedding of each word is reprocessed in light of its specific dependencies and related words in the sentence to generate a more effective semantic representation for TTS. To better utilize the dependency structure, a relational gated graph network (RGGN) is introduced so that semantic information flows and aggregates along the dependency structure. Experimental results show that the proposed method further improves the naturalness and expressiveness of synthesized speech on both Mandarin and English datasets.
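
The abstract does not give the RGGN equations, so the following is only a minimal sketch of the general idea it describes: word embeddings exchange messages along dependency arcs, with a separate weight matrix per relation type and direction, and a GRU-style gate decides how much of each word's embedding to overwrite. The function names (`init_params`, `propagate`), the random weights, the toy three-word sentence, and the random stand-in for BERT embeddings are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_params(dim, num_relations, rng):
    """Random weights for illustration only; a real model would learn these."""
    scale = 1.0 / np.sqrt(dim)
    params = {
        # One matrix per (relation, direction): index `rel` is used for
        # head -> dependent, index `rel + num_relations` for dependent -> head.
        "W_rel": rng.normal(0.0, scale, (2 * num_relations, dim, dim)),
    }
    for name in ("W_z", "U_z", "W_r", "U_r", "W_h", "U_h"):
        params[name] = rng.normal(0.0, scale, (dim, dim))
    return params


def propagate(h, edges, params, num_relations, steps=2):
    """Run `steps` rounds of gated message passing over dependency arcs.

    h:     (num_words, dim) array of word-level embeddings.
    edges: list of (head_index, dependent_index, relation_id) triples.
    """
    for _ in range(steps):
        # Aggregate relation-specific messages from neighbouring words.
        m = np.zeros_like(h)
        for head, dep, rel in edges:
            m[dep] += h[head] @ params["W_rel"][rel]
            m[head] += h[dep] @ params["W_rel"][rel + num_relations]
        # GRU-style gated update of every word embedding.
        z = sigmoid(m @ params["W_z"] + h @ params["U_z"])
        r = sigmoid(m @ params["W_r"] + h @ params["U_r"])
        candidate = np.tanh(m @ params["W_h"] + (r * h) @ params["U_h"])
        h = (1.0 - z) * h + z * candidate
    return h


rng = np.random.default_rng(0)
dim, num_relations = 8, 2

# Toy sentence "She loves music": word 1 ("loves") is the head of both arcs.
# Relation ids are hypothetical: 0 = subject arc, 1 = object arc.
edges = [(1, 0, 0), (1, 2, 1)]
embeddings = rng.normal(size=(3, dim))  # stand-in for word-level BERT embeddings
params = init_params(dim, num_relations, rng)

enhanced = propagate(embeddings, edges, params, num_relations)
print(enhanced.shape)  # (3, 8)
```

In this sketch the enhanced embeddings keep the shape of the input, so they could replace the original word-level embeddings wherever a TTS front end consumes them.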

Pages: 5518-5522

Page count: 5
