UNSUPERVISED WORD-LEVEL PROSODY TAGGING FOR CONTROLLABLE SPEECH SYNTHESIS

Cited by: 4

Authors:

Guo, Yiwei [1]

Du, Chenpeng [1]

Yu, Kai [1]

Affiliations:

[1] Shanghai Jiao Tong Univ, AI Inst, Dept Comp Sci & Engn, MoE Key Lab Artificial Intelligence, X-LANCE Lab, Shanghai, Peoples R China

Source:

2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2022

Keywords:

Prosody control; prosody tagging; word-level prosody; speech synthesis

DOI:

10.1109/ICASSP43922.2022.9746323

Chinese Library Classification:

O42 [Acoustics]

Discipline classification codes:

070206; 082403

Abstract:

Although word-level prosody modeling in neural text-to-speech (TTS) has been investigated in recent research on diverse speech synthesis, it is still challenging to control speech synthesis manually without a specific reference. This is largely due to the lack of word-level prosody tags. In this work, we propose a novel two-stage approach for unsupervised word-level prosody tagging: we first group the words into different types with a decision tree according to their phonetic content, and then cluster the prosodies within each type of words separately using a Gaussian mixture model (GMM). This design is based on the assumption that the prosodies of different types of words, such as long or short words, should be tagged with different label sets. Furthermore, a TTS system is trained with the derived word-level prosody tags for controllable speech synthesis. Experiments on LJSpeech show that the TTS model trained with word-level prosody tags not only achieves better naturalness than a typical FastSpeech2 model, but also gains the ability to manipulate word-level prosody.
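The two-stage idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: a single hand-written question on phoneme count (`word_type`) stands in for the decision tree, each word's prosody is reduced to one hypothetical scalar instead of a learned embedding, and the GMM is a small one-dimensional EM routine written for this example. All function names and the toy corpus are illustrative only.

```python
import math

def word_type(phonemes):
    # Stand-in for the decision tree (assumption): one question on word length.
    return "short" if len(phonemes) <= 3 else "long"

def gauss(x, mean, var):
    # Density of a 1-D normal distribution.
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_gmm_1d(xs, k, iters=50):
    """Fit a k-component 1-D GMM with plain EM.
    Returns a list of (weight, mean, variance), sorted by mean."""
    n = len(xs)
    ordered = sorted(xs)
    # Deterministic initialisation: evenly spaced quantiles of the data.
    means = [ordered[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    centre = sum(xs) / n
    variances = [sum((x - centre) ** 2 for x in xs) / n + 1e-6] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            p = [w * gauss(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(p) or 1e-300
            resp.append([pj / total for pj in p])
        # M-step: re-estimate weights, means and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-12
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            variances[j] = sum(r[j] * (x - means[j]) ** 2
                               for r, x in zip(resp, xs)) / nj + 1e-6
    return sorted(zip(weights, means, variances), key=lambda c: c[1])

def tag(x, gmm):
    # Index of the most likely component; 0 is the lowest-mean cluster.
    scores = [w * gauss(x, m, v) for w, m, v in gmm]
    return max(range(len(scores)), key=scores.__getitem__)

def tag_corpus(words, k=2):
    """words: list of (phonemes, prosody_value). Returns (type, cluster) tags."""
    groups = {}
    for phonemes, x in words:
        groups.setdefault(word_type(phonemes), []).append(x)
    # Stage 2: a separate GMM, and so a separate label set, per word type.
    gmms = {t: fit_gmm_1d(xs, k) for t, xs in groups.items()}
    return [(word_type(ph), tag(x, gmms[word_type(ph)])) for ph, x in words]

# Toy corpus (made-up values): phoneme sequence and a scalar prosody feature.
corpus = [
    (["DH", "AH"], 0.10), (["K", "AE", "T"], 0.20), (["IH", "T"], 0.15),
    (["N", "OW"], 0.90), (["G", "OW"], 1.00), (["S", "IY"], 0.95),
    (["P", "R", "AA", "S", "AH", "D", "IY"], 2.0),
    (["S", "IH", "N", "TH", "AH", "S", "IH", "S"], 2.1),
    (["K", "AH", "N", "T", "R", "OW", "L"], 1.9),
    (["N", "AE", "CH", "ER", "AH", "L"], 3.0),
    (["M", "AA", "D", "AH", "L", "IH", "NG"], 3.1),
    (["S", "P", "IY", "CH", "IH", "Z"], 2.9),
]
tags = tag_corpus(corpus)
print(tags)
```

Because each word type gets its own mixture, cluster 1 for short words and cluster 1 for long words are different labels, which mirrors the abstract's assumption that different word types need different label sets.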


Pages: 7597-7601

Number of pages: 5

