Word-level Text Markup for Prosody Control in Speech Synthesis

Cited by: 0

Authors:

Korotkova, Yuliya [1,2]

Kalinovskiy, Ilya [1,3]

Vakhrusheva, Tatiana [1,2]

Affiliations:

[1] JustAI, St Petersburg, Russia

[2] Higher Sch Econ, Moscow, Russia

[3] Tomsk Polytech Univ, Sch Comp Sci & Robot, Tomsk, Russia

Source:

INTERSPEECH 2024 | 2024

Keywords:

prosody control; prosody tagging; word-level prosody; speech synthesis; TTS

DOI:

10.21437/Interspeech.2024-715

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Subject classification codes:

081104; 0812; 0835; 1405

Abstract:

Modern Text-to-Speech (TTS) technologies generate speech very close to the natural one, but synthesized voices still lack variation in intonation which, in addition, is hard to control. In this work, we address the problem of prosody control, aiming to capture information about intonation in a markup without hand-labeling and linguistic expertise. We propose a method of encoding prosodic knowledge from textual and acoustic modalities, which are obtained with the help of models pretrained on self-supervised tasks, into latent quantized space with interpretable features. Based on these features, the prosodic markup is constructed, and it is used as an additional input to the TTS model to solve the one-to-many problem and is predicted by text. Moreover, this method allows for prosody control during inference time and scalability to new data and other languages.

Pages: 2280-2284

Number of pages: 5
