Word-level Text Markup for Prosody Control in Speech Synthesis

Authors
Korotkova, Yuliya [1, 2]
Kalinovskiy, Ilya [1, 3]
Vakhrusheva, Tatiana [1, 2]
Affiliations
[1] JustAI, St Petersburg, Russia
[2] Higher Sch Econ, Moscow, Russia
[3] Tomsk Polytech Univ, Sch Comp Sci & Robot, Tomsk, Russia
Keywords
prosody control; prosody tagging; word-level prosody; speech synthesis; TTS
DOI
10.21437/Interspeech.2024-715
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Modern Text-to-Speech (TTS) technologies generate speech that is very close to natural speech, but synthesized voices still lack variation in intonation, which is also hard to control. In this work, we address the problem of prosody control, aiming to capture information about intonation in a markup without hand-labeling or linguistic expertise. We propose a method for encoding prosodic knowledge from the textual and acoustic modalities, obtained with models pretrained on self-supervised tasks, into a latent quantized space with interpretable features. The prosodic markup is constructed from these features; it is used as an additional input to the TTS model to address the one-to-many problem and is predicted from text. Moreover, this method allows for prosody control at inference time and scales to new data and other languages.
Pages: 2280-2284
Number of pages: 5