Exploring the Robustness of Text-to-Speech Synthesis Based on Diffusion Probabilistic Models to Heavily Noisy Transcriptions

Authors
Feng, Jingyi [1]
Yasuda, Yusuke [1]
Toda, Tomoki [1]
Affiliations
[1] Nagoya Univ, Nagoya, Aichi, Japan
Keywords
speech synthesis; noisy transcription; diffusion-based model
DOI
10.21437/Interspeech.2024-2337
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Large data volumes can benefit text-to-speech (TTS), but speech data with high-quality annotation is limited. Automatic transcription enables the transcription of found speech data to enhance the data volume for TTS, but TTS training suffers from transcription errors. In this paper, we investigate the robustness of typical TTS models against heavily noisy transcripts, including diffusion, flow, and autoregressive-based TTS models, in terms of objective intelligibility and subjective naturalness. Our experimental results show that diffusion-based TTS is extremely robust to heavily noisy transcriptions, mitigating about 30% of the word error rate compared to autoregressive and flow-based models. We also show that iterative inference with a long diffusion time is key to the robustness of diffusion-based TTS based on likelihood analysis.
Pages: 4408-4412
Number of pages: 5