Exploring the Robustness of Text-to-Speech Synthesis Based on Diffusion Probabilistic Models to Heavily Noisy Transcriptions

Cited by: 0

Authors:

Feng, Jingyi [1]

Yasuda, Yusuke [1]

Toda, Tomoki [1]

Affiliations:

[1] Nagoya Univ, Nagoya, Aichi, Japan

Source:

INTERSPEECH 2024 | 2024

Keywords:

speech synthesis; noisy transcription; diffusion-based model

DOI:

10.21437/Interspeech.2024-2337

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Large data volumes can benefit text-to-speech (TTS), but speech data with high-quality annotation is limited. Automatic transcription enables the transcription of found speech data to enhance the data volume for TTS, but TTS training suffers from transcription errors. In this paper, we investigate the robustness of typical TTS models against heavily noisy transcripts, including diffusion, flow, and autoregressive-based TTS models, in terms of objective intelligibility and subjective naturalness. Our experimental results show that diffusion-based TTS is extremely robust to heavily noisy transcriptions, mitigating about 30% of the word error rate compared to autoregressive and flow-based models. We also show that iterative inference with a long diffusion time is key to the robustness of diffusion-based TTS based on likelihood analysis.

Pages: 4408-4412

Page count: 5

