Exploring the Robustness of Text-to-Speech Synthesis Based on Diffusion Probabilistic Models to Heavily Noisy Transcriptions

Cited by: 0
Authors
Feng, Jingyi [1]
Yasuda, Yusuke [1]
Toda, Tomoki [1]
Affiliations
[1] Nagoya Univ, Nagoya, Aichi, Japan
Source
INTERSPEECH 2024
Keywords
speech synthesis; noisy transcription; diffusion-based model
DOI
10.21437/Interspeech.2024-2337
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Large data volumes can benefit text-to-speech (TTS), but speech data with high-quality annotation is limited. Automatic transcription enables the transcription of found speech data to enhance the data volume for TTS, but TTS training suffers from transcription errors. In this paper, we investigate the robustness of typical TTS models against heavily noisy transcripts, including diffusion, flow, and autoregressive-based TTS models, in terms of objective intelligibility and subjective naturalness. Our experimental results show that diffusion-based TTS is extremely robust to heavily noisy transcriptions, mitigating about 30% of the word error rate compared to autoregressive and flow-based models. We also show that iterative inference with a long diffusion time is key to the robustness of diffusion-based TTS based on likelihood analysis.
Pages: 4408-4412
Number of pages: 5
Related papers
50 in total
  • [1] Grad-TTS: A Diffusion Probabilistic Model for Text-to-Speech
    Popov, Vadim
    Vovk, Ivan
    Gogoryan, Vladimir
    Sadekova, Tasnima
    Kudinov, Mikhail
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 139, 2021, 139
  • [2] On the Semantic Latent Space of Diffusion-Based Text-to-Speech Models
    Varshavsky-Hassid, Miri
    Hirsch, Roy
    Cohen, Regev
    Golany, Tomer
    Freedman, Daniel
    Rivlin, Ehud
    PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 2: SHORT PAPERS, 2024, : 246 - 255
  • [3] EXPLORING END-TO-END NEURAL TEXT-TO-SPEECH SYNTHESIS FOR ROMANIAN
    Dumitrache, Marius
    Rebedea, Traian
    PROCEEDINGS OF THE 15TH INTERNATIONAL CONFERENCE LINGUISTIC RESOURCES AND TOOLS FOR NATURAL LANGUAGE PROCESSING, 2020, : 93 - 102
  • [4] Intensity Modeling for Syllable Based Text-to-Speech Synthesis
    Reddy, V. Ramu
    Rao, K. Sreenivasa
    CONTEMPORARY COMPUTING, 2012, 306 : 106 - 117
  • [5] Residual-based speech modification algorithms for text-to-speech synthesis
    Edgington, M
    Lowry, A
    ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, 1996, : 1425 - 1428
  • [6] ZET-Speech: Zero-shot adaptive Emotion-controllable Text-to-Speech Synthesis with Diffusion and Style-based Models
    Kang, Minki
    Han, Wooseok
    Hwang, Sung Ju
    Yang, Eunho
    INTERSPEECH 2023, 2023, : 4339 - 4343
  • [7] Comparing normalizing flows and diffusion models for prosody and acoustic modelling in text-to-speech
    Zhang, Guangyan
    Merritt, Thomas
    Ribeiro, Manuel Sam
    Tura-Vecino, Biel
    Yanagisawa, Kayoko
    Pokora, Kamil
    Ezzerg, Abdelhamid
    Cygert, Sebastian
    Abbas, Ammar
    Bilinski, Piotr
    Barra-Chicote, Roberto
    Korzekwa, Daniel
    Lorenzo-Trueba, Jaime
    INTERSPEECH 2023, 2023, : 27 - 31
  • [8] Corpus-based Malay Text-to-Speech Synthesis System
    Swee, Tan Tian
    Salleh, Sheikh Hussain Shaikh
    2008 14TH ASIA-PACIFIC CONFERENCE ON COMMUNICATIONS, (APCC), VOLS 1 AND 2, 2008, : 52 - 56
  • [9] A RULE BASED PROSODY MODEL FOR TURKISH TEXT-TO-SPEECH SYNTHESIS
    Uslu, Ibrahim Baran
    Ilk, Hakki Gokhan
    Yilmaz, Asim Egemen
    TEHNICKI VJESNIK-TECHNICAL GAZETTE, 2013, 20 (02): : 217 - 223
  • [10] [Invited] Generative Model-Based Text-to-Speech Synthesis
    Zen, Heiga
    2018 IEEE 7TH GLOBAL CONFERENCE ON CONSUMER ELECTRONICS (GCCE 2018), 2018, : 327 - 328