E3TTS: End-to-End Text-Based Speech Editing TTS System and Its Applications

Cited: 0
Authors
Liang, Zheng [1 ]
Ma, Ziyang [1 ]
Du, Chenpeng [1 ]
Yu, Kai [1 ]
Chen, Xie [1 ]
Affiliations
[1] Shanghai Jiao Tong Univ, MoE Key Lab Artificial Intelligence, Dept Comp Sci & Engn, X LANCE Lab,AI Inst, Shanghai 200240, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Hidden Markov models; Speech recognition; Data augmentation; Acoustics; Context modeling; Speech coding; Predictive models; Decoding; Splicing; Training; Automatic speech recognition; code-switching; data augmentation; named entity recognition; text-based speech editing; text-to-speech; ASR;
DOI
10.1109/TASLP.2024.3485466
Chinese Library Classification
O42 [Acoustics];
Discipline Codes
070206; 082403;
Abstract
Text-based speech editing aims to manipulate part of a real audio recording by modifying the corresponding transcribed text, without the edit being discernible to the human auditory system. With the enhanced capability of neural text-to-speech (TTS), researchers have tried to tackle speech editing with TTS methods. In this paper, we propose E3TTS, an end-to-end text-based speech editing TTS system, which combines a text encoder, a speech encoder, and a joint net for speech synthesis and speech editing. E3TTS can insert, replace, and delete speech content at will by manipulating the given text. Experiments show that our speech editing outperforms strong baselines on the HiFiTTS and LibriTTS datasets, whose speakers are seen and unseen, respectively. Further, we introduce E3TTS into data augmentation for automatic speech recognition (ASR) to mitigate the data-insufficiency problem in code-switching and named entity recognition scenarios. Compared to past data augmentation methods, E3TTS retains the coherence and realism of the recorded audio. The experimental results show significant performance improvements over baseline systems with traditional TTS-based data augmentation. The code and samples of the proposed speech editing model are available at this repository.
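The abstract describes the model as three cooperating parts: a text encoder for the edited transcript, a speech encoder for the surrounding recorded audio, and a joint net that produces the edited speech. The following PyTorch sketch is a minimal illustration of how such a three-module layout could be wired, assuming GRU encoders, an MLP joint net, and a simple pooling-based fusion; all class names, layer sizes, and the fusion step are hypothetical and do not reproduce the authors' implementation.

# Minimal sketch, assuming GRU encoders and an MLP joint net; the names,
# sizes, and pooling fusion are illustrative assumptions only.
import torch
import torch.nn as nn

class E3TTSSketch(nn.Module):
    def __init__(self, vocab_size: int = 100, mel_dim: int = 80, hidden: int = 256):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, hidden)
        # Text encoder: encodes the edited transcript.
        self.text_encoder = nn.GRU(hidden, hidden, batch_first=True)
        # Speech encoder: encodes the surrounding recorded audio (mel frames).
        self.speech_encoder = nn.GRU(mel_dim, hidden, batch_first=True)
        # Joint net: fuses both streams and predicts mel frames for the
        # region being inserted or replaced.
        self.joint_net = nn.Sequential(
            nn.Linear(2 * hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, mel_dim),
        )

    def forward(self, tokens: torch.Tensor, context_mels: torch.Tensor) -> torch.Tensor:
        text_out, _ = self.text_encoder(self.text_embed(tokens))
        speech_out, _ = self.speech_encoder(context_mels)
        # Toy fusion: pool the speech context and broadcast it across text
        # positions; a real system would align the two streams instead.
        speech_ctx = speech_out.mean(dim=1, keepdim=True).expand(-1, text_out.size(1), -1)
        return self.joint_net(torch.cat([text_out, speech_ctx], dim=-1))

# Example: predict mel frames for a 12-token edited span given 200 frames
# of untouched audio context around the edit point.
model = E3TTSSketch()
tokens = torch.randint(0, 100, (1, 12))      # edited transcript tokens
context_mels = torch.randn(1, 200, 80)       # surrounding mel-spectrogram
print(model(tokens, context_mels).shape)     # torch.Size([1, 12, 80])

The sketch only conveys the interface implied by the abstract (edited text plus audio context in, acoustic frames for the edited span out); the actual E3TTS architecture, training objective, and vocoder are specified in the paper itself.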
Pages: 4810 - 4821
Page count: 12