E3TTS: End-to-End Text-Based Speech Editing TTS System and Its Applications

Cited: 0
Authors
Liang, Zheng [1 ]
Ma, Ziyang [1 ]
Du, Chenpeng [1 ]
Yu, Kai [1 ]
Chen, Xie [1 ]
Affiliations
[1] Shanghai Jiao Tong Univ, MoE Key Lab Artificial Intelligence, Dept Comp Sci & Engn, X LANCE Lab,AI Inst, Shanghai 200240, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Hidden Markov models; Speech recognition; Data augmentation; Acoustics; Context modeling; Speech coding; Predictive models; Decoding; Splicing; Training; Automatic speech recognition; code-switching; data augmentation; named entity recognition; text-based speech editing; text-to-speech; ASR;
DOI
10.1109/TASLP.2024.3485466
CLC Classification
O42 [Acoustics]
Discipline Codes
070206; 082403
Abstract
Text-based speech editing aims to manipulate part of a real audio recording by modifying the corresponding transcribed text, without the edit being discernible to the human auditory system. With the enhanced capability of neural text-to-speech (TTS), researchers have begun to tackle speech editing with TTS methods. In this paper, we propose E3TTS, an end-to-end text-based speech editing TTS system that combines a text encoder, a speech encoder, and a joint net for speech synthesis and speech editing. E3TTS can insert, replace, and delete speech content at will by manipulating the given text. Experiments show that our speech editing model outperforms strong baselines on the HiFiTTS and LibriTTS datasets, whose speakers are seen and unseen, respectively. Furthermore, we apply E3TTS to data augmentation for automatic speech recognition (ASR) to mitigate data insufficiency in code-switching and named entity recognition scenarios. Compared with previous data augmentation methods, E3TTS retains the coherence and naturalness of the recorded audio. The experimental results show significant performance improvements over baseline systems with traditional TTS-based data augmentation. The code and samples of the proposed speech editing model are available at this repository.
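To make the abstract's architecture description concrete, below is a minimal PyTorch sketch of a text-encoder / speech-encoder / joint-net layout for masked-span re-synthesis. Everything here is an illustrative assumption rather than the paper's implementation: the layer types and sizes, the class name E3TTSSketch, and the way the edited region is masked out and re-predicted are all hypothetical.

import torch
import torch.nn as nn

class E3TTSSketch(nn.Module):
    """Hypothetical sketch: text encoder + speech encoder + joint net."""
    def __init__(self, vocab_size=256, d_model=256, n_mels=80):
        super().__init__()
        # Text encoder: embeds the (edited) token sequence.
        self.text_emb = nn.Embedding(vocab_size, d_model)
        self.text_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2)
        # Speech encoder: encodes the surrounding, unedited mel frames.
        self.speech_in = nn.Linear(n_mels, d_model)
        self.speech_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2)
        # Joint net: speech frames (edited span zeroed out) cross-attend
        # to the text, and mel frames are predicted for the whole utterance.
        self.joint = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2)
        self.mel_out = nn.Linear(d_model, n_mels)

    def forward(self, text_ids, mels, edit_mask):
        # edit_mask: (B, T_mel) bool, True on frames to re-synthesize.
        txt = self.text_enc(self.text_emb(text_ids))
        spc = self.speech_in(mels)
        spc = spc.masked_fill(edit_mask.unsqueeze(-1), 0.0)  # hide edited span
        spc = self.speech_enc(spc)
        fused = self.joint(spc, txt)   # cross-attention: speech queries, text keys
        return self.mel_out(fused)     # predicted mels; a vocoder is not shown

# Toy usage: "replace" frames 40..60, conditioned on the edited text.
model = E3TTSSketch()
text_ids = torch.randint(0, 256, (1, 32))
mels = torch.randn(1, 120, 80)
edit_mask = torch.zeros(1, 120, dtype=torch.bool)
edit_mask[:, 40:60] = True
pred = model(text_ids, mels, edit_mask)   # shape: (1, 120, 80)

Under this reading, insertion and deletion reduce to the same masked re-synthesis: insertion enlarges the span to be generated, while deletion shrinks it so the surrounding frames are joined.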
Pages: 4810-4821
Page count: 12
Related Papers (50 total)
  • [1] Emotion selectable end-to-end text-based speech editing
    Wang, Tao
    Yi, Jiangyan
    Fu, Ruibo
    Tao, Jianhua
    Wen, Zhengqi
    Zhang, Chu Yuan
    ARTIFICIAL INTELLIGENCE, 2024, 329
  • [2] A Novel End-to-End Turkish Text-to-Speech (TTS) System via Deep Learning
    Oyucu, Saadin
    ELECTRONICS, 2023, 12 (08)
  • [3] SR-TTS: a rhyme-based end-to-end speech synthesis system
    Yao, Yihao
    Liang, Tao
    Feng, Rui
    Shi, Keke
    Yu, Junxiao
    Wang, Wei
    Li, Jianqing
    FRONTIERS IN NEUROROBOTICS, 2024, 18
  • [4] SANE-TTS: Stable And Natural End-to-End Multilingual Text-to-Speech
    Cho, Hyunjae
    Jung, Wonbin
    Lee, Junhyeok
    Woo, Sang Hoon
    INTERSPEECH 2022, 2022, : 1 - 5
  • [5] Prosody-TTS: An End-to-End Speech Synthesis System with Prosody Control
    Pamisetty, Giridhar
    Murty, K. Sri Rama
    CIRCUITS SYSTEMS AND SIGNAL PROCESSING, 2023, 42 (01) : 361 - 384
  • [6] CampNet: Context-Aware Mask Prediction for End-to-End Text-Based Speech Editing
    Wang, Tao
    Yi, Jiangyan
    Fu, Ruibo
    Tao, Jianhua
    Wen, Zhengqi
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2022, 30 : 2241 - 2254
  • [7] Context-Aware Mask Prediction Network for End-to-End Text-Based Speech Editing
    Wang, Tao
    Yi, Jiangyan
    Deng, Liqun
    Fu, Ruibo
    Tao, Jianhua
    Wen, Zhengqi
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 6082 - 6086
  • [8] Boosting subjective quality of Arabic text-to-speech (TTS) using end-to-end deep architecture
    Fahmy, Fady K.
    Abbas, Hazem M.
    Khalil, Mahmoud I.
    INTERNATIONAL JOURNAL OF SPEECH TECHNOLOGY, 2022, 25 (01) : 79 - 88