E3TTS: End-to-End Text-Based Speech Editing TTS System and Its Applications

Cited by: 0
Authors
Liang, Zheng [1]
Ma, Ziyang [1]
Du, Chenpeng [1]
Yu, Kai [1]
Chen, Xie [1]
Affiliations
[1] Shanghai Jiao Tong Univ, MoE Key Lab Artificial Intelligence, Dept Comp Sci & Engn, X LANCE Lab, AI Inst, Shanghai 200240, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Hidden Markov models; Speech recognition; Data augmentation; Acoustics; Context modeling; Speech coding; Predictive models; Decoding; Splicing; Training; Automatic speech recognition; code-switching; data augmentation; named entity recognition; text-based speech editing; text-to-speech; ASR;
DOI
10.1109/TASLP.2024.3485466
Chinese Library Classification
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
Text-based speech editing aims to modify part of a real recording by editing the corresponding transcript, such that the edit is imperceptible to human listeners. With the improved capability of neural text-to-speech (TTS), researchers have begun to tackle speech editing with TTS methods. In this paper, we propose E3TTS, an end-to-end text-based speech editing TTS system that combines a text encoder, a speech encoder, and a joint net for both speech synthesis and speech editing. E3TTS can insert, replace, and delete speech content at will by manipulating the given text. Experiments show that our speech editing outperforms strong baselines on the HiFiTTS and LibriTTS datasets, whose speakers are seen and unseen, respectively. Further, we apply E3TTS to data augmentation for automatic speech recognition (ASR) to mitigate data insufficiency in code-switching and named entity recognition scenarios. Compared with past data augmentation methods, E3TTS retains the coherence and realism of the recorded audio. The experimental results show significant performance improvements over baseline systems with traditional TTS-based data augmentation. The code and samples of the proposed speech editing model are available at this repository.
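The abstract describes editing speech by inserting, replacing, or deleting words in the transcript. As a rough illustration only, the sketch below shows one way such text changes could be turned into explicit edit operations by diffing the original and edited transcripts at the word level with Python's standard `difflib`. This is not the E3TTS implementation; the function name `text_edit_ops` and the whitespace tokenization are assumptions made for the example.

```python
from difflib import SequenceMatcher


def text_edit_ops(original: str, edited: str):
    """Hypothetical helper: list word-level edits between two transcripts.

    Returns (op, original_words, edited_words) tuples, where op is
    "insert", "replace", or "delete". Unchanged spans are skipped.
    """
    # Assumption: plain whitespace tokenization is good enough for the demo.
    src = original.split()
    tgt = edited.split()
    matcher = SequenceMatcher(a=src, b=tgt, autojunk=False)
    ops = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            ops.append((tag, src[i1:i2], tgt[j1:j2]))
    return ops


if __name__ == "__main__":
    print(text_edit_ops("the cat sat on the mat", "the dog sat on the mat"))
    print(text_edit_ops("the cat sat", "the black cat sat"))
    print(text_edit_ops("the black cat sat", "the cat sat"))
```

In a speech editing pipeline, each returned span would identify the audio region to regenerate or remove, while the untouched words keep their original recording; how that region is located and synthesized is specific to the model and is not covered here.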
Pages: 4810-4821
Page count: 12