E3TTS: End-to-End Text-Based Speech Editing TTS System and Its Applications

Cited: 0
Authors
Liang, Zheng [1 ]
Ma, Ziyang [1 ]
Du, Chenpeng [1 ]
Yu, Kai [1 ]
Chen, Xie [1 ]
Affiliations
[1] Shanghai Jiao Tong Univ, MoE Key Lab Artificial Intelligence, Dept Comp Sci & Engn, X LANCE Lab,AI Inst, Shanghai 200240, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Hidden Markov models; Speech recognition; Data augmentation; Acoustics; Context modeling; Speech coding; Predictive models; Decoding; Splicing; Training; Automatic speech recognition; code-switching; data augmentation; named entity recognition; text-based speech editing; text-to-speech; ASR;
DOI
10.1109/TASLP.2024.3485466
Chinese Library Classification
O42 [Acoustics];
Discipline Codes
070206; 082403;
Abstract
Text-based speech editing aims to manipulate part of a real audio recording by modifying the corresponding transcribed text, without the edit being discernible to the human auditory system. With the enhanced capability of neural text-to-speech (TTS), researchers have tried to tackle speech editing with TTS methods. In this paper, we propose E3TTS, an end-to-end text-based speech editing TTS system, which combines a text encoder, a speech encoder, and a joint net for speech synthesis and speech editing. E3TTS can insert, replace, and delete speech content at will by manipulating the given text. Experiments show that our speech editing outperforms strong baselines on the HiFiTTS and LibriTTS datasets, whose speakers are seen and unseen, respectively. Further, we introduce E3TTS into data augmentation for automatic speech recognition (ASR) to mitigate the data-insufficiency problem in code-switching and named entity recognition scenarios. Compared to past data augmentation methods, E3TTS retains the coherence and realism of the recorded audio. The experimental results show significant performance improvements over baseline systems with traditional TTS-based data augmentation. The code and samples of the proposed speech editing model are available at this repository.
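The abstract describes the model as three cooperating parts: a text encoder for the edited transcript, a speech encoder for the surrounding recorded audio, and a joint net that produces the edited speech. The following PyTorch sketch is a minimal illustration of how such a three-module layout could be wired, assuming GRU encoders, an MLP joint net, and a simple pooling-based fusion; all class names, layer sizes, and the fusion step are hypothetical and do not reproduce the authors' implementation.

# Minimal sketch, assuming GRU encoders and an MLP joint net; the names,
# sizes, and pooling fusion are illustrative assumptions only.
import torch
import torch.nn as nn

class E3TTSSketch(nn.Module):
    def __init__(self, vocab_size: int = 100, mel_dim: int = 80, hidden: int = 256):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, hidden)
        # Text encoder: encodes the edited transcript.
        self.text_encoder = nn.GRU(hidden, hidden, batch_first=True)
        # Speech encoder: encodes the surrounding recorded audio (mel frames).
        self.speech_encoder = nn.GRU(mel_dim, hidden, batch_first=True)
        # Joint net: fuses both streams and predicts mel frames for the
        # region being inserted or replaced.
        self.joint_net = nn.Sequential(
            nn.Linear(2 * hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, mel_dim),
        )

    def forward(self, tokens: torch.Tensor, context_mels: torch.Tensor) -> torch.Tensor:
        text_out, _ = self.text_encoder(self.text_embed(tokens))
        speech_out, _ = self.speech_encoder(context_mels)
        # Toy fusion: pool the speech context and broadcast it across text
        # positions; a real system would align the two streams instead.
        speech_ctx = speech_out.mean(dim=1, keepdim=True).expand(-1, text_out.size(1), -1)
        return self.joint_net(torch.cat([text_out, speech_ctx], dim=-1))

# Example: predict mel frames for a 12-token edited span given 200 frames
# of untouched audio context around the edit point.
model = E3TTSSketch()
tokens = torch.randint(0, 100, (1, 12))      # edited transcript tokens
context_mels = torch.randn(1, 200, 80)       # surrounding mel-spectrogram
print(model(tokens, context_mels).shape)     # torch.Size([1, 12, 80])

The sketch only conveys the interface implied by the abstract (edited text plus audio context in, acoustic frames for the edited span out); the actual E3TTS architecture, training objective, and vocoder are specified in the paper itself.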
Pages: 4810 - 4821
Page count: 12