ZERO-SHOT TEXT-TO-SPEECH SYNTHESIS CONDITIONED USING SELF-SUPERVISED SPEECH REPRESENTATION MODEL

Cited by: 1
Authors
Fujita, Kenichi [1 ]
Ashihara, Takanori [1 ]
Kanagawa, Hiroki [1 ]
Moriya, Takafumi [1 ]
Ijima, Yusuke [1 ]
Affiliations
[1] NTT Corp, Tokyo, Japan
Source
2023 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING WORKSHOPS, ICASSPW | 2023
Keywords
Speech synthesis; self-supervised learning model; speaker embeddings; zero-shot TTS
DOI
10.1109/ICASSPW59220.2023.10193459
Chinese Library Classification (CLC)
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
This paper proposes a zero-shot text-to-speech (TTS) method conditioned on a speech-representation model acquired through self-supervised learning (SSL). Conventional methods that use embedding vectors from x-vectors or global style tokens still leave a gap in reproducing the speaker characteristics of unseen speakers. A novel point of the proposed method is the direct use of the SSL model to obtain embedding vectors from speech representations learned from a large amount of data. We also introduce separate conditioning of the acoustic features and the phoneme duration predictor to obtain embeddings that disentangle rhythm-based speaker characteristics from acoustic-feature-based ones. These disentangled embeddings enable better reproduction of unseen speakers and rhythm transfer conditioned on a different reference utterance. Objective and subjective evaluations showed that the proposed method can synthesize speech with improved similarity and achieve speech-rhythm transfer.
Pages: 5
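
The abstract describes two design points: speaker embeddings taken directly from an SSL speech-representation model, and separate conditioning of the acoustic features and the phoneme duration predictor so that rhythm-based and acoustic-feature-based speaker characteristics stay disentangled. Below is a minimal, illustrative PyTorch sketch of that conditioning scheme, not the authors' implementation; the SSL encoder choice, module names (SSLSpeakerEncoder, ZeroShotTTSSketch), dimensions, and network architectures are assumptions made here for clarity, and length regulation plus the vocoder are omitted.

import torch
import torch.nn as nn


class SSLSpeakerEncoder(nn.Module):
    """Pools frame-level SSL features of a reference utterance into one
    fixed-size embedding (used here in place of an x-vector or style tokens)."""

    def __init__(self, ssl_dim: int = 768, emb_dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(ssl_dim, emb_dim)

    def forward(self, ssl_feats: torch.Tensor) -> torch.Tensor:
        # ssl_feats: (batch, frames, ssl_dim) from a frozen SSL model
        return self.proj(ssl_feats.mean(dim=1))  # (batch, emb_dim)


class ZeroShotTTSSketch(nn.Module):
    """Two separate reference embeddings: one conditions the phoneme duration
    predictor (rhythm), the other the acoustic decoder (voice quality)."""

    def __init__(self, n_phonemes: int = 100, emb_dim: int = 256, n_mels: int = 80):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, emb_dim)
        self.rhythm_encoder = SSLSpeakerEncoder(emb_dim=emb_dim)
        self.voice_encoder = SSLSpeakerEncoder(emb_dim=emb_dim)
        self.duration_rnn = nn.GRU(2 * emb_dim, emb_dim, batch_first=True)
        self.duration_out = nn.Linear(emb_dim, 1)
        self.decoder_rnn = nn.GRU(2 * emb_dim, emb_dim, batch_first=True)
        self.mel_out = nn.Linear(emb_dim, n_mels)

    def forward(self, phonemes, ssl_feats_rhythm, ssl_feats_voice):
        # phonemes: (batch, T) phoneme IDs
        # ssl_feats_*: (batch, frames, 768) SSL features of reference utterances
        h = self.phoneme_emb(phonemes)                              # (B, T, E)
        rhythm = self.rhythm_encoder(ssl_feats_rhythm)              # (B, E)
        voice = self.voice_encoder(ssl_feats_voice)                 # (B, E)

        # The duration predictor is conditioned only on the rhythm embedding.
        d_in = torch.cat([h, rhythm.unsqueeze(1).expand_as(h)], dim=-1)
        d_h, _ = self.duration_rnn(d_in)
        log_durations = self.duration_out(d_h).squeeze(-1)          # (B, T)

        # The acoustic decoder is conditioned only on the voice embedding.
        # (Length regulation to frame level and vocoding are omitted here.)
        a_in = torch.cat([h, voice.unsqueeze(1).expand_as(h)], dim=-1)
        a_h, _ = self.decoder_rnn(a_in)
        mel = self.mel_out(a_h)                                     # (B, T, n_mels)
        return log_durations, mel

Because the rhythm and voice embeddings are computed from reference utterances passed in separately, feeding a different speaker's utterance as ssl_feats_rhythm corresponds to the speech-rhythm-transfer use case mentioned in the abstract.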