Variable-Length Speaker Conditioning in Flow-Based Text-to-Speech

Cited by: 0

Authors:

Choi, Byoung Jin [1]

Jeong, Myeonghun [1]

Kim, Minchan [1]

Kim, Nam Soo [1]

Affiliations:

[1] Seoul Natl Univ, Inst New Media & Commun, Dept Elect & Comp Engn, Seoul 08826, South Korea

Source:

IEEE SIGNAL PROCESSING LETTERS | 2024 / Vol. 31

Keywords:

Speech synthesis; variable-length reference embedding sequence; zero-shot multi-speaker text-to-speech

DOI:

10.1109/LSP.2024.3377588

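Chinese Library Classification (CLC) codes:

TM [Electrical engineering]; TN [Electronic technology, communication technology]

Subject classification codes:

0808; 0809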

Abstract:

In this letter, we propose a novel speaker conditioning technique that leverages a variable-length reference embedding sequence for flow-based text-to-speech (TTS) architectures in the context of zero-shot multi-speaker text-to-speech (ZSM-TTS). Unlike conventional ZSM-TTS methods, which usually rely on a single fixed-dimensional vector to represent the entire reference speech, our approach aims to extract a variable-length embedding sequence for more flexible and efficient conditioning. We enhance the current affine coupling function in flow-based TTS architectures by introducing attentive speaker conditioning, which allows the speaker conditioning to vary locally. Our experiments demonstrate the effectiveness of the proposed method, highlighting improvements in speaker similarity, speech naturalness, and speech intelligibility compared to the baseline methods.
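The abstract gives no implementation details, so the following is only a toy sketch of the general idea: an affine coupling step whose scale and shift are derived by attending over a reference embedding sequence of arbitrary length. It is not the authors' architecture. All function names are hypothetical, there are no learned weights (a `tanh` of the input plus the attention context stands in for the coupling network), and the reference embedding size is assumed to equal half the feature dimension.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, ref_seq):
    # Scaled dot-product attention of one frame over the reference sequence.
    # ref_seq may have any length; the output has the size of the query.
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, ref)) / math.sqrt(d)
              for ref in ref_seq]
    weights = softmax(scores)
    return [sum(w * ref[i] for w, ref in zip(weights, ref_seq))
            for i in range(d)]

def coupling_params(xa_frame, ref_seq):
    # Assumption: toy stand-in for a learned network. The per-frame
    # attention context makes the speaker conditioning vary locally.
    context = attend(xa_frame, ref_seq)
    log_scale = [math.tanh(a + c) for a, c in zip(xa_frame, context)]
    shift = context
    return log_scale, shift

def coupling_forward(x, ref_seq):
    # x: list of frames, each a list with an even number of features.
    half = len(x[0]) // 2
    y, log_det = [], 0.0
    for frame in x:
        xa, xb = frame[:half], frame[half:]
        log_scale, shift = coupling_params(xa, ref_seq)
        yb = [b * math.exp(s) + t for b, s, t in zip(xb, log_scale, shift)]
        y.append(xa + yb)          # first half passes through unchanged
        log_det += sum(log_scale)  # log-determinant of the Jacobian
    return y, log_det

def coupling_inverse(y, ref_seq):
    # Exact inverse: the parameters depend only on the unchanged half.
    half = len(y[0]) // 2
    x = []
    for frame in y:
        ya, yb = frame[:half], frame[half:]
        log_scale, shift = coupling_params(ya, ref_seq)
        xb = [(b - t) * math.exp(-s) for b, s, t in zip(yb, log_scale, shift)]
        x.append(ya + xb)
    return x
```

Because the scale and shift depend only on the half of each frame that passes through unchanged, the step stays invertible regardless of how many reference embeddings are supplied, which is the property a flow-based model needs.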

Pages: 899-903

Page count: 5
