Multi-Level Temporal-Channel Speaker Retrieval for Zero-Shot Voice Conversion

Cited by: 1

Authors:

Wang, Zhichao [1]

Xue, Liumeng [1]

Kong, Qiuqiang [2]

Xie, Lei [1]

Chen, Yuanzhe [2]

Tian, Qiao [2]

Wang, Yuping [2]

Affiliations:

[1] Northwestern Polytechnical University, School of Computer Science, ASLP Lab, Xi'an 710072, People's Republic of China

[2] ByteDance SAMI Group, Shanghai 200233, People's Republic of China

Source:

IEEE/ACM Transactions on Audio, Speech, and Language Processing | 2024 / Vol. 32

Keywords:

Voice conversion; zero-shot; temporal-channel retrieval; attention mechanism; attention

DOI:

10.1109/TASLP.2024.3407577

Chinese Library Classification (CLC) number:

O42 [Acoustics]

Discipline classification codes:

070206; 082403

Abstract:

Zero-shot voice conversion (VC) converts source speech into the voice of any desired speaker using only one utterance from that speaker, without additional model updates. Typical methods achieve zero-shot VC either by using a speaker representation from a pre-trained speaker verification (SV) model or by learning a speaker representation during VC training. However, existing speaker modeling methods overlook how the richness of speaker information varies across the temporal and frequency-channel dimensions of speech. This insufficient speaker modeling hampers the VC model's ability to accurately represent unseen speakers, i.e., speakers not in the training dataset. In this study, we present a robust zero-shot VC model with multi-level temporal-channel retrieval, referred to as MTCR-VC. Specifically, to adapt flexibly to speaker characteristics that vary dynamically along the temporal and channel axes of speech, we propose a novel fine-grained speaker modeling method, called temporal-channel retrieval (TCR), which finds when and where speaker information appears in speech. TCR retrieves variable-length speaker representations from both the temporal and channel dimensions under the guidance of a pre-trained SV model. In addition, inspired by the hierarchical process of human speech production, the MTCR speaker module stacks several TCR blocks to extract speaker representations at multiple levels of granularity. Furthermore, we introduce a cycle-based training strategy that recurrently simulates zero-shot inference to achieve better speech disentanglement and reconstruction. To drive this process, we adopt perceptual constraints on three aspects: content, style, and speaker. Experiments demonstrate that MTCR-VC outperforms previous zero-shot VC methods in modeling speaker timbre while maintaining good speech naturalness.

Pages: 2926-2937

Page count: 12
