Multi-Level Temporal-Channel Speaker Retrieval for Zero-Shot Voice Conversion

Cited by: 1
Authors
Wang, Zhichao [1 ]
Xue, Liumeng [1 ]
Kong, Qiuqiang [2 ]
Xie, Lei [1 ]
Chen, Yuanzhe [2 ]
Tian, Qiao [2 ]
Wang, Yuping [2 ]
Affiliations
[1] Northwestern Polytech Univ, Sch Comp Sci, ASLP Lab, Xian 710072, Peoples R China
[2] ByteDance SAMI Grp, Shanghai 200233, Peoples R China
Keywords
Voice conversion; zero-shot; temporal-channel retrieval; attention mechanism
DOI
10.1109/TASLP.2024.3407577
Chinese Library Classification
O42 [Acoustics];
Subject Classification Code
070206; 082403
Abstract
Zero-shot voice conversion (VC) converts source speech into the voice of any desired speaker using only one utterance from that speaker, without requiring additional model updates. Typical methods achieve zero-shot VC by using a speaker representation from a pre-trained speaker verification (SV) model or by learning the speaker representation during VC training. However, existing speaker modeling methods overlook the variation in the richness of speaker information across the temporal and frequency-channel dimensions of speech. This insufficient speaker modeling hampers the VC model's ability to accurately represent unseen speakers, i.e., speakers absent from the training dataset. In this study, we present a robust zero-shot VC model with multi-level temporal-channel retrieval, referred to as MTCR-VC. Specifically, to flexibly adapt to speaker characteristics that vary along the temporal and channel axes of speech, we propose a novel fine-grained speaker modeling method, called temporal-channel retrieval (TCR), which identifies when and where speaker information appears in speech. It retrieves variable-length speaker representations from both the temporal and channel dimensions under the guidance of a pre-trained SV model. In addition, inspired by the hierarchical nature of human speech production, the MTCR speaker module stacks several TCR blocks to extract speaker representations at multiple granularity levels. Furthermore, we introduce a cycle-based training strategy that recurrently simulates zero-shot inference to achieve better speech disentanglement and reconstruction. To drive this process, we adopt perceptual constraints on three aspects: content, style, and speaker. Experiments demonstrate that MTCR-VC outperforms previous zero-shot VC methods in modeling speaker timbre while maintaining good speech naturalness.
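To make the abstract's core idea concrete, the sketch below illustrates attention-based pooling over both the temporal and channel axes of a frame-level feature map, the general mechanism the abstract describes. This is a minimal NumPy illustration of the concept, not the authors' MTCR-VC implementation; all function names, weight shapes, and the sigmoid channel-gating choice are hypothetical assumptions for the sake of the example.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def temporal_channel_pool(feats, w_t, W_c):
    """Pool frame-level features (T, C) into one speaker vector (C,).

    feats : (T, C) frame-level speech features
    w_t   : (C,)   temporal-attention scoring vector (hypothetical)
    W_c   : (C, C) channel-gating matrix (hypothetical)
    """
    # Temporal attention: score each frame, normalize, average over time,
    # so frames carrying more speaker information contribute more.
    alpha = softmax(feats @ w_t)      # (T,) frame weights, sum to 1
    pooled = alpha @ feats            # (C,) time-attended summary

    # Channel gating: reweight channels by how speaker-informative they are.
    gate = sigmoid(pooled @ W_c)      # (C,) per-channel gates in (0, 1)
    return pooled * gate              # (C,) speaker representation

rng = np.random.default_rng(0)
T, C = 120, 64                        # e.g. 120 frames, 64 channels
feats = rng.standard_normal((T, C))
spk = temporal_channel_pool(feats, rng.standard_normal(C),
                            0.1 * rng.standard_normal((C, C)))
print(spk.shape)                      # (64,)
```

The paper's TCR goes further (variable-length retrieval, SV-model guidance, stacked multi-granularity blocks), but the weighted pooling above captures the "when and where" intuition behind attending over time and channels.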
Pages: 2926-2937
Page count: 12
Related Papers
50 items total
  • [21] Zero-Shot Sketch-Based Remote-Sensing Image Retrieval Based on Multi-Level and Attention-Guided Tokenization
    Yang, Bo
    Wang, Chen
    Ma, Xiaoshuang
    Song, Beiping
    Liu, Zhuang
    Sun, Fangde
    REMOTE SENSING, 2024, 16 (10)
  • [22] END-TO-END ZERO-SHOT VOICE CONVERSION USING A DDSP VOCODER
    Nercessian, Shahan
    2021 IEEE WORKSHOP ON APPLICATIONS OF SIGNAL PROCESSING TO AUDIO AND ACOUSTICS (WASPAA), 2021, : 306 - 310
  • [23] ROBUST DISENTANGLED VARIATIONAL SPEECH REPRESENTATION LEARNING FOR ZERO-SHOT VOICE CONVERSION
    Lian, Jiachen
    Zhang, Chunlei
    Yu, Dong
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 6572 - 6576
  • [24] Towards Unseen Speakers Zero-Shot Voice Conversion with Generative Adversarial Networks
    Lu, Weirui
    Xing, Xiaofen
    Xu, Xiangmin
    Zhang, Weibin
    2021 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2021, : 854 - 858
  • [25] Zero-Shot Normalization Driven Multi-Speaker Text to Speech Synthesis
    Kumar, Neeraj
    Narang, Ankur
    Lall, Brejesh
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2022, 30 : 1679 - 1693
  • [26] Zero-shot multi-speaker accent TTS with limited accent data
    Zhang, Mingyang
    Zhou, Yi
    Wu, Zhizheng
    Li, Haizhou
    2023 ASIA PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE, APSIPA ASC, 2023, : 1931 - 1936
  • [27] ControlVC: Zero-Shot Voice Conversion with Time-Varying Controls on Pitch and Speed
    Chen, Meiying
    Duan, Zhiyao
    INTERSPEECH 2023, 2023, : 2098 - 2102
  • [28] MULTI-LABEL ZERO-SHOT AUDIO CLASSIFICATION WITH TEMPORAL ATTENTION
    Dogan, Duygu
    Xie, Huang
    Heittola, Toni
    Virtanen, Tuomas
    2024 18TH INTERNATIONAL WORKSHOP ON ACOUSTIC SIGNAL ENHANCEMENT, IWAENC 2024, 2024, : 250 - 254
  • [29] TRAINING ROBUST ZERO-SHOT VOICE CONVERSION MODELS WITH SELF-SUPERVISED FEATURES
    Trung Dang
    Dung Tran
    Chin, Peter
    Koishida, Kazuhito
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 6557 - 6561
  • [30] StreamVoice+: Evolving Into End-to-End Streaming Zero-Shot Voice Conversion
    Wang, Zhichao
    Chen, Yuanzhe
    Wang, Xinsheng
    Xie, Lei
    Wang, Yuping
    IEEE SIGNAL PROCESSING LETTERS, 2024, 31 : 3000 - 3004