WESPER: Zero-shot and Realtime Whisper to Normal Voice Conversion for Whisper-based Speech Interactions

Cited by: 2
Authors
Rekimoto, Jun [1,2]
Institutions
[1] Univ Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, Japan
[2] Sony Comp Sci Labs Kyoto, 13-1 Hontoro-cho, Shimogyo-ku, Kyoto, Japan
Keywords
speech interaction; whispered voice; whispered voice conversion; silent speech; artificial intelligence; neural networks; self-supervised learning
DOI
10.1145/3544548.3580706
Chinese Library Classification (CLC)
TP [Automation Technology; Computer Technology]
Subject Classification Code
0812
Abstract
Recognizing whispered speech and converting it to normal speech creates many possibilities for speech interaction. Because the sound pressure of whispered speech is significantly lower than that of normal speech, it can be used for semi-silent speech interaction in public places without being audible to others. Converting whispers to normal speech also improves speech quality for people with speech or hearing impairments. However, conventional speech conversion techniques do not provide sufficient conversion quality or require speaker-dependent datasets consisting of pairs of whispered and normal speech utterances. To address these problems, we propose WESPER, a zero-shot, real-time whisper-to-normal speech conversion mechanism based on self-supervised learning. WESPER consists of a speech-to-unit (STU) encoder, which generates hidden speech units common to both whispered and normal speech, and a unit-to-speech (UTS) decoder, which reconstructs speech from the encoded speech units. Unlike existing methods, this conversion is user-independent and does not require a paired dataset of whispered and normal speech. The UTS decoder can reconstruct speech in any target speaker's voice from speech units, requiring only unlabeled speech data from the target speaker. We confirmed that the quality of speech converted from a whisper was improved while preserving its natural prosody. Additionally, we confirmed the effectiveness of the proposed approach for speech reconstruction for people with speech or hearing disabilities.
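The two-stage design described in the abstract (an STU encoder that quantizes speech into speaker-independent units shared by whispered and normal speech, and a UTS decoder that resynthesizes speech in a target voice from those units) can be illustrated with a short sketch. The PyTorch code below is a minimal, hypothetical illustration only: the class names, the toy convolutional front end, the codebook lookup, and all dimensions are assumptions made for exposition, not the paper's implementation, which builds the encoder on self-supervised speech representations and trains the decoder from unlabeled target-speaker speech.

import torch
import torch.nn as nn


class STUEncoder(nn.Module):
    """Speech-to-unit encoder: waveform -> discrete unit IDs (hypothetical sketch)."""

    def __init__(self, n_units: int = 100, dim: int = 256):
        super().__init__()
        # Strided convolutions downsample 16 kHz audio to roughly 50 frames/s.
        self.conv = nn.Sequential(
            nn.Conv1d(1, dim, kernel_size=10, stride=5), nn.GELU(),
            nn.Conv1d(dim, dim, kernel_size=8, stride=4), nn.GELU(),
            nn.Conv1d(dim, dim, kernel_size=4, stride=4), nn.GELU(),
            nn.Conv1d(dim, dim, kernel_size=4, stride=4),
        )
        # Codebook of unit embeddings; in the paper's setting these would be
        # clusters over self-supervised features shared by whisper and speech.
        self.codebook = nn.Parameter(torch.randn(n_units, dim))

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, samples) -> frames: (batch, time, dim)
        frames = self.conv(wav.unsqueeze(1)).transpose(1, 2)
        # Nearest-neighbor quantization against the codebook -> unit IDs.
        codes = self.codebook.unsqueeze(0).expand(frames.size(0), -1, -1)
        return torch.cdist(frames, codes).argmin(dim=-1)  # (batch, time)


class UTSDecoder(nn.Module):
    """Unit-to-speech decoder: unit IDs -> mel frames in a target voice (hypothetical)."""

    def __init__(self, n_units: int = 100, dim: int = 256, n_mels: int = 80):
        super().__init__()
        self.embed = nn.Embedding(n_units, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * dim, n_mels)  # a neural vocoder would follow

    def forward(self, units: torch.Tensor) -> torch.Tensor:
        hidden, _ = self.lstm(self.embed(units))
        return self.proj(hidden)  # (batch, time, n_mels)


if __name__ == "__main__":
    wav = torch.randn(1, 16000)    # one second of placeholder 16 kHz audio
    units = STUEncoder()(wav)      # whisper -> speaker-independent unit IDs
    mel = UTSDecoder()(units)      # unit IDs -> target speaker's mel frames
    print(units.shape, mel.shape)  # torch.Size([1, 49]) torch.Size([1, 49, 80])

The property the sketch mimics is that the unit IDs are discrete and carry no speaker identity, so a decoder trained only on a target speaker's normal speech can, in principle, resynthesize whispered input in that speaker's voice without any paired whisper/speech data.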
Pages: 12