TWO-STAGE TRAINING METHOD FOR JAPANESE ELECTROLARYNGEAL SPEECH ENHANCEMENT BASED ON SEQUENCE-TO-SEQUENCE VOICE CONVERSION

Cited by: 2
Authors
Ma, Ding [1]
Violeta, Lester Phillip [1]
Kobayashi, Kazuhiro [1]
Toda, Tomoki [1]
Affiliations
[1] Nagoya Univ, Nagoya, Japan
Keywords
sequence-to-sequence voice conversion; electrolaryngeal speech to normal speech; synthetic parallel data; two-stage training;
DOI
10.1109/SLT54892.2023.10023033
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
Sequence-to-sequence (seq2seq) voice conversion (VC) models have greater potential for converting electrolaryngeal (EL) speech to normal speech (EL2SP) than conventional VC models. However, seq2seq-based EL2SP requires a sufficiently large parallel dataset for model training and suffers significant performance degradation when training data are scarce. To address this issue, we propose a novel two-stage training strategy that improves seq2seq-based EL2SP when only a small parallel dataset is available. In contrast to the high-quality data augmentation used in previous studies, we first combine a large amount of imperfect synthetic parallel EL and normal speech data with the original dataset for VC training. A second training stage is then conducted with the original parallel dataset only. The results show that the proposed method progressively improves the performance of seq2seq-based EL2SP.
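The two-stage schedule described in the abstract can be sketched with a toy model: stage one trains on a large, noisy synthetic parallel set combined with the scarce real data, and stage two fine-tunes on the real parallel data alone. This is a minimal illustrative sketch using a linear map in place of a real seq2seq network; all names, data shapes, and hyperparameters are assumptions for illustration, not the authors' implementation.

```python
# Toy illustration of two-stage training: pretrain on imperfect synthetic
# parallel data plus real data, then fine-tune on real data only.
import numpy as np

rng = np.random.default_rng(0)

def train(W, src, tgt, lr=0.05, steps=200):
    """Plain gradient descent on a linear map src @ W -> tgt (MSE loss)."""
    for _ in range(steps):
        err = src @ W - tgt
        W -= lr * src.T @ err / len(src)
    return W

# True underlying EL -> normal-speech mapping (unknown to the model).
W_true = rng.normal(size=(4, 4))

# Small "real" parallel dataset (the scarce EL/normal recordings).
src_real = rng.normal(size=(20, 4))
tgt_real = src_real @ W_true

# Large but imperfect synthetic parallel dataset (e.g. TTS-generated),
# modeled here as the true mapping corrupted by noise.
src_syn = rng.normal(size=(500, 4))
tgt_syn = src_syn @ W_true + 0.3 * rng.normal(size=(500, 4))

# Stage 1: train on the synthetic and real data combined.
W = np.zeros((4, 4))
W = train(W, np.vstack([src_syn, src_real]), np.vstack([tgt_syn, tgt_real]))
err_stage1 = np.mean((src_real @ W - tgt_real) ** 2)

# Stage 2: continue training on the small real parallel dataset only.
W = train(W, src_real, tgt_real)
err_stage2 = np.mean((src_real @ W - tgt_real) ** 2)

print(err_stage1, err_stage2)
```

In this toy setting, stage two further reduces the error on the real data after the synthetic pretraining of stage one, mirroring the progressive improvement the paper reports for the seq2seq VC model.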
Pages: 949-954 (6 pages)