A KL Divergence and DNN-based Approach to Voice Conversion without Parallel Training Sentences

Cited by: 50
Authors
Xie, Feng-Long [1 ,2 ,3 ]
Soong, Frank K. [2 ]
Li, Haifeng [1 ]
Affiliations
[1] Harbin Inst Technol, Harbin, Heilongjiang, Peoples R China
[2] Microsoft Res Asia, Beijing, Peoples R China
[3] Microsoft Res Asia, Speech Grp, Beijing, Peoples R China
Keywords
voice conversion; Kullback-Leibler divergence; deep neural networks; ARTIFICIAL NEURAL-NETWORKS;
DOI
10.21437/Interspeech.2016-116
CLC number
O42 [Acoustics];
Discipline codes
070206; 082403;
Abstract
We extend our recently proposed approach to cross-lingual TTS training to voice conversion without using parallel training sentences. It employs a Speaker-Independent Deep Neural Net (SI-DNN) ASR to equalize the difference between source and target speakers, and Kullback-Leibler Divergence (KLD) to convert spectral parameters probabilistically in the phonetic space via the ASR senone posterior probabilities of the two speakers. Depending on whether the transcriptions of the target speaker's training speech are known, the approach can be either supervised or unsupervised. In the supervised mode, where adequate transcribed training data of the target speaker is used to train a GMM-HMM TTS of the target speaker, each frame of the source speaker's input is mapped to the closest senone in the trained TTS; the mapping is done via the posterior probabilities computed by the SI-DNN ASR and minimum-KLD matching. In the unsupervised mode, all training data of the target speaker is first grouped into phonetic clusters, with KLD as the sole distortion measure. Once the phonetic clusters are trained, each frame of the source speaker's input is mapped to the mean of the closest phonetic cluster. The final converted speech is generated with the maximum-probability trajectory generation algorithm. Both objective and subjective evaluations show that the proposed approach achieves higher speaker similarity and lower spectral distortion than the baseline system based on our sequential-error-minimization trained DNN algorithm.
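The core operation described in the abstract, mapping each source frame to the target senone (or cluster) whose posterior distribution is closest under KL divergence, can be sketched as follows. This is an illustrative sketch, not the paper's implementation: the function names, the use of NumPy, and the toy distributions are assumptions, and the small epsilon clipping is a standard numerical safeguard for zero-probability entries.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D(p || q) between two discrete posterior distributions.

    Both inputs are 1-D arrays summing to ~1; eps avoids log(0).
    """
    p = np.clip(np.asarray(p, dtype=float), eps, None)
    q = np.clip(np.asarray(q, dtype=float), eps, None)
    return float(np.sum(p * np.log(p / q)))

def map_frame_to_senone(frame_posterior, senone_posteriors):
    """Return the index of the senone whose (mean) posterior
    distribution minimizes KLD to the source frame's posterior."""
    divergences = [kl_divergence(frame_posterior, q)
                   for q in senone_posteriors]
    return int(np.argmin(divergences))

# Toy example: the frame's posterior is closest to senone 1.
frame = np.array([0.7, 0.2, 0.1])
senones = [np.array([0.1, 0.8, 0.1]),   # senone 0
           np.array([0.65, 0.25, 0.1])]  # senone 1
print(map_frame_to_senone(frame, senones))  # → 1
```

In the supervised mode the candidate distributions would come from senones of the target speaker's GMM-HMM TTS; in the unsupervised mode they would be the means of KLD-trained phonetic clusters.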
Pages: 287-291 (5 pages)
Related papers
50 records
  • [1] Voice conversion with SI-DNN and KL divergence based mapping without parallel training data
    Xie, Feng-Long
    Soong, Frank K.
    Li, Haifeng
    SPEECH COMMUNICATION, 2019, 106 : 57 - 67
  • [2] DNN-based Approach to Detect and Classify Pathological Voice
    Chuang, Zong-Ying
    Yu, Xiao-Tong
    Chen, Ji-Ying
    Hsu, Yi-Te
    Xu, Zhe-Zhuang
    Wang, Chi-Te
    Lin, Feng-Chuan
    Fang, Shih-Hau
    2018 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2018, : 5238 - 5241
  • [3] On the Training of DNN-based Average Voice Model for Speech Synthesis
    Yang, Shan
    Wu, Zhizheng
    Xie, Lei
    2016 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA), 2016,
  • [4] DNN-Based Cross-Lingual Voice Conversion Using Bottleneck Features
    M. Kiran Reddy
    K. Sreenivasa Rao
    Neural Processing Letters, 2020, 51 : 2029 - 2042
  • [5] DNN-Based Cross-Lingual Voice Conversion Using Bottleneck Features
    Reddy, M. Kiran
    Rao, K. Sreenivasa
    NEURAL PROCESSING LETTERS, 2020, 51 (02) : 2029 - 2042
  • [6] DNN-Based Duration Modeling for Synthesizing Short Sentences
    Nagy, Peter
    Nemeth, Geza
    Speech and Computer, 2016, 9811 : 254 - 261
  • [7] Unsupervised Training of a DNN-based Formant Tracker
    Lilley, Jason
    Bunnell, H. Timothy
    INTERSPEECH 2021, 2021, : 1189 - 1193
  • [8] A KL DIVERGENCE AND DNN APPROACH TO CROSS-LINGUAL TTS
    Xie, Feng-Long
    Soong, Frank K.
    Li, Haifeng
    2016 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING PROCEEDINGS, 2016, : 5515 - 5519
  • [9] Frame Selection in SI-DNN Phonetic Space with WaveNet Vocoder for Voice Conversion without Parallel Training Data
    Xie, Feng-Long
    Soong, Frank K.
    Wang, Xi
    He, Lei
    Li, Haifeng
    2018 11TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), 2018, : 56 - 60
  • [10] DNN-based Voice Conversion with Auxiliary Phonemic Information to Improve Intelligibility of Glossectomy Patients' Speech
    Murakami, Hiroki
    Hara, Sunao
    Abe, Masanobu
    2019 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2019, : 138 - 142