A KL Divergence and DNN-based Approach to Voice Conversion without Parallel Training Sentences

Cited: 50
Authors
Xie, Feng-Long [1 ,2 ,3 ]
Soong, Frank K. [2 ]
Li, Haifeng [1 ]
Affiliations
[1] Harbin Inst Technol, Harbin, Heilongjiang, Peoples R China
[2] Microsoft Res Asia, Beijing, Peoples R China
[3] Microsoft Res Asia, Speech Grp, Beijing, Peoples R China
Keywords
voice conversion; Kullback-Leibler divergence; deep neural networks; ARTIFICIAL NEURAL-NETWORKS;
DOI
10.21437/Interspeech.2016-116
Chinese Library Classification
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
We extend our recently proposed approach to cross-lingual TTS training to voice conversion without using parallel training sentences. It employs a Speaker-Independent Deep Neural Net (SI-DNN) ASR to equalize the difference between source and target speakers, and Kullback-Leibler Divergence (KLD) to convert spectral parameters probabilistically in the phonetic space via the ASR senone posterior probabilities of the two speakers. Depending on whether the transcriptions of the target speaker's training speech are known, the approach can be either supervised or unsupervised. In the supervised mode, where adequate transcribed training data of the target speaker is used to train a GMM-HMM TTS of the target speaker, each frame of the source speaker's input is mapped to the closest senone in the trained TTS. The mapping is done via the posterior probabilities computed by the SI-DNN ASR and minimum-KLD matching. In the unsupervised mode, all training data of the target speaker is first grouped into phonetic clusters, with KLD as the sole distortion measure. Once the phonetic clusters are trained, each frame of the source speaker's input is mapped to the mean of the closest phonetic cluster. The final converted speech is generated with the maximum-probability trajectory generation algorithm. Both objective and subjective evaluations show that the proposed approach achieves higher speaker similarity and lower spectral distortion than the baseline system based on our sequential-error-minimization trained DNN algorithm.
Pages: 287-291 (5 pages)