A KL Divergence and DNN-based Approach to Voice Conversion without Parallel Training Sentences

Cited by: 50
Authors
Xie, Feng-Long [1 ,2 ,3 ]
Soong, Frank K. [2 ]
Li, Haifeng [1 ]
Affiliations
[1] Harbin Inst Technol, Harbin, Heilongjiang, Peoples R China
[2] Microsoft Res Asia, Beijing, Peoples R China
[3] Microsoft Res Asia, Speech Grp, Beijing, Peoples R China
Keywords
voice conversion; Kullback-Leibler divergence; deep neural networks; artificial neural networks
DOI
10.21437/Interspeech.2016-116
Chinese Library Classification (CLC): O42 [Acoustics]
Discipline codes: 070206; 082403
Abstract
We extend our recently proposed approach to cross-lingual TTS training to voice conversion without parallel training sentences. It employs a Speaker-Independent Deep Neural Network (SI-DNN) ASR model to equalize the difference between source and target speakers, and Kullback-Leibler Divergence (KLD) to convert spectral parameters probabilistically in the phonetic space via the ASR senone posterior probabilities of the two speakers. Depending on whether the transcriptions of the target speaker's training speech are available, the approach can be either supervised or unsupervised. In the supervised mode, where adequate transcribed training data of the target speaker is used to train a GMM-HMM TTS model of the target speaker, each frame of the source speaker's input is mapped to the closest senone in the thus-trained TTS model; the mapping is done via the posterior probabilities computed by the SI-DNN ASR and minimum-KLD matching. In the unsupervised mode, all training data of the target speaker is first grouped into phonetic clusters, with KLD as the sole distortion measure. Once the phonetic clusters are trained, each frame of the source speaker's input is mapped to the mean of the closest phonetic cluster. The final converted speech is generated with the maximum-probability trajectory generation algorithm. Both objective and subjective evaluations show that the proposed approach achieves higher speaker similarity and lower spectral distortion than the baseline system based on our sequential-error-minimization-trained DNN algorithm.
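The core frame-level operation shared by both modes is: take the SI-DNN senone posterior of each source frame and match it, under minimum KL divergence, against the target speaker's senone templates (supervised mode) or phonetic-cluster templates (unsupervised mode), then emit the spectral mean of the matched template. Below is a minimal sketch of that matching step; the array names (source_posteriors, target_templates, target_means) and the NumPy-based implementation are illustrative assumptions, not the authors' code, and the trajectory generation step is omitted.

import numpy as np

def kld(p, q, eps=1e-10):
    # Kullback-Leibler divergence D(p || q) between two senone posterior vectors.
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def map_frames_min_kld(source_posteriors, target_templates, target_means):
    # source_posteriors: (T, S) senone posteriors of the source speaker's frames,
    #                    as produced by an SI-DNN ASR front end (hypothetical input).
    # target_templates:  (K, S) senone posteriors of the target speaker's senones
    #                    (supervised mode) or phonetic clusters (unsupervised mode).
    # target_means:      (K, D) spectral-parameter means associated with each template.
    # Returns a (T, D) sequence of mapped target spectral parameters.
    mapped = []
    for p in source_posteriors:
        # Minimum-KLD matching in the phonetic space.
        distances = [kld(p, q) for q in target_templates]
        mapped.append(target_means[int(np.argmin(distances))])
    return np.stack(mapped)

In practice the mapped spectral means would then be smoothed by the maximum-probability trajectory generation algorithm mentioned in the abstract before waveform synthesis.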
Pages: 287-291 (5 pages)