A KL Divergence and DNN-based Approach to Voice Conversion without Parallel Training Sentences

Cited by: 50
Authors
Xie, Feng-Long [1 ,2 ,3 ]
Soong, Frank K. [2 ]
Li, Haifeng [1 ]
Affiliations
[1] Harbin Inst Technol, Harbin, Heilongjiang, Peoples R China
[2] Microsoft Res Asia, Beijing, Peoples R China
[3] Microsoft Res Asia, Speech Grp, Beijing, Peoples R China
Keywords
voice conversion; Kullback-Leibler divergence; deep neural networks; artificial neural networks;
DOI
10.21437/Interspeech.2016-116
Chinese Library Classification
O42 [Acoustics];
Discipline codes
070206; 082403;
Abstract
We extend our recently proposed approach to cross-lingual TTS training to voice conversion without parallel training sentences. It employs Speaker-Independent Deep Neural Net (SI-DNN) ASR to equalize the difference between source and target speakers, and Kullback-Leibler Divergence (KLD) to convert spectral parameters probabilistically in the phonetic space via the ASR senone posterior probabilities of the two speakers. Depending on whether the transcriptions of the target speaker's training speech are available, the approach can be either supervised or unsupervised. In the supervised mode, where adequate transcribed training data of the target speaker is used to train a GMM-HMM TTS of the target speaker, each frame of the source speaker's input data is mapped to the closest senone in the trained TTS. The mapping is done via the posterior probabilities computed by SI-DNN ASR and minimum-KLD matching. In the unsupervised mode, all training data of the target speaker is first grouped into phonetic clusters, with KLD as the sole distortion measure. Once the phonetic clusters are trained, each frame of the source speaker's input is mapped to the mean of the closest phonetic cluster. The final converted speech is generated with the maximum probability trajectory generation algorithm. Both objective and subjective evaluations show that the proposed approach achieves higher speaker similarity and lower spectral distortion than the baseline system based on our sequential-error-minimization trained DNN algorithm.
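The frame-to-senone mapping described in the abstract can be sketched as a minimum-KLD search over posterior distributions. The following is a minimal illustration, not the authors' implementation: `frame_posterior` stands for the SI-DNN ASR senone posterior vector of one source frame, and `senone_posteriors` for the (hypothetical) averaged posterior vectors representing the target speaker's senones or phonetic clusters.

```python
import numpy as np

def kld(p, q, eps=1e-12):
    """Kullback-Leibler divergence D(p || q) between two discrete distributions."""
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return float(np.sum(p * np.log(p / q)))

def map_frame_to_senone(frame_posterior, senone_posteriors):
    """Map a source frame to the target senone (or phonetic cluster) whose
    posterior distribution has minimum KLD from the frame's posterior."""
    divergences = [kld(frame_posterior, q) for q in senone_posteriors]
    return int(np.argmin(divergences))

# Toy example: the frame's posterior is closest to the second senone.
frame = np.array([0.7, 0.2, 0.1])
senones = [np.array([0.1, 0.8, 0.1]), np.array([0.6, 0.3, 0.1])]
best = map_frame_to_senone(frame, senones)  # → 1
```

In the actual system, the spectral parameters associated with the selected senone (or cluster mean) would then feed the trajectory generation step; that stage is not shown here.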
Pages: 287-291 (5 pages)
Related Papers (50 total)
  • [21] Resisting DNN-Based Website Fingerprinting Attacks Enhanced by Adversarial Training
    Qiao, Litao
    Wu, Bang
    Yin, Shuijun
    Li, Heng
    Yuan, Wei
    Luo, Xiapu
    IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, 2023, 18 : 5375 - 5386
  • [22] Towards minimum perceptual error training for DNN-based speech synthesis
    Valentini-Botinhao, Cassia
    Wu, Zhizheng
    King, Simon
    16TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2015), VOLS 1-5, 2015, : 869 - 873
  • [23] NON-PARALLEL TRAINING FOR VOICE CONVERSION BASED ON ADAPTATION METHOD
    Song, Peng
    Zheng, Wenming
    Zhao, Li
    2013 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2013, : 6905 - 6909
  • [24] DNN-Based Speech Synthesis: Importance of Input Features and Training Data
    Lazaridis, Alexandros
    Potard, Blaise
    Garner, Philip N.
    SPEECH AND COMPUTER (SPECOM 2015), 2015, 9319 : 193 - 200
  • [25] Exploring redundancy of HRTFs for fast training DNN-based HRTF personalization
    Chen, Tzu-Yu
    Hsiao, Po-Wen
    Chi, Tai-Shih
    2018 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2018, : 1929 - 1933
  • [26] ARVC: An Auto-Regressive Voice Conversion System Without Parallel Training Data
    Lian, Zheng
    Wen, Zhengqi
    Zhou, Xinyong
    Pu, Songbai
    Zhang, Shengkai
    Tao, Jianhua
    INTERSPEECH 2020, 2020, : 4706 - 4710
  • [27] PHONETIC POSTERIORGRAMS FOR MANY-TO-ONE VOICE CONVERSION WITHOUT PARALLEL DATA TRAINING
    Sun, Lifa
    Li, Kun
    Wang, Hao
    Kang, Shiyin
    Meng, Helen
    2016 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA & EXPO (ICME), 2016,
  • [28] Many-to-Many and Completely Parallel-Data-Free Voice Conversion Based on Eigenspace DNN
    Hashimoto, Tetsuya
    Saito, Daisuke
    Minematsu, Nobuaki
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2019, 27 (02) : 332 - 341
  • [29] DNN-BASED VOICE ACTIVITY DETECTION USING AUXILIARY SPEECH MODELS IN NOISY ENVIRONMENTS
    Tachioka, Yuuki
    2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 5529 - 5533
  • [30] Nonparallel training for voice conversion based on a parameter adaptation approach
    Mouchtaris, A
    Van der Spiegel, J
    Mueller, P
    IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2006, 14 (03): : 952 - 963