Visual-to-Speech Conversion Based on Maximum Likelihood Estimation

Cited: 0
Authors
Ra, Rina [1]
Aihara, Ryo [1]
Takiguchi, Tetsuya [1]
Ariki, Yasuo [1]
Affiliation
[1] Kobe Univ, Grad Sch Syst Informat, Nada Ku, 1-1 Rokkodai, Kobe, Hyogo, Japan
Keywords
VOICE CONVERSION;
DOI
Not available
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This paper proposes a visual-to-speech conversion method that converts voiceless lip movements into voiced utterances without recognizing text information. Inspired by Gaussian Mixture Model (GMM)-based voice conversion, a GMM is estimated from joint visual and audio features, and input visual features are converted to audio features using maximum likelihood estimation. In order to capture lip movements, whose frame rate is lower than that of the audio data, we construct long-term image features. The proposed method has been evaluated on large-vocabulary continuous speech, and experimental results show that it effectively estimates the spectral envelopes and fundamental frequencies of audio speech from voiceless lip movements.
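The core idea of the abstract, a GMM trained on joint source/target features used to map one modality to the other, can be sketched as follows. This is a minimal illustrative sketch with hand-set, hypothetical toy parameters (one-dimensional "visual" and "audio" features, two components), not the paper's implementation: it performs the simpler per-frame conversion via posterior-weighted conditional means, whereas the paper estimates whole trajectories by maximum likelihood.

```python
import numpy as np

# Toy joint GMM over z = [x, y], where x stands in for a visual
# feature and y for an audio feature (both 1-D here). All parameters
# below are invented for illustration only.
weights = np.array([0.5, 0.5])             # mixture weights
mu = np.array([[0.0, 1.0],                 # [mu_x, mu_y] per component
               [4.0, 3.0]])
cov = np.array([[[1.0, 0.8],               # joint covariance per component
                 [0.8, 1.0]],
                [[1.0, -0.5],
                 [-0.5, 1.0]]])

def convert(x):
    """Map an input feature x to an output feature as the
    posterior-weighted sum of per-component conditional means E[y|x,m]."""
    # Component responsibilities P(m|x), computed from the x-marginals.
    px = np.array([
        w * np.exp(-0.5 * (x - m[0]) ** 2 / c[0, 0])
        / np.sqrt(2 * np.pi * c[0, 0])
        for w, m, c in zip(weights, mu, cov)
    ])
    post = px / px.sum()
    # Conditional mean of y given x for each component (joint-Gaussian rule).
    cond = np.array([
        m[1] + c[1, 0] / c[0, 0] * (x - m[0])
        for m, c in zip(mu, cov)
    ])
    return float(post @ cond)

print(convert(0.0))   # dominated by component 0, near its mu_y = 1.0
print(convert(4.0))   # dominated by component 1, near its mu_y = 3.0
```

In a real system the joint vectors would concatenate long-term image features with spectral features frame by frame, and the GMM parameters would be learned with EM rather than fixed by hand.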
Pages: 518-521
Number of pages: 4