HMM trajectory-guided sample selection for photo-realistic talking head

Cited by: 0

Authors:

Lijuan Wang

Frank K. Soong

Affiliations:

[1] Microsoft Research Asia,

Source:

Multimedia Tools and Applications | 2015 / Vol. 74

Keywords:

Visual speech synthesis; Photo-realistic; Talking head; Trajectory-guided sample selection

DOI:

Not available

Chinese Library Classification (CLC) number:

Subject classification number:

Abstract:

In this paper, we propose an HMM trajectory-guided, real-image sample concatenation approach to photo-realistic talking head synthesis. An audio-visual database of a person is first recorded to train a statistical hidden Markov model (HMM) of lip movement. The HMM is then used to generate the dynamic trajectory of lip movement for given speech signals in the maximum probability sense. The generated trajectory then guides the selection, from the original training database, of an optimal sequence of lip images, which are stitched back into a background head video. We also propose a minimum generation error (MGE) training method that refines the audio-visual HMM to improve visual speech trajectory synthesis. Compared with traditional maximum likelihood (ML) estimation, the proposed MGE training explicitly optimizes the quality of the generated visual speech trajectory: the audio-visual HMM is jointly refined by using a heuristic method to find the optimal state alignment and a probabilistic descent algorithm to optimize the model parameters under the MGE criterion. In objective evaluation, the MGE-based method consistently improves on the ML-based method in mean square error reduction, correlation increase, and recovery of global variance. With as little as 20 minutes of recorded audio/video footage, the proposed system can synthesize a highly photo-realistic talking head in sync with the given speech signals (natural or TTS-synthesized). The system won first place in the A/V consistency contest of the LIPS Challenge, as perceptually evaluated by recruited human subjects.
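The trajectory-guided selection step described in the abstract can be illustrated with a small dynamic-programming sketch. This is not the authors' implementation: the function name `select_samples`, the squared-Euclidean target and concatenation costs, the `concat_weight` parameter, and the representation of each lip image as a short feature vector are all assumptions made for illustration. The sketch picks one library sample per guide frame so that the summed distance to the guide trajectory plus a weighted smoothness penalty between consecutive samples is minimal.

```python
from typing import List, Sequence

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def select_samples(
    guide: List[Vector],
    library: List[Vector],
    concat_weight: float = 1.0,
) -> List[int]:
    """Return one library index per guide frame, minimizing total cost.

    Total cost = sum of target costs (sample vs. guide frame)
               + concat_weight * sum of concatenation costs
                 (distance between consecutive selected samples).
    Both cost definitions are illustrative assumptions.
    """
    if not guide or not library:
        return []
    n = len(library)
    # Pairwise concatenation cost between library samples.
    concat = [
        [concat_weight * sq_dist(library[i], library[j]) for j in range(n)]
        for i in range(n)
    ]
    # cost[j]: cheapest path ending in sample j at the current frame.
    cost = [sq_dist(guide[0], s) for s in library]
    back: List[List[int]] = []  # back-pointers, one list per later frame
    for frame in guide[1:]:
        new_cost, pointers = [], []
        for j in range(n):
            best_i = min(range(n), key=lambda i: cost[i] + concat[i][j])
            pointers.append(best_i)
            new_cost.append(
                cost[best_i] + concat[best_i][j] + sq_dist(frame, library[j])
            )
        back.append(pointers)
        cost = new_cost
    # Trace the cheapest path back from the last frame.
    j = min(range(n), key=lambda k: cost[k])
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return path
```

With `concat_weight=0.0` the search reduces to picking the nearest sample for each frame independently; larger weights trade fidelity to the guide trajectory for smoother transitions between consecutive images. The search is quadratic in the library size per frame, so a practical system would presumably prune candidates first.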

Citation

Pages: 9849-9869

Page count: 20

19 items in total

[1] HMM trajectory-guided sample selection for photo-realistic talking head
Wang, Lijuan
Soong, Frank K.
MULTIMEDIA TOOLS AND APPLICATIONS, 2015, 74 (22) : 9849 - 9869
[2] Synthesizing Photo-Real Talking Head via Trajectory-Guided Sample Selection
Wang, Lijuan
Qian, Xiaojun
Han, Wei
Soong, Frank K.
11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 1-2, 2010, : 446 - 449
[3] Photo-Realistic Expressive Text to Talking Head Synthesis
Wan, Vincent
Anderson, Robert
Blokland, Art
Braunschweiler, Norbert
Chen, Langzhou
Kolluru, BalaKrishna
Latorre, Javier
Maia, Ranniery
Stenger, Bjoern
Yanagisawa, Kayoko
Stylianou, Yannis
Akamine, Masami
Gales, Mark J. F.
Cipolla, Roberto
14TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2013), VOLS 1-5, 2013, : 2666 - 2668
[4] Sample-based synthesis of photo-realistic talking heads
Cosatto, E
Graf, HP
COMPUTER ANIMATION 98 - PROCEEDINGS, 1998, : 103 - 110
[5] Text Driven 3D Photo-Realistic Talking Head
Wang, Lijuan
Han, Wei
Soong, Frank K.
Huo, Qiang
12TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2011 (INTERSPEECH 2011), VOLS 1-5, 2011, : 3314 - 3315
[6] A New Language Independent, Photo-realistic Talking Head Driven by Voice Only
Zhang, Xinjian
Wang, Lijuan
Li, Gang
Seide, Frank
Soong, Frank K.
14TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2013), VOLS 1-5, 2013, : 2742 - 2746
[7] Photo-realistic Text-driven Malay talking head with multiple expression
Tan, Tian-Swee
Salleh, Sh-Hussain
Chew, Kim-Mey
Lim, Sheau-Chyi
2008 INTERNATIONAL CONFERENCE ON COMPUTER AND COMMUNICATION ENGINEERING, VOLS 1-3, 2008, : 711 - 715
[8] 3D Photo-realistic talking head for human-robot interaction
Simplicio, Carlos
Faria, Diego R.
Dias, Jorge
VIRTUAL AND RAPID MANUFACTURING: ADVANCED RESEARCH IN VIRTUAL AND RAPID PROTOTYPING, 2008, : 677 - +
[9] Audio-visual unit selection for the synthesis of photo-realistic talking-heads
Cosatto, E
Potamianos, G
Graf, HP
2000 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, PROCEEDINGS VOLS I-III, 2000, : 619 - 622
[10] Photo-Realistic Talking-Heads from Image Samples
Cosatto, Eric
Graf, Hans Peter
IEEE TRANSACTIONS ON MULTIMEDIA, 2000, 2 (03) : 152 - 163
