HMM trajectory-guided sample selection for photo-realistic talking head

Authors
Lijuan Wang
Frank K. Soong
Affiliation
[1] Microsoft Research Asia
Source
Multimedia Tools and Applications, 2015, 74 (22)
Keywords
Visual speech synthesis; Photo-realistic; Talking head; Trajectory-guided sample selection
DOI
Not available
Abstract
In this paper, we propose an HMM trajectory-guided, real image sample concatenation approach to photo-realistic talking head synthesis. An audio-visual database of a person is first recorded to train a statistical hidden Markov model (HMM) of lip movement. The HMM is then used to generate the dynamic trajectory of lip movement for a given speech signal in the maximum probability sense. The generated trajectory then guides the selection, from the original training database, of an optimal sequence of lip images, which are stitched back into a background head video. We also propose a minimum generation error (MGE) training method that refines the audio-visual HMM to improve visual speech trajectory synthesis. Compared with traditional maximum likelihood (ML) estimation, the proposed MGE training explicitly optimizes the quality of the generated visual speech trajectory: the audio-visual HMM is jointly refined by using a heuristic method to find the optimal state alignment and a probabilistic descent algorithm to optimize the model parameters under the MGE criterion. In objective evaluation, the proposed MGE-based method consistently improves on the ML-based method in mean square error reduction, correlation increase, and recovery of global variance. With as little as 20 minutes of recorded audio/video footage, the proposed system can synthesize a highly photo-realistic talking head in sync with the given speech signals (natural or TTS-synthesized). The system won first place in the A/V consistency contest of the LIPS Challenge, as perceptually evaluated by recruited human subjects.
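The trajectory-guided selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the cost function or search procedure, so everything below is an assumption made for illustration: the hypothetical function `select_samples`, the squared-Euclidean target cost (distance from a candidate sample to the guide trajectory), the squared-Euclidean concatenation cost (distance between consecutive selected samples), and the `concat_weight` parameter. It is a generic dynamic-programming search, not the authors' implementation.

```python
from typing import List, Sequence

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def select_samples(
    trajectory: List[Vector],
    samples: List[Vector],
    concat_weight: float = 1.0,
) -> List[int]:
    """Pick one sample index per frame by dynamic programming.

    Path cost (an assumed form, not taken from the paper):
      sum of target costs (sample vs. guide trajectory frame)
      + concat_weight * sum of concatenation costs (consecutive samples).
    """
    n_frames, n_samples = len(trajectory), len(samples)
    if n_frames == 0 or n_samples == 0:
        return []

    # best[j]: minimal cost of any path that ends at sample j on the current frame
    best = [sq_dist(s, trajectory[0]) for s in samples]
    back: List[List[int]] = []  # back[t - 1][j]: best predecessor of sample j at frame t

    for t in range(1, n_frames):
        new_best, pointers = [], []
        for s in samples:
            # Cheapest predecessor, including the cost of joining it to s
            i_best = min(
                range(n_samples),
                key=lambda i: best[i] + concat_weight * sq_dist(samples[i], s),
            )
            join_cost = concat_weight * sq_dist(samples[i_best], s)
            new_best.append(best[i_best] + join_cost + sq_dist(s, trajectory[t]))
            pointers.append(i_best)
        best = new_best
        back.append(pointers)

    # Trace back from the cheapest final sample
    j = min(range(n_samples), key=lambda k: best[k])
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return path


if __name__ == "__main__":
    guide = [[0.1], [0.9], [2.1]]       # toy 1-D guide trajectory
    database = [[0.0], [1.0], [2.0]]    # toy 1-D sample features
    print(select_samples(guide, database, concat_weight=0.0))
    print(select_samples(guide, database, concat_weight=100.0))
```

With `concat_weight=0.0` the search reduces to choosing the nearest sample for each frame independently. A large weight favors staying on one sample to avoid visible jumps. This trade-off is the reason for combining the two costs in this sketch.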
Pages: 9849-9869
Page count: 20
Related papers
19 in total
  • [1] HMM trajectory-guided sample selection for photo-realistic talking head
    Wang, Lijuan
    Soong, Frank K.
    MULTIMEDIA TOOLS AND APPLICATIONS, 2015, 74 (22) : 9849 - 9869
  • [2] Synthesizing Photo-Real Talking Head via Trajectory-Guided Sample Selection
    Wang, Lijuan
    Qian, Xiaojun
    Han, Wei
    Soong, Frank K.
    11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 1-2, 2010 : 446 - 449
  • [3] Photo-Realistic Expressive Text to Talking Head Synthesis
    Wan, Vincent
    Anderson, Robert
    Blokland, Art
    Braunschweiler, Norbert
    Chen, Langzhou
    Kolluru, BalaKrishna
    Latorre, Javier
    Maia, Ranniery
    Stenger, Bjoern
    Yanagisawa, Kayoko
    Stylianou, Yannis
    Akamine, Masami
    Gales, Mark J. F.
    Cipolla, Roberto
    14TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2013), VOLS 1-5, 2013 : 2666 - 2668
  • [4] Sample-based synthesis of photo-realistic talking heads
    Cosatto, E
    Graf, HP
    COMPUTER ANIMATION 98 - PROCEEDINGS, 1998 : 103 - 110
  • [5] Text Driven 3D Photo-Realistic Talking Head
    Wang, Lijuan
    Han, Wei
    Soong, Frank K.
    Huo, Qiang
    12TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2011 (INTERSPEECH 2011), VOLS 1-5, 2011 : 3314 - 3315
  • [6] A New Language Independent, Photo-realistic Talking Head Driven by Voice Only
    Zhang, Xinjian
    Wang, Lijuan
    Li, Gang
    Seide, Frank
    Soong, Frank K.
    14TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2013), VOLS 1-5, 2013 : 2742 - 2746
  • [7] Photo-realistic Text-driven Malay talking head with multiple expression
    Tan, Tian-Swee
    Salleh, Sh-Hussain
    Chew, Kim-Mey
    Lim, Sheau-Chyi
    2008 INTERNATIONAL CONFERENCE ON COMPUTER AND COMMUNICATION ENGINEERING, VOLS 1-3, 2008 : 711 - 715
  • [8] 3D Photo-realistic talking head for human-robot interaction
    Simplicio, Carlos
    Faria, Diego R.
    Dias, Jorge
    VIRTUAL AND RAPID MANUFACTURING: ADVANCED RESEARCH IN VIRTUAL AND RAPID PROTOTYPING, 2008, : 677 - +
  • [9] Audio-visual unit selection for the synthesis of photo-realistic talking-heads
    Cosatto, E
    Potamianos, G
    Graf, HP
    2000 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, PROCEEDINGS VOLS I-III, 2000 : 619 - 622
  • [10] Photo-Realistic Talking-Heads from Image Samples
    Cosatto, Eric
    Graf, Hans Peter
    IEEE TRANSACTIONS ON MULTIMEDIA, 2000, 2 (03) : 152 - 163