CorrTalk: Correlation Between Hierarchical Speech and Facial Activity Variances for 3D Animation

Citations: 0
Authors
Chu, Zhaojie [1 ]
Guo, Kailing [1 ,2 ]
Xing, Xiaofen [1 ]
Lan, Yilin [3 ]
Cai, Bolun [4 ]
Xu, Xiangmin [2 ,3 ,5 ]
Affiliations
[1] South China Univ Technol, Sch Elect & Informat Engn, Guangzhou 510640, Peoples R China
[2] Pazhou Lab, Guangzhou 510335, Peoples R China
[3] South China Univ Technol, Sch Future Technol, Guangzhou 510640, Peoples R China
[4] ByteDance Inc, Shenzhen 518000, Peoples R China
[5] Hefei Comprehens Natl Sci Ctr, Inst Artificial Intelligence, Hefei 230088, Peoples R China
Keywords
3D facial animation; hierarchical speech features; 3D talking head; facial activity variance; transformer; NETWORK;
DOI
10.1109/TCSVT.2024.3386836
CLC Classification
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology];
Subject Classification Codes
0808; 0809;
Abstract
Speech-driven 3D facial animation is a challenging cross-modal task that has attracted growing research interest. During speaking activities, the mouth displays strong motions, while the other facial regions typically demonstrate comparatively weak activity levels. Existing approaches often simplify the process by directly mapping single-level speech features to the entire facial animation, which overlooks the differences in facial activity intensity and leads to overly smoothed facial movements. In this study, we propose a novel framework, CorrTalk, which effectively establishes the temporal correlation between hierarchical speech features and facial activities of different intensities across distinct regions. A novel facial activity intensity prior, obtained by statistically analyzing facial animations, is defined to distinguish between strong and weak facial activity. Based on this prior, we propose a dual-branch decoding framework that synchronously synthesizes strong and weak facial activity, guaranteeing facial animation synthesis across a wider range of intensities. Furthermore, a weighted hierarchical feature encoder is proposed to establish the temporal correlation between hierarchical speech features and facial activity at different intensities, ensuring lip-sync and plausible facial expressions. Extensive qualitative and quantitative experiments, as well as a user study, indicate that CorrTalk outperforms existing state-of-the-art methods. The source code and supplementary video are publicly available at: https://zjchu.github.io/projects/CorrTalk/.
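The facial activity intensity prior described in the abstract — statistically separating strong-activity regions (e.g. the mouth) from weak ones — could be sketched as below. This is a minimal illustration, not the paper's exact procedure: the per-vertex variance measure, the top-ratio threshold rule, and the function name `facial_activity_prior` are all assumptions made for clarity.

```python
import numpy as np

def facial_activity_prior(seqs, ratio=0.2):
    """Split mesh vertices into strong/weak activity groups.

    seqs: list of (T, V, 3) arrays of vertex positions over time.
    ratio: fraction of vertices to label as strong activity (assumed rule).
    Returns a boolean mask of shape (V,), True = strong activity.
    """
    per_vertex = []
    for s in seqs:
        disp = s - s.mean(axis=0, keepdims=True)          # motion about the mean pose
        per_vertex.append((disp ** 2).sum(axis=-1).mean(axis=0))  # variance per vertex
    var = np.mean(per_vertex, axis=0)                     # average over sequences
    k = max(1, int(ratio * var.shape[0]))
    strong = np.zeros(var.shape[0], dtype=bool)
    strong[np.argsort(var)[-k:]] = True                   # top-ratio most active vertices
    return strong

# Toy demo: vertices 0-4 jitter strongly (a stand-in "mouth"), 5-19 barely move.
rng = np.random.default_rng(0)
seq = rng.normal(0, 0.01, size=(50, 20, 3))
seq[:, :5] += rng.normal(0, 1.0, size=(50, 5, 3))
mask = facial_activity_prior([seq], ratio=0.25)
print(mask[:5].all(), mask[5:].sum())  # True 0
```

In this sketch the resulting boolean mask would route vertices to the strong- or weak-activity decoding branch that the abstract describes.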
Pages: 8953-8965
Page count: 13
Related Papers (50 total; items [21]-[30] shown)
  • [21] 3D facial animation driven by speech-video dual-modal signals. Ji, Xuejie; Liao, Zhouzhou; Dong, Lanfang; Tang, Yingchao; Li, Guoming; Mao, Meng. Complex & Intelligent Systems, 2024, 10(05): 5951-5964.
  • [22] A review regarding the 3D facial animation pipeline. de Carvalho Cruz, Artur Tavares; Teixeira, Joao Marcelo. Proceedings of Symposium on Virtual and Augmented Reality (SVR 2021), 2021: 192-196.
  • [23] 3D facial animation based on texture mapping. Tseng, Din-Chang; Lu, Chang-Yang; Wei, Shu-Chen. Proceedings of the National Science Council, Republic of China, Part A: Physical Science and Engineering, 1996, 20(02).
  • [24] 3D facial modeling for animation: A nonlinear approach. Wang, Yushun; Zhuang, Yueting. Advances in Multimedia Modeling, Pt 1, 2007, 4351: 64-73.
  • [25] LBF based 3D regression for facial animation. Yan, Congquan; Wang, Liang-Hao; Li, Jianing; Li, Dong-Xiao; Zhang, Ming. 2016 International Conference on Virtual Reality and Visualization (ICVRV 2016), 2016: 276-279.
  • [26] A new method of 3D facial expression animation. Sun, Shuo; Ge, Chunbao. Journal of Applied Mathematics, 2014.
  • [27] Speech-driven facial animation using a hierarchical model. Cosker, DP; Marshall, AD; Rosin, PL; Hicks, YA. IEE Proceedings - Vision, Image and Signal Processing, 2004, 151(04): 314-321.
  • [28] Joint audio-text model for expressive speech-driven 3D facial animation. Fan, Yingruo; Lin, Zhaojiang; Saito, Jun; Wang, Wenping; Komura, Taku. Proceedings of the ACM on Computer Graphics and Interactive Techniques, 2022, 5(01).
  • [29] A muscle-based 3D parametric lip model for speech-synchronized facial animation. King, SA; Parent, RE; Olsafsky, BL. Deformable Avatars, 2001, 68: 12-23.
  • [30] Individual 3D face synthesis based on orthogonal photos and speech-driven facial animation. Shan, SG; Gao, W; Yan, J; Zhang, HM; Chen, XL. 2000 International Conference on Image Processing, Vol III, Proceedings, 2000: 238-241.