UniTalker: Scaling up Audio-Driven 3D Facial Animation Through A Unified Model

Citations: 0
Authors
Fan, Xiangyu [1 ]
Li, Jiaqi [1 ]
Lin, Zhiqian [1 ]
Xiao, Weiye [1 ]
Yang, Lei [1 ]
Affiliations
[1] SenseTime Research, Hong Kong, People's Republic of China
Keywords
Audio-driven; Facial animation; Unified Model;
DOI
10.1007/978-3-031-72940-9_12
CLC number
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Audio-driven 3D facial animation aims to map input audio to realistic facial motion. Despite significant progress, inconsistent 3D annotations have restricted previous models to training on specific annotation conventions, thereby constraining the training scale. In this work, we present UniTalker, a unified model featuring a multi-head architecture designed to effectively leverage datasets with varied annotations. To enhance training stability and ensure consistency among multi-head outputs, we employ three training strategies, namely PCA, model warm-up, and pivot identity embedding. To expand the training scale and diversity, we assemble A2F-Bench, comprising five publicly available datasets and three newly curated ones. These datasets span a wide range of audio domains, covering multilingual speech and songs, and scale the training data from the commonly employed scale of less than 1 h to 18.5 h. With a single trained UniTalker model, we achieve substantial lip vertex error reductions of 9.2% on the BIWI dataset and 13.7% on Vocaset. Additionally, the pre-trained UniTalker shows promise as a foundation model for audio-driven facial animation tasks. Fine-tuning the pre-trained UniTalker on seen datasets further enhances performance on each dataset, with an average error reduction of 6.3% on A2F-Bench. Moreover, fine-tuning UniTalker on an unseen dataset with only half the data surpasses prior state-of-the-art models trained on the full dataset. The code and dataset are available at the project page (https://github.com/X-niper/UniTalker).
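The abstract's central idea, a shared audio-to-motion trunk with one output head per annotation convention plus an identity embedding whose "pivot" entry is shared across datasets, can be sketched in a few lines of PyTorch. The module choices, dimensions, and dataset names below are illustrative assumptions, not the actual UniTalker implementation (see the repository linked above for the real one).

```python
# Minimal sketch of a multi-head audio-to-face model, assuming pre-extracted
# speech features. All names and sizes here are hypothetical.
import torch
import torch.nn as nn

class MultiHeadA2F(nn.Module):
    """Shared encoder + temporal decoder, with one lightweight output head
    per annotation convention (e.g. different mesh topologies)."""

    def __init__(self, audio_dim=768, hidden_dim=256,
                 head_dims=None, num_identities=10):
        super().__init__()
        # head_dims maps a dataset name to its (e.g. PCA-reduced) motion dim.
        head_dims = head_dims or {"biwi": 128, "vocaset": 128}
        self.encoder = nn.Linear(audio_dim, hidden_dim)  # stand-in for a speech encoder
        self.decoder = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        # One head per annotation type; all heads share the trunk above.
        self.heads = nn.ModuleDict(
            {name: nn.Linear(hidden_dim, dim) for name, dim in head_dims.items()}
        )
        # Identity embedding; index 0 could serve as the pivot identity that
        # every dataset is trained against, coupling the heads together.
        self.identity = nn.Embedding(num_identities, hidden_dim)

    def forward(self, audio_feats, dataset, identity_idx):
        # audio_feats: (batch, frames, audio_dim); identity_idx: (batch,)
        h = self.encoder(audio_feats) + self.identity(identity_idx).unsqueeze(1)
        h, _ = self.decoder(h)
        return self.heads[dataset](h)  # (batch, frames, head_dims[dataset])

model = MultiHeadA2F()
feats = torch.randn(2, 100, 768)        # 2 clips, 100 audio frames each
ids = torch.zeros(2, dtype=torch.long)  # both clips use the pivot identity
motion = model(feats, dataset="vocaset", identity_idx=ids)
print(motion.shape)  # torch.Size([2, 100, 128])
```

Routing each batch to the head matching its dataset is what lets one model train on heterogeneous annotations; the shared pivot identity gives the heads a common reference so their outputs stay mutually consistent.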
Pages: 204-221
Page count: 18
Related papers (50 records in total)
  • [21] 3D Facial Animation for Mobile Devices
    De Martino, Jose Mario
    Leite, Tatiane Silvia
    WSCG 2010: FULL PAPERS PROCEEDINGS, 2010, : 81 - 87
  • [22] Mimic: Speaking Style Disentanglement for Speech-Driven 3D Facial Animation
    Fu, Hui
    Wang, Zeqing
    Gong, Ke
    Wang, Keze
    Chen, Tianshui
    Li, Haojie
    Zeng, Haifeng
    Kang, Wenxiong
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 2, 2024, : 1770 - 1777
  • [23] Speech-Driven 3D Face Animation with Composite and Regional Facial Movements
    Wu, Haozhe
    Zhou, Songtao
    Jia, Jia
    Xing, Junliang
    Wen, Qi
    Wen, Xiang
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 6822 - 6830
  • [24] A comprehensive system for facial animation of generic 3D head models driven by speech
    Terissi, Lucas D.
    Cerda, Mauricio
    Gomez, Juan C.
    Hitschfeld-Kahler, Nancy
    Girau, Bernard
    EURASIP JOURNAL ON AUDIO SPEECH AND MUSIC PROCESSING, 2013,
  • [25] FaceDiffuser: Speech-Driven 3D Facial Animation Synthesis Using Diffusion
    Stan, Stefan
    Haque, Kazi Injamamul
    Yumak, Zerrin
    15TH ANNUAL ACM SIGGRAPH CONFERENCE ON MOTION, INTERACTION AND GAMES, MIG 2023, 2023,
  • [27] KMTalk: Speech-Driven 3D Facial Animation with Key Motion Embedding
    Xu, Zhihao
    Gong, Shengjie
    Tang, Jiapeng
    Liang, Lingyu
    Huang, Yining
    Li, Haojie
    Huang, Shuangping
    COMPUTER VISION - ECCV 2024, PT LVI, 2025, 15114 : 236 - 253
  • [28] CodeTalker: Speech-Driven 3D Facial Animation with Discrete Motion Prior
    Xing, Jinbo
    Xia, Menghan
    Zhang, Yuechen
    Cun, Xiaodong
    Wang, Jue
    Wong, Tien-Tsin
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 12780 - 12790
  • [29] Transformer Based Multi-model Fusion for 3D Facial Animation
    Chen, Benwang
    Luo, Chunshui
    Wang, Haoqian
    2023 2ND CONFERENCE ON FULLY ACTUATED SYSTEM THEORY AND APPLICATIONS, CFASTA, 2023, : 659 - 663
  • [30] Speech4Mesh: Speech-Assisted Monocular 3D Facial Reconstruction for Speech-Driven 3D Facial Animation
    He, Shan
    He, Haonan
    Yang, Shuo
    Wu, Xiaoyan
    Xia, Pengcheng
    Yin, Bing
    Liu, Cong
    Dai, Lirong
    Xu, Chang
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 14146 - 14156