UniTalker: Scaling up Audio-Driven 3D Facial Animation Through A Unified Model

Citations: 0
Authors
Fan, Xiangyu [1 ]
Li, Jiaqi [1 ]
Lin, Zhiqian [1 ]
Xiao, Weiye [1 ]
Yang, Lei [1 ]
Affiliations
[1] SenseTime Research, Hong Kong, People's Republic of China
Keywords
Audio-driven; Facial animation; Unified Model
DOI
10.1007/978-3-031-72940-9_12
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Audio-driven 3D facial animation aims to map input audio to realistic facial motion. Despite significant progress, limitations arise from inconsistent 3D annotations, which restrict previous models to training on specific annotations and thereby constrain the training scale. In this work, we present UniTalker, a unified model featuring a multi-head architecture designed to effectively leverage datasets with varied annotations. To enhance training stability and ensure consistency among multi-head outputs, we employ three training strategies: PCA, model warm-up, and pivot identity embedding. To expand the training scale and diversity, we assemble A2F-Bench, comprising five publicly available datasets and three newly curated datasets. These datasets span a wide range of audio domains, covering multilingual speech and songs, and scale the training data from the less than 1 hour typical of commonly employed datasets to 18.5 hours. With a single trained UniTalker model, we achieve substantial lip vertex error reductions of 9.2% on the BIWI dataset and 13.7% on VOCASET. The pre-trained UniTalker also shows promise as a foundation model for audio-driven facial animation tasks: fine-tuning it on seen datasets further improves performance on each, with an average error reduction of 6.3% on A2F-Bench, and fine-tuning it on an unseen dataset with only half the data surpasses prior state-of-the-art models trained on the full dataset. The code and dataset are available at the project page: https://github.com/X-niper/UniTalker.
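To make the abstract's multi-head idea concrete, the minimal PyTorch sketch below shows how a shared trunk could route to per-dataset output heads that predict PCA coefficients of each annotation convention's motion space; heads for meshes with different vertex counts (e.g. BIWI vs. VOCASET) share one audio trunk, which is what lets a single model train across mixed annotations. This is a sketch under stated assumptions, not the paper's implementation: the class name, dimensions, and the random PCA bases are all illustrative.

```python
# Minimal sketch of a multi-head decoder in the spirit of UniTalker's
# abstract; NOT the authors' code. Dimensions, names, and the random
# PCA bases below are illustrative assumptions.
import torch
import torch.nn as nn

class MultiHeadFaceDecoder(nn.Module):
    def __init__(self, hidden_dim=512, n_pca=256, vertex_dims=None):
        super().__init__()
        # One head per annotation convention; meshes differ in vertex count
        # (e.g. BIWI has 23370 vertices, VOCASET/FLAME has 5023).
        vertex_dims = vertex_dims or {"biwi": 23370 * 3, "vocaset": 5023 * 3}
        self.trunk = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=8,
                                       batch_first=True),
            num_layers=4,
        )
        # Heads predict low-dimensional PCA coefficients rather than raw
        # vertices; PCA is one of the stabilization strategies the
        # abstract names.
        self.heads = nn.ModuleDict(
            {name: nn.Linear(hidden_dim, n_pca) for name in vertex_dims}
        )
        # Fixed per-dataset PCA bases mapping coefficients back to vertex
        # offsets (random here; in practice fit on each dataset's meshes).
        self.bases = nn.ParameterDict({
            name: nn.Parameter(torch.randn(n_pca, dim) * 0.01,
                               requires_grad=False)
            for name, dim in vertex_dims.items()
        })

    def forward(self, audio_feats, dataset):
        # audio_feats: (batch, frames, hidden_dim) from a pretrained
        # audio encoder such as wav2vec 2.0.
        h = self.trunk(audio_feats)
        coeffs = self.heads[dataset](h)        # (batch, frames, n_pca)
        return coeffs @ self.bases[dataset]    # (batch, frames, verts * 3)

feats = torch.randn(2, 100, 512)                    # 100 audio frames
verts = MultiHeadFaceDecoder()(feats, "vocaset")    # (2, 100, 5023 * 3)
```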
Pages: 204-221 (18 pages)
Related Papers (50 records in total; first 10 shown below)
  • [1] Liu, Chang; Lin, Qunfen; Zeng, Zijiao; Pan, Ye. EmoFace: Audio-driven Emotional 3D Face Animation. 2024 IEEE Conference on Virtual Reality and 3D User Interfaces (VR 2024), 2024: 387-397.
  • [2] Chai, Yujin; Shao, Tianjia; Weng, Yanlin; Zhou, Kun. Personalized Audio-Driven 3D Facial Animation via Style-Content Disentanglement. IEEE Transactions on Visualization and Computer Graphics, 2024, 30(3): 1803-1820.
  • [3] Jiang, Diqiong; Chang, Jian; You, Lihua; Bian, Shaojun; Kosk, Robert; Maguire, Greg. Audio-Driven Facial Animation with Deep Learning: A Survey. Information, 2024, 15(11).
  • [4] Kim, Youngsoo; An, Shounan; Jo, Youngbak; Park, Seungje; Kang, Shindong; Oh, Insoo; Kim, Duke Donghyun. Multi-Task Audio-Driven Facial Animation. SIGGRAPH '19: ACM SIGGRAPH 2019 Posters, 2019.
  • [5] Wei, Mingzhu; Adamo, Nicoletta; Giri, Nandhini; Chen, Yingjie. A Comparative Study of Four 3D Facial Animation Methods: Skeleton, Blendshape, Audio-Driven, and Vision-Based Capture. ArtsIT, Interactivity and Game Creation (ArtsIT 2022), 2023, 479: 36-50.
  • [6] Huang, Ricong; Zhong, Weizhi; Li, Guanbin. Audio-driven Talking Head Generation with Transformer and 3D Morphable Model. Proceedings of the 30th ACM International Conference on Multimedia (MM 2022), 2022: 7035-7039.
  • [7] Fan, Yingruo; Lin, Zhaojiang; Saito, Jun; Wang, Wenping; Komura, Taku. Joint Audio-Text Model for Expressive Speech-Driven 3D Facial Animation. Proceedings of the ACM on Computer Graphics and Interactive Techniques, 2022, 5(1).
  • [8] Ma, Le; Ma, Zhihao; Meng, Weiliang; Xu, Shibiao; Zhang, Xiaopeng. Audio-Driven Lips and Expression on 3D Human Face. Advances in Computer Graphics (CGI 2023), Part II, 2024, 14496: 15-26.
  • [9] Karras, Tero; Aila, Timo; Laine, Samuli; Herva, Antti; Lehtinen, Jaakko. Audio-Driven Facial Animation by Joint End-to-End Learning of Pose and Emotion. ACM Transactions on Graphics, 2017, 36(4).
  • [10] Zhang, Wenxuan; Cun, Xiaodong; Wang, Xuan; Zhang, Yong; Shen, Xi; Guo, Yu; Shan, Ying; Wang, Fei. SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023: 8652-8661.