UniTalker: Scaling up Audio-Driven 3D Facial Animation Through A Unified Model

Citations: 0
Authors
Fan, Xiangyu [1 ]
Li, Jiaqi [1 ]
Lin, Zhiqian [1 ]
Xiao, Weiye [1 ]
Yang, Lei [1 ]
Affiliations
[1] SenseTime Research, Hong Kong, People's Republic of China
Keywords
Audio-driven; Facial animation; Unified Model;
DOI
10.1007/978-3-031-72940-9_12
CLC number
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Audio-driven 3D facial animation aims to map input audio to realistic facial motion. Despite significant progress, inconsistent 3D annotations have restricted previous models to training on specific annotation conventions, thereby constraining the training scale. In this work, we present UniTalker, a unified model featuring a multi-head architecture designed to effectively leverage datasets with varied annotations. To enhance training stability and ensure consistency among multi-head outputs, we employ three training strategies, namely PCA, model warm-up, and pivot identity embedding. To expand the training scale and diversity, we assemble A2F-Bench, comprising five publicly available datasets and three newly curated ones. These datasets span a wide range of audio domains, covering multilingual speech and songs, and scale the training data from the commonly employed scale of less than 1 h to 18.5 h. With a single trained UniTalker model, we achieve substantial lip vertex error reductions of 9.2% on the BIWI dataset and 13.7% on Vocaset. Additionally, the pre-trained UniTalker shows promise as a foundation model for audio-driven facial animation tasks. Fine-tuning the pre-trained UniTalker on seen datasets further enhances performance on each dataset, with an average error reduction of 6.3% on A2F-Bench. Moreover, fine-tuning UniTalker on an unseen dataset with only half the data surpasses prior state-of-the-art models trained on the full dataset. The code and dataset are available at the project page (https://github.com/X-niper/UniTalker).
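The abstract's central idea, a shared audio-to-motion trunk with one output head per annotation convention plus an identity embedding whose "pivot" entry is shared across datasets, can be sketched in a few lines of PyTorch. The module choices, dimensions, and dataset names below are illustrative assumptions, not the actual UniTalker implementation (see the repository linked above for the real one).

```python
# Minimal sketch of a multi-head audio-to-face model, assuming pre-extracted
# speech features. All names and sizes here are hypothetical.
import torch
import torch.nn as nn

class MultiHeadA2F(nn.Module):
    """Shared encoder + temporal decoder, with one lightweight output head
    per annotation convention (e.g. different mesh topologies)."""

    def __init__(self, audio_dim=768, hidden_dim=256,
                 head_dims=None, num_identities=10):
        super().__init__()
        # head_dims maps a dataset name to its (e.g. PCA-reduced) motion dim.
        head_dims = head_dims or {"biwi": 128, "vocaset": 128}
        self.encoder = nn.Linear(audio_dim, hidden_dim)  # stand-in for a speech encoder
        self.decoder = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        # One head per annotation type; all heads share the trunk above.
        self.heads = nn.ModuleDict(
            {name: nn.Linear(hidden_dim, dim) for name, dim in head_dims.items()}
        )
        # Identity embedding; index 0 could serve as the pivot identity that
        # every dataset is trained against, coupling the heads together.
        self.identity = nn.Embedding(num_identities, hidden_dim)

    def forward(self, audio_feats, dataset, identity_idx):
        # audio_feats: (batch, frames, audio_dim); identity_idx: (batch,)
        h = self.encoder(audio_feats) + self.identity(identity_idx).unsqueeze(1)
        h, _ = self.decoder(h)
        return self.heads[dataset](h)  # (batch, frames, head_dims[dataset])

model = MultiHeadA2F()
feats = torch.randn(2, 100, 768)        # 2 clips, 100 audio frames each
ids = torch.zeros(2, dtype=torch.long)  # both clips use the pivot identity
motion = model(feats, dataset="vocaset", identity_idx=ids)
print(motion.shape)  # torch.Size([2, 100, 128])
```

Routing each batch to the head matching its dataset is what lets one model train on heterogeneous annotations; the shared pivot identity gives the heads a common reference so their outputs stay mutually consistent.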
Pages: 204-221
Page count: 18
Related papers (50 records in total)
  • [21] 3D Facial Animation for Mobile Devices
    De Martino, Jose Mario
    Leite, Tatiane Silvia
    WSCG 2010: FULL PAPERS PROCEEDINGS, 2010, : 81 - 87
  • [22] Mimic: Speaking Style Disentanglement for Speech-Driven 3D Facial Animation
    Fu, Hui
    Wang, Zeqing
    Gong, Ke
    Wang, Keze
    Chen, Tianshui
    Li, Haojie
    Zeng, Haifeng
    Kang, Wenxiong
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 2, 2024, : 1770 - 1777
  • [23] Speech-Driven 3D Face Animation with Composite and Regional Facial Movements
    Wu, Haozhe
    Zhou, Songtao
    Jia, Jia
    Xing, Junliang
    Wen, Qi
    Wen, Xiang
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 6822 - 6830
  • [24] A comprehensive system for facial animation of generic 3D head models driven by speech
    Terissi, Lucas D.
    Cerda, Mauricio
    Gomez, Juan C.
    Hitschfeld-Kahler, Nancy
    Girau, Bernard
    EURASIP JOURNAL ON AUDIO SPEECH AND MUSIC PROCESSING, 2013,
  • [25] FaceDiffuser: Speech-Driven 3D Facial Animation Synthesis Using Diffusion
    Stan, Stefan
    Haque, Kazi Injamamul
    Yumak, Zerrin
    15TH ANNUAL ACM SIGGRAPH CONFERENCE ON MOTION, INTERACTION AND GAMES, MIG 2023, 2023,
  • [27] KMTalk: Speech-Driven 3D Facial Animation with Key Motion Embedding
    Xu, Zhihao
    Gong, Shengjie
    Tang, Jiapeng
    Liang, Lingyu
    Huang, Yining
    Li, Haojie
    Huang, Shuangping
    COMPUTER VISION - ECCV 2024, PT LVI, 2025, 15114 : 236 - 253
  • [28] CodeTalker: Speech-Driven 3D Facial Animation with Discrete Motion Prior
    Xing, Jinbo
    Xia, Menghan
    Zhang, Yuechen
    Cun, Xiaodong
    Wang, Jue
    Wong, Tien-Tsin
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 12780 - 12790
  • [29] Transformer Based Multi-model Fusion for 3D Facial Animation
    Chen, Benwang
    Luo, Chunshui
    Wang, Haoqian
    2023 2ND CONFERENCE ON FULLY ACTUATED SYSTEM THEORY AND APPLICATIONS, CFASTA, 2023, : 659 - 663
  • [30] Speech4Mesh: Speech-Assisted Monocular 3D Facial Reconstruction for Speech-Driven 3D Facial Animation
    He, Shan
    He, Haonan
    Yang, Shuo
    Wu, Xiaoyan
    Xia, Pengcheng
    Yin, Bing
    Liu, Cong
    Dai, Lirong
    Xu, Chang
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 14146 - 14156