MLUG: Bootstrapping Language-Motion Pre-Training for Unified Motion-Language Understanding and Generation

被引:0
|
作者
Luo, Hongliang [1 ]
Xi, Wei [1 ]
Tang, Daniel [2 ]
机构
[1] Xi An Jiao Tong Univ, Sch Comp Sci & Technol, Xian 710049, Peoples R China
[2] Mind Bridge AI Ltd, Ottawa, ON K1S 5R5, Canada
关键词
motion generation; language motion; unified models;
D O I
10.3390/s24227354
中图分类号
O65 [分析化学];
学科分类号
070302 ; 081704 ;
摘要
In the realm of computer vision and animation, the generation of human motion from textual descriptions represents a frontier of significant challenge and potential. This paper introduces MLUG, a groundbreaking framework poised to transform motion synthesis by harnessing the power of vision-language pre-training techniques. MLUG addresses the nuanced challenge of creating semantically rich, physically plausible, and emotionally expressive human motions through a novel integration of a unimodal encoder with motion-text contrastive loss, a motion-grounded text encoder, a motion-grounded motion decoder, and a motion length predictor. These components work in concert to align textual descriptions with dynamic motion sequences, offering an innovative solution to the limitations of existing models in open-vocabulary motion generation and emotional expressiveness. Through extensive evaluations, MLUG demonstrates unparalleled effectiveness in generating realistic and diverse motions from a broad spectrum of textual inputs, setting a new benchmark in the field.
引用
收藏
页数:13
相关论文
共 50 条
  • [31] VLMO: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts
    Bao, Hangbo
    Wang, Wenhui
    Dong, Li
    Liu, Qiang
    Mohammed, Owais Khan
    Aggarwal, Kriti
    Som, Subhojit
    Piao, Songhao
    Wei, Furu
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022,
  • [32] Pre-training for Spoken Language Understanding with Joint Textual and Phonetic Representation Learning
    Chen, Qian
    Wang, Wen
    Zhang, Qinglin
    INTERSPEECH 2021, 2021, : 1244 - 1248
  • [33] Survey on Vision-language Pre-training
    Yin J.
    Zhang Z.-D.
    Gao Y.-H.
    Yang Z.-W.
    Li L.
    Xiao M.
    Sun Y.-Q.
    Yan C.-G.
    Ruan Jian Xue Bao/Journal of Software, 2023, 34 (05): : 2000 - 2023
  • [34] Speech Model Pre-training for End-to-End Spoken Language Understanding
    Lugosch, Loren
    Ravanelli, Mirco
    Ignoto, Patrick
    Tomar, Vikrant Singh
    Bengio, Yoshua
    INTERSPEECH 2019, 2019, : 814 - 818
  • [35] Pre-training Language Models for Comparative Reasoning
    Yu, Mengxia
    Zhang, Zhihan
    Yu, Wenhao
    Jiang, Meng
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2023), 2023, : 12421 - 12433
  • [36] Sigmoid Loss for Language Image Pre-Training
    Zhai, Xiaohua
    Mustafa, Basil
    Kolesnikov, Alexander
    Beyer, Lucas
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 11941 - 11952
  • [37] MarkupLM: Pre-training of Text and Markup Language for Visually Rich Document Understanding
    Li, Junlong
    Xu, Yiheng
    Cui, Lei
    Wei, Furu
    PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), VOL 1: (LONG PAPERS), 2022, : 6078 - 6087
  • [38] Grounded Language-Image Pre-training
    Li, Liunian Harold
    Zhang, Pengchuan
    Zhang, Haotian
    Yang, Jianwei
    Li, Chunyuan
    Zhong, Yiwu
    Wang, Lijuan
    Yuan, Lu
    Zhang, Lei
    Hwang, Jenq-Neng
    Chang, Kai-Wei
    Gao, Jianfeng
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2022, : 10955 - 10965
  • [39] VILA: On Pre-training for Visual Language Models
    Lin, Ji
    Yin, Hongxu
    Ping, Wei
    Molchanov, Pavlo
    Shoeybi, Mohammad
    Han, Song
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024, : 26679 - 26689
  • [40] RELATION ENHANCED VISION LANGUAGE PRE-TRAINING
    Lee, Ju-Hee
    Kang, Je-Won
    2022 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2022, : 2286 - 2290