Disentangling Spatial and Temporal Learning for Efficient Image-to-Video Transfer Learning

Cited by: 7
Authors
Qing, Zhiwu [1 ]
Zhang, Shiwei [2 ]
Huang, Ziyuan [3 ]
Zhang, Yingya [2 ]
Gao, Changxin [1 ]
Zhao, Deli [2 ]
Sang, Nong [1 ]
Affiliations
[1] Huazhong Univ Sci & Technol, Sch Artificial Intelligence & Automat, Key Lab Image Proc & Intelligent Control, Wuhan, Peoples R China
[2] Alibaba Grp, Hangzhou, Zhejiang, Peoples R China
[3] Natl Univ Singapore, ARC, Singapore, Singapore
Funding
National Natural Science Foundation of China;
Keywords
DOI
10.1109/ICCV51070.2023.01281
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Recently, large-scale pre-trained language-image models like CLIP have shown extraordinary capabilities for understanding spatial contents, but naively transferring such models to video recognition still suffers from unsatisfactory temporal modeling capabilities. Existing methods insert tunable structures into or in parallel with the pre-trained model, which either requires back-propagation through the whole pre-trained model and is thus resource-demanding, or is limited by the temporal reasoning capability of the pre-trained structure. In this work, we present DiST, which disentangles the learning of spatial and temporal aspects of videos. Specifically, DiST uses a dual-encoder structure, where a pre-trained foundation model acts as the spatial encoder and a lightweight network is introduced as the temporal encoder. An integration branch is inserted between the encoders to fuse spatio-temporal information. The disentangled spatial and temporal learning in DiST is highly efficient because it avoids back-propagation through the massive pre-trained parameters. Meanwhile, we empirically show that disentangled learning with an extra network for integration benefits both spatial and temporal understanding. Extensive experiments on five benchmarks show that DiST outperforms existing state-of-the-art methods by convincing margins. When pre-training on the large-scale Kinetics-710, we achieve 89.7% on Kinetics-400 with a frozen ViT-L model, which verifies the scalability of DiST. Code and models are available at https://github.com/alibaba-mmai-research/DiST.
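The efficiency argument in the abstract rests on one mechanism: gradients update only the lightweight temporal encoder and the integration branch, while the large pre-trained spatial encoder stays frozen. The following is a minimal pure-Python sketch of that idea; the class names, toy scalar "parameters", and fusion-by-summation are all illustrative assumptions, not taken from the DiST codebase.

```python
# Sketch of disentangled spatial/temporal learning: a frozen spatial encoder,
# a trainable temporal encoder, and a trainable integration branch. Only the
# trainable parts receive (toy) gradient updates.

class Encoder:
    def __init__(self, weight: float, trainable: bool):
        self.weight = weight        # a single scalar stands in for all parameters
        self.trainable = trainable

    def forward(self, x: float) -> float:
        return self.weight * x

    def update(self, grad: float, lr: float = 0.1) -> None:
        # A frozen encoder ignores updates entirely; this is what makes the
        # scheme cheap: no back-propagation through the large pre-trained model.
        if self.trainable:
            self.weight -= lr * grad


spatial = Encoder(weight=2.0, trainable=False)      # frozen CLIP-like backbone
temporal = Encoder(weight=0.5, trainable=True)      # lightweight temporal net
integration = Encoder(weight=1.0, trainable=True)   # fusion branch

frames = [1.0, 2.0, 3.0]                            # a toy 3-frame "clip"
spatial_feats = [spatial.forward(f) for f in frames]
temporal_feat = temporal.forward(sum(frames))
fused = integration.forward(sum(spatial_feats) + temporal_feat)

# One fake optimization step: only the trainable components change.
for enc in (spatial, temporal, integration):
    enc.update(grad=1.0)

print(spatial.weight, temporal.weight, integration.weight)  # 2.0 0.4 0.9
```

In a real implementation the same effect would come from excluding the pre-trained backbone's parameters from the optimizer (e.g. setting `requires_grad=False` in PyTorch), so its activations are computed in a no-gradient forward pass.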
Pages: 13888-13898
Page count: 11
Related Papers
50 records in total
  • [1] R2-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding
    Liu, Ye
    He, Jixuan
    Li, Wanhua
    Kim, Junsik
    Wei, Donglai
    Pfister, Hanspeter
    Chen, Chang Wen
    COMPUTER VISION - ECCV 2024, PT XLI, 2025, 15099 : 421 - 438
  • [2] ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning
    Pan, Junting
    Lin, Ziyi
    Zhu, Xiatian
    Shao, Jing
    Li, Hongsheng
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35, NEURIPS 2022, 2022,
  • [3] TRIP: Temporal Residual Learning with Image Noise Prior for Image-to-Video Diffusion Models
    Zhang, Zhongwei
    Long, Fuchen
    Pan, Yingwei
    Qiu, Zhaofan
    Yao, Ting
    Cao, Yang
    Mei, Tao
arXiv (preprint)
  • [4] TRIP: Temporal Residual Learning with Image Noise Prior for Image-to-Video Diffusion Models
    Zhang, Zhongwei
    Long, Fuchen
    Pan, Yingwei
    Qiu, Zhaofan
    Yao, Ting
    Cao, Yang
    Mei, Tao
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2024, 2024, : 8671 - 8681
  • [5] Spatial-temporal Causal Inference for Partial Image-to-video Adaptation
    Chen, Jin
    Wu, Xinxiao
    Hu, Yao
    Luo, Jiebo
    THIRTY-FIFTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THIRTY-THIRD CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE AND THE ELEVENTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2021, 35 : 1027 - 1035
  • [6] Activity Image-to-Video Retrieval by Disentangling Appearance and Motion
    Liu, Liu
    Li, Jiangtong
    Niu, Li
    Xu, Ruicong
    Zhang, Liqing
    THIRTY-FIFTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THIRTY-THIRD CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE AND THE ELEVENTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2021, 35 : 2145 - 2153
  • [7] Temporal Complementarity-Guided Reinforcement Learning for Image-to-Video Person Re-Identification
    Wu, Wei
    Liu, Jiawei
    Zheng, Kecheng
    Sun, Qibin
    Zha, Zheng-Jun
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2022, : 7309 - 7318
  • [8] Learning to Forecast and Refine Residual Motion for Image-to-Video Generation
    Zhao, Long
    Peng, Xi
    Tian, Yu
    Kapadia, Mubbasir
    Metaxas, Dimitris
    COMPUTER VISION - ECCV 2018, PT 15, 2018, 11219 : 403 - 419
  • [9] Activity Image-to-Video Retrieval via Domain Adversarial Learning
    Liu, Yubin
    Yang, Jinfu
    Yan, Xue
    Song, Lin
    2022 34TH CHINESE CONTROL AND DECISION CONFERENCE, CCDC, 2022, : 6183 - 6188
  • [10] Unsupervised Image-to-Video Clothing Transfer
    Pumarola, A.
    Goswami, V.
    Vicente, F.
    De la Torre, F.
    Moreno-Noguer, F.
    2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW), 2019, : 3181 - 3184