Disentangling Spatial and Temporal Learning for Efficient Image-to-Video Transfer Learning

Cited by: 7
Authors
Qing, Zhiwu [1 ]
Zhang, Shiwei [2 ]
Huang, Ziyuan [3 ]
Zhang, Yingya [2 ]
Gao, Changxin [1 ]
Zhao, Deli [2 ]
Sang, Nong [1 ]
Affiliations
[1] Huazhong Univ Sci & Technol, Sch Artificial Intelligence & Automat, Key Lab Image Proc & Intelligent Control, Wuhan, Peoples R China
[2] Alibaba Grp, Hangzhou, Zhejiang, Peoples R China
[3] Natl Univ Singapore, ARC, Singapore, Singapore
Funding
National Natural Science Foundation of China;
Keywords
DOI
10.1109/ICCV51070.2023.01281
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Recently, large-scale pre-trained language-image models such as CLIP have shown extraordinary capabilities for understanding spatial content, but naively transferring such models to video recognition still suffers from unsatisfactory temporal modeling. Existing methods insert tunable structures into or in parallel with the pre-trained model, which either requires back-propagation through the whole pre-trained model and is thus resource-demanding, or is limited by the temporal reasoning capability of the pre-trained structure. In this work, we present DiST, which disentangles the learning of the spatial and temporal aspects of videos. Specifically, DiST uses a dual-encoder structure, where a pre-trained foundation model acts as the spatial encoder and a lightweight network is introduced as the temporal encoder. An integration branch is inserted between the encoders to fuse spatio-temporal information. The disentangled spatial and temporal learning in DiST is highly efficient because it avoids back-propagation through the massive pre-trained parameters. Meanwhile, we empirically show that disentangled learning with an extra integration network benefits both spatial and temporal understanding. Extensive experiments on five benchmarks show that DiST outperforms existing state-of-the-art methods by convincing margins. When pre-training on the large-scale Kinetics-710, we achieve 89.7% on Kinetics-400 with a frozen ViT-L model, which verifies the scalability of DiST. Code and models are available at https://github.com/alibaba-mmai-research/DiST.
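The abstract describes a dual-encoder design: a frozen pre-trained spatial encoder, a lightweight trainable temporal encoder, and an integration branch that fuses the two streams before classification. The sketch below illustrates that structure in PyTorch; module names, dimensions, and the fusion rule are illustrative assumptions and not the authors' released implementation (see the linked repository for the actual code).

```python
# A minimal, hypothetical sketch of the disentangled dual-encoder idea:
# the spatial encoder is frozen, only the temporal encoder, integration
# branch, and classifier receive gradients.
import torch
import torch.nn as nn


class DiSTSketch(nn.Module):
    def __init__(self, spatial_encoder: nn.Module, spatial_dim=768,
                 temporal_dim=256, num_classes=400):
        super().__init__()
        # Frozen CLIP-like image backbone (assumed to map a batch of frames
        # to per-frame feature vectors); no gradients flow through it.
        self.spatial_encoder = spatial_encoder
        for p in self.spatial_encoder.parameters():
            p.requires_grad = False

        # Lightweight temporal encoder across frames (assumed here to be a
        # small stack of temporal 1-D convolutions over frame features).
        self.temporal_encoder = nn.Sequential(
            nn.Conv1d(spatial_dim, temporal_dim, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv1d(temporal_dim, temporal_dim, kernel_size=3, padding=1),
        )

        # Integration branch fusing spatial and temporal features.
        self.integration = nn.Linear(spatial_dim + temporal_dim, temporal_dim)
        self.classifier = nn.Linear(temporal_dim, num_classes)

    def forward(self, video):                       # video: (B, T, C, H, W)
        b, t = video.shape[:2]
        frames = video.flatten(0, 1)                # (B*T, C, H, W)
        with torch.no_grad():                       # spatial stream is frozen
            spatial = self.spatial_encoder(frames)  # (B*T, spatial_dim)
        spatial = spatial.view(b, t, -1)            # (B, T, spatial_dim)

        temporal = self.temporal_encoder(spatial.transpose(1, 2))
        temporal = temporal.transpose(1, 2)         # (B, T, temporal_dim)

        fused = self.integration(torch.cat([spatial, temporal], dim=-1))
        return self.classifier(fused.mean(dim=1))   # temporal average pooling
```

Because the pre-trained backbone is wrapped in `torch.no_grad()` and its parameters are frozen, back-propagation only touches the small temporal and integration modules, which mirrors the efficiency argument made in the abstract.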
Pages: 13888 - 13898
Page count: 11
Related Papers
50 records in total
  • [21] iHair Recolorer: deep image-to-video hair color transfer
    Wu, Keyu
    Yang, Lingchen
    Fu, Hongbo
    Zheng, Youyi
    SCIENCE CHINA-INFORMATION SCIENCES, 2021, 64 (11) : 52 - 66
  • [22] Learned Video Compression With Efficient Temporal Context Learning
    Jin, Dengchao
    Lei, Jianjun
    Peng, Bo
    Pan, Zhaoqing
    Li, Li
    Ling, Nam
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2023, 32 : 3188 - 3198
  • [23] Temporal Knowledge Propagation for Image-to-Video Person Re-identification
    Gu, Xinqian
    Ma, Bingpeng
    Chang, Hong
    Shan, Shiguang
    Chen, Xilin
    2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, : 9646 - 9655
  • [24] Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring
    Liu, Ruyang
    Huang, Jingjia
    Li, Ge
    Feng, Jiashi
    Wu, Xinglong
    Li, Thomas H.
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR, 2023, : 6555 - 6564
  • [25] Video Inpainting by Jointly Learning Temporal Structure and Spatial Details
    Wang, Chuan
    Huang, Haibin
    Han, Xiaoguang
    Wang, Jue
    THIRTY-THIRD AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FIRST INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE / NINTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2019, : 5232 - 5239
  • [26] Temporal Aggregation of Visual Features for Large-Scale Image-to-Video Retrieval
    Garcia, Noa
    ICMR '18: PROCEEDINGS OF THE 2018 ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, 2018, : 489 - 492
  • [27] Taskonomy: Disentangling Task Transfer Learning
    Zamir, Amir R.
    Sax, Alexander
    Shen, William
    Guibas, Leonidas
    Malik, Jitendra
    Savarese, Silvio
    2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, : 3712 - 3722
  • [28] Disentangling Transfer in Continual Reinforcement Learning
    Wolczyk, Maciej
    Zajac, Michal
    Pascanu, Razvan
    Kucinski, Lukasz
    Milos, Piotr
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022,
  • [29] Taskonomy: Disentangling Task Transfer Learning
    Zamir, Amir
    Sax, Alexander
    Shen, William
    Guibas, Leonidas
    Malik, Jitendra
    Savarese, Silvio
    PROCEEDINGS OF THE TWENTY-EIGHTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2019, : 6241 - 6245
  • [30] Dual-MTGAN: Stochastic and Deterministic Motion Transfer for Image-to-Video Synthesis
    Yang, Fu-En
    Chang, Jing-Cheng
    Lee, Yuan-Hao
    Wang, Yu-Chiang Frank
    2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2021, : 6764 - 6771