In Defense of Image Pre-Training for Spatiotemporal Recognition

Cited by: 0
Authors
Li, Xianhang [1]
Wang, Huiyu [2]
Wei, Chen [2]
Mei, Jieru [2]
Yuille, Alan [2]
Zhou, Yuyin [1]
Xie, Cihang [1]
Affiliations
[1] Univ Calif Santa Cruz, Santa Cruz, CA 95064 USA
[2] Johns Hopkins Univ, Baltimore, MD USA
Source
COMPUTER VISION, ECCV 2022, PT XXV | 2022, Vol. 13685
Keywords
Video classification; ImageNet pre-training; 3D convolution networks
DOI
10.1007/978-3-031-19806-9_39
CLC Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Image pre-training, the current de facto paradigm for a wide range of visual tasks, is generally less favored in the field of video recognition, where the common strategy is instead to train spatiotemporal convolutional neural networks (CNNs) from scratch. Interestingly, however, a closer look at these from-scratch-trained CNNs reveals that certain 3D kernels exhibit much stronger appearance modeling ability than others, arguably suggesting that appearance information is already well disentangled during learning. Inspired by this observation, we hypothesize that the key to effectively leveraging image pre-training lies in decomposing the learning of spatial and temporal features, and in revisiting image pre-training as an appearance prior for initializing 3D kernels. In addition, we propose Spatial-Temporal Separable (STS) convolution, which explicitly splits the feature channels into spatial and temporal groups, to enable a more thorough decomposition of spatiotemporal features when fine-tuning 3D CNNs. Our experiments show that simply replacing 3D convolution with STS notably improves a wide range of 3D CNNs on both Kinetics-400 and Something-Something V2 without increasing parameters or computation. Moreover, this new training pipeline consistently achieves better video recognition results with significant speedup. For instance, we improve the top-1 accuracy of SlowFast on Kinetics-400 by 0.6% over the strong 256-epoch, 128-GPU baseline while fine-tuning for only 50 epochs on 4 GPUs. The code and models are available at https://github.com/UCSC-VLAA/Image-Pretraining-for-Video.
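Two ideas in the abstract lend themselves to a concrete illustration: (i) splitting feature channels into spatial and temporal groups (STS convolution) and (ii) using image-pre-trained 2D weights as an appearance prior for 3D kernels. The following PyTorch sketch is a minimal reading of the abstract only; the 50/50 split ratio, the 1 x k x k / k x 1 x 1 kernel shapes, and the initialization helper are illustrative assumptions, not the authors' exact design (the linked repository contains the reference implementation).

    import torch
    import torch.nn as nn

    class STSConv3d(nn.Module):
        """Channel-split convolution: a spatial group using 1 x k x k kernels
        and a temporal group using k x 1 x 1 kernels, concatenated back
        together. Illustrative sketch, not the paper's exact module."""

        def __init__(self, in_channels, out_channels, kernel_size=3, split=0.5):
            super().__init__()
            self.in_s = int(in_channels * split)  # channels routed to the spatial branch
            out_s = int(out_channels * split)
            pad = kernel_size // 2
            # Spatial branch: convolves within each frame (no temporal extent).
            self.spatial = nn.Conv3d(self.in_s, out_s,
                                     kernel_size=(1, kernel_size, kernel_size),
                                     padding=(0, pad, pad), bias=False)
            # Temporal branch: convolves across frames (no spatial extent).
            self.temporal = nn.Conv3d(in_channels - self.in_s, out_channels - out_s,
                                      kernel_size=(kernel_size, 1, 1),
                                      padding=(pad, 0, 0), bias=False)

        def forward(self, x):  # x: (N, C, T, H, W)
            xs, xt = x[:, :self.in_s], x[:, self.in_s:]
            return torch.cat([self.spatial(xs), self.temporal(xt)], dim=1)

    def load_appearance_prior(sts, conv2d_weight):
        """Copy an ImageNet-pre-trained 2D kernel of shape (out_s, in_s, k, k)
        into the spatial branch by inserting a singleton temporal dimension:
        one reading of 'image pre-training as an appearance prior for 3D
        kernels'. Hypothetical helper, not from the paper's codebase."""
        with torch.no_grad():
            sts.spatial.weight.copy_(conv2d_weight.unsqueeze(2))

Under these assumptions, STSConv3d(64, 64) maps a clip tensor of shape (2, 64, 8, 56, 56) to an output of the same shape, and the spatial branch's singleton temporal extent is what makes copying 2D ImageNet weights into it a direct reshape.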
Pages: 675-691
Number of pages: 17
Related Papers
50 items in total
  • [41] Fan, Zhiyun; Zhou, Shiyu; Xu, Bo. Two-Stage Pre-Training for Sequence to Sequence Speech Recognition. 2021 International Joint Conference on Neural Networks (IJCNN), 2021.
  • [42] Dong, Haoyu; Cheng, Zhoujun; He, Xinyi; Zhou, Mengyu; Zhou, Anda; Zhou, Fan; Liu, Ao; Han, Shi; Zhang, Dongmei. Table Pre-training: A Survey on Model Architectures, Pre-training Objectives, and Downstream Tasks. Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence (IJCAI 2022), 2022: 5426-5435.
  • [43] Oliveira, Marta; Wilming, Rick; Clark, Benedict; Budding, Celine; Eitel, Fabian; Ritter, Kerstin; Haufe, Stefan. Benchmarking the influence of pre-training on explanation performance in MR image classification. Frontiers in Artificial Intelligence, 2024, 7.
  • [44] Huang, Runhui; Long, Yanxin; Han, Jianhua; Xu, Hang; Liang, Xiwen; Xu, Chunjing; Liang, Xiaodan. NLIP: Noise-Robust Language-Image Pre-training. Thirty-Seventh AAAI Conference on Artificial Intelligence (AAAI 2023), Vol. 37, No. 1, 2023: 926-934.
  • [45] Lee, Janghyeon; Kim, Jongsuk; Shon, Hyounguk; Kim, Bumsoo; Kim, Seung Hwan; Lee, Honglak; Kim, Junmo. UniCLIP: Unified Framework for Contrastive Language-Image Pre-training. Advances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022.
  • [46] Han, Yufei; Chen, Haoyuan; Xu, Pin; Li, Yanyi; Li, Kuan; Yin, Jianping. Hybrid Pre-training Based on Masked Autoencoders for Medical Image Segmentation. Theoretical Computer Science (NCTCS 2022), 2022, 1693: 175-182.
  • [47] Wen, Zhoufutu; Zhao, Xinyu; Jin, Zhipeng; Yang, Yi; Jia, Wei; Chen, Xiaodong; Li, Shuanglong; Liu, Lin. Enhancing Dynamic Image Advertising with Vision-Language Pre-training. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2023), 2023: 3310-3314.
  • [48] Yang, Qiushi; Li, Wuyang; Li, Baopu; Yuan, Yixuan. MRM: Masked Relation Modeling for Medical Image Pre-Training with Genetics. 2023 IEEE/CVF International Conference on Computer Vision (ICCV 2023), 2023: 21395-21405.
  • [49] Rahal, Najoua; Vogtlin, Lars; Ingold, Rolf. Historical document image analysis using controlled data for pre-training. International Journal on Document Analysis and Recognition, 2023, 26(3): 241-254.
  • [50] Yang, Kaicheng; Deng, Jiankang; An, Xiang; Li, Jiawei; Feng, Ziyong; Guo, Jia; Yang, Jing; Liu, Tongliang. ALIP: Adaptive Language-Image Pre-training with Synthetic Caption. 2023 IEEE/CVF International Conference on Computer Vision (ICCV 2023), 2023: 2910-2919.