In Defense of Image Pre-Training for Spatiotemporal Recognition

Cited by: 0
Authors
Li, Xianhang [1 ]
Wang, Huiyu [2 ]
Wei, Chen [2 ]
Mei, Jieru [2 ]
Yuille, Alan [2 ]
Zhou, Yuyin [1 ]
Xie, Cihang [1 ]
Affiliations
[1] Univ Calif Santa Cruz, Santa Cruz, CA 95064 USA
[2] Johns Hopkins Univ, Baltimore, MD USA
Source
Keywords
Video classification; ImageNet pre-training; 3D convolution networks
DOI
10.1007/978-3-031-19806-9_39
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Image pre-training, the current de-facto paradigm for a wide range of visual tasks, is generally less favored in the field of video recognition. By contrast, a common strategy is to directly train spatiotemporal convolutional neural networks (CNNs) from scratch. Nonetheless, interestingly, by taking a closer look at these from-scratch learned CNNs, we note that certain 3D kernels exhibit much stronger appearance-modeling ability than others, arguably suggesting that appearance information is already well disentangled during learning. Inspired by this observation, we hypothesize that the key to effectively leveraging image pre-training lies in decomposing the learning of spatial and temporal features, and in revisiting image pre-training as the appearance prior for initializing 3D kernels. In addition, we propose Spatial-Temporal Separable (STS) convolution, which explicitly splits the feature channels into spatial and temporal groups, to further enable a more thorough decomposition of spatiotemporal features when fine-tuning 3D CNNs. Our experiments show that simply replacing 3D convolution with STS notably improves a wide range of 3D CNNs, without increasing parameters or computation, on both Kinetics-400 and Something-Something V2. Moreover, this new training pipeline consistently achieves better results on video recognition with a significant speedup. For instance, we achieve a +0.6% top-1 gain for SlowFast on Kinetics-400 over the strong 256-epoch, 128-GPU baseline while fine-tuning for only 50 epochs with 4 GPUs. The code and models are available at https://github.com/UCSC-VLAA/Image-Pretraining-for-Video.
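The abstract's central architectural idea, the Spatial-Temporal Separable (STS) convolution, splits the feature channels of a 3D convolution into a spatial group and a temporal group. Below is a minimal PyTorch sketch of that idea; the STSConv3d class name, the 50/50 channel split, and the kernel sizes are illustrative assumptions rather than the authors' released implementation (see the linked repository for the official code).

import torch
import torch.nn as nn


class STSConv3d(nn.Module):
    """Sketch of a Spatial-Temporal Separable (STS) convolution.

    The input channels are split into a "spatial" group, convolved with a
    1 x k x k kernel (per-frame, appearance), and a "temporal" group,
    convolved with a k x 1 x 1 kernel (along time, motion); the two outputs
    are concatenated. The split ratio and kernel sizes here are illustrative.
    """

    def __init__(self, in_channels, out_channels, kernel_size=3, spatial_ratio=0.5):
        super().__init__()
        self.in_spatial = int(in_channels * spatial_ratio)
        self.in_temporal = in_channels - self.in_spatial
        out_spatial = int(out_channels * spatial_ratio)
        out_temporal = out_channels - out_spatial
        pad = kernel_size // 2
        # Spatial group: 2D-style kernel applied independently to each frame.
        self.spatial_conv = nn.Conv3d(
            self.in_spatial, out_spatial,
            kernel_size=(1, kernel_size, kernel_size),
            padding=(0, pad, pad), bias=False)
        # Temporal group: 1D kernel sliding over the time axis only.
        self.temporal_conv = nn.Conv3d(
            self.in_temporal, out_temporal,
            kernel_size=(kernel_size, 1, 1),
            padding=(pad, 0, 0), bias=False)

    def forward(self, x):
        # x has shape (batch, channels, time, height, width).
        xs, xt = torch.split(x, [self.in_spatial, self.in_temporal], dim=1)
        return torch.cat([self.spatial_conv(xs), self.temporal_conv(xt)], dim=1)


if __name__ == "__main__":
    # Dummy clip: 2 videos, 64 channels, 8 frames, 56 x 56 spatial resolution.
    x = torch.randn(2, 64, 8, 56, 56)
    print(STSConv3d(64, 64)(x).shape)  # torch.Size([2, 64, 8, 56, 56])

Because each group convolves only a subset of the channels, a layer like this can stand in for a full 3D convolution without adding parameters, which is consistent with the abstract's claim; the spatial branch would also be the natural place to load 2D kernels from an image pre-trained model, though the paper's exact initialization scheme is in the official code.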
Pages: 675-691
Number of pages: 17