In Defense of Image Pre-Training for Spatiotemporal Recognition

Cited by: 0
Authors:
Li, Xianhang [1 ]
Wang, Huiyu [2 ]
Wei, Chen [2 ]
Mei, Jieru [2 ]
Yuille, Alan [2 ]
Zhou, Yuyin [1 ]
Xie, Cihang [1 ]
Affiliations:
[1] Univ Calif Santa Cruz, Santa Cruz, CA 95064 USA
[2] Johns Hopkins Univ, Baltimore, MD USA
Keywords: Video classification; ImageNet pre-training; 3D convolutional networks
DOI: 10.1007/978-3-031-19806-9_39
Chinese Library Classification: TP18 [Artificial Intelligence Theory]
Discipline Codes: 081104; 0812; 0835; 1405
Abstract:
Image pre-training, the current de facto paradigm for a wide range of visual tasks, is generally less favored in video recognition, where a common strategy is instead to train spatiotemporal convolutional neural networks (CNNs) from scratch. Interestingly, a closer look at these from-scratch-trained CNNs reveals that certain 3D kernels exhibit much stronger appearance-modeling ability than others, suggesting that appearance information is already well disentangled during learning. Inspired by this observation, we hypothesize that the key to effectively leveraging image pre-training lies in decomposing the learning of spatial and temporal features, and in revisiting image pre-training as an appearance prior for initializing 3D kernels. In addition, we propose Spatial-Temporal Separable (STS) convolution, which explicitly splits the feature channels into spatial and temporal groups, to enable a more thorough decomposition of spatiotemporal features when fine-tuning 3D CNNs. Our experiments show that simply replacing 3D convolution with STS notably improves a wide range of 3D CNNs on both Kinetics-400 and Something-Something V2, without increasing parameters or computation. Moreover, this new training pipeline consistently achieves better video-recognition results with a significant speedup. For instance, on Kinetics-400 we improve the top-1 accuracy of SlowFast by 0.6% over the strong 256-epoch, 128-GPU baseline, while fine-tuning for only 50 epochs on 4 GPUs. The code and models are available at https://github.com/UCSC-VLAA/Image-Pretraining-for-Video.
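The abstract describes STS convolution only at a high level, so the following is a minimal PyTorch sketch of one plausible reading: the input channels are split into a spatial group, convolved with a 1 x k x k kernel, and a temporal group, convolved with a k x 1 x 1 kernel, with the two outputs concatenated. The module name STSConv3d, the even channel split, and the kernel shapes are illustrative assumptions, not the authors' implementation; the exact design is in the repository linked above.

import torch
import torch.nn as nn

class STSConv3d(nn.Module):
    """Hypothetical Spatial-Temporal Separable convolution (sketch only)."""

    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1):
        super().__init__()
        # Assumption: an even split between spatial and temporal groups.
        self.c_spatial = in_channels // 2
        self.c_temporal = in_channels - self.c_spatial
        pad = kernel_size // 2
        # Spatial group: 1 x k x k kernel, i.e. appearance modeling only.
        self.spatial = nn.Conv3d(
            self.c_spatial, out_channels // 2,
            kernel_size=(1, kernel_size, kernel_size),
            stride=(1, stride, stride), padding=(0, pad, pad), bias=False)
        # Temporal group: k x 1 x 1 kernel, i.e. motion modeling only.
        self.temporal = nn.Conv3d(
            self.c_temporal, out_channels - out_channels // 2,
            kernel_size=(kernel_size, 1, 1),
            stride=(1, stride, stride), padding=(pad, 0, 0), bias=False)

    def forward(self, x):
        # x: (batch, channels, frames, height, width)
        x_sp, x_tm = torch.split(x, [self.c_spatial, self.c_temporal], dim=1)
        return torch.cat([self.spatial(x_sp), self.temporal(x_tm)], dim=1)

# Shape check on a Kinetics-style clip tensor: the output shape matches a
# standard 3x3x3 Conv3d, so STS can stand in for it without other changes.
x = torch.randn(2, 64, 8, 56, 56)
y = STSConv3d(64, 64)(x)
print(y.shape)  # torch.Size([2, 64, 8, 56, 56])

Note that each branch in this sketch convolves only a subset of the channels with a factored kernel, so it uses no more parameters or FLOPs than the full 3D convolution it would replace, consistent with the abstract's claim of no added parameters or computation.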
Pages: 675-691 (17 pages)