In Defense of Image Pre-Training for Spatiotemporal Recognition

Cited by: 0
Authors
Li, Xianhang [1 ]
Wang, Huiyu [2 ]
Wei, Chen [2 ]
Mei, Jieru [2 ]
Yuille, Alan [2 ]
Zhou, Yuyin [1 ]
Xie, Cihang [1 ]
Affiliations
[1] Univ Calif Santa Cruz, Santa Cruz, CA 95064 USA
[2] Johns Hopkins Univ, Baltimore, MD USA
Source
Keywords
Video classification; ImageNet pre-training; 3D convolution networks
DOI
10.1007/978-3-031-19806-9_39
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Image pre-training, the current de-facto paradigm for a wide range of visual tasks, is generally less favored in the field of video recognition. By contrast, a common strategy is to directly train spatiotemporal convolutional neural networks (CNNs) from scratch. Nonetheless, interestingly, by taking a closer look at these from-scratch learned CNNs, we note that certain 3D kernels exhibit much stronger appearance-modeling ability than others, arguably suggesting that appearance information is already well disentangled during learning. Inspired by this observation, we hypothesize that the key to effectively leveraging image pre-training lies in decomposing the learning of spatial and temporal features, and in revisiting image pre-training as the appearance prior for initializing 3D kernels. In addition, we propose Spatial-Temporal Separable (STS) convolution, which explicitly splits the feature channels into spatial and temporal groups, to further enable a more thorough decomposition of spatiotemporal features when fine-tuning 3D CNNs. Our experiments show that simply replacing 3D convolution with STS notably improves a wide range of 3D CNNs, without increasing parameters or computation, on both Kinetics-400 and Something-Something V2. Moreover, this new training pipeline consistently achieves better results on video recognition with a significant speedup. For instance, we achieve a +0.6% top-1 gain for SlowFast on Kinetics-400 over the strong 256-epoch, 128-GPU baseline while fine-tuning for only 50 epochs with 4 GPUs. The code and models are available at https://github.com/UCSC-VLAA/Image-Pretraining-for-Video.
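The abstract's central architectural idea, the Spatial-Temporal Separable (STS) convolution, splits the feature channels of a 3D convolution into a spatial group and a temporal group. Below is a minimal PyTorch sketch of that idea; the STSConv3d class name, the 50/50 channel split, and the kernel sizes are illustrative assumptions rather than the authors' released implementation (see the linked repository for the official code).

import torch
import torch.nn as nn


class STSConv3d(nn.Module):
    """Sketch of a Spatial-Temporal Separable (STS) convolution.

    The input channels are split into a "spatial" group, convolved with a
    1 x k x k kernel (per-frame, appearance), and a "temporal" group,
    convolved with a k x 1 x 1 kernel (along time, motion); the two outputs
    are concatenated. The split ratio and kernel sizes here are illustrative.
    """

    def __init__(self, in_channels, out_channels, kernel_size=3, spatial_ratio=0.5):
        super().__init__()
        self.in_spatial = int(in_channels * spatial_ratio)
        self.in_temporal = in_channels - self.in_spatial
        out_spatial = int(out_channels * spatial_ratio)
        out_temporal = out_channels - out_spatial
        pad = kernel_size // 2
        # Spatial group: 2D-style kernel applied independently to each frame.
        self.spatial_conv = nn.Conv3d(
            self.in_spatial, out_spatial,
            kernel_size=(1, kernel_size, kernel_size),
            padding=(0, pad, pad), bias=False)
        # Temporal group: 1D kernel sliding over the time axis only.
        self.temporal_conv = nn.Conv3d(
            self.in_temporal, out_temporal,
            kernel_size=(kernel_size, 1, 1),
            padding=(pad, 0, 0), bias=False)

    def forward(self, x):
        # x has shape (batch, channels, time, height, width).
        xs, xt = torch.split(x, [self.in_spatial, self.in_temporal], dim=1)
        return torch.cat([self.spatial_conv(xs), self.temporal_conv(xt)], dim=1)


if __name__ == "__main__":
    # Dummy clip: 2 videos, 64 channels, 8 frames, 56 x 56 spatial resolution.
    x = torch.randn(2, 64, 8, 56, 56)
    print(STSConv3d(64, 64)(x).shape)  # torch.Size([2, 64, 8, 56, 56])

Because each group convolves only a subset of the channels, a layer like this can stand in for a full 3D convolution without adding parameters, which is consistent with the abstract's claim; the spatial branch would also be the natural place to load 2D kernels from an image pre-trained model, though the paper's exact initialization scheme is in the official code.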
Pages: 675-691
Number of pages: 17