In Defense of Image Pre-Training for Spatiotemporal Recognition

Cited by: 0
Authors
Li, Xianhang [1]
Wang, Huiyu [2]
Wei, Chen [2]
Mei, Jieru [2]
Yuille, Alan [2]
Zhou, Yuyin [1]
Xie, Cihang [1]
Affiliations
[1] Univ Calif Santa Cruz, Santa Cruz, CA 95064 USA
[2] Johns Hopkins Univ, Baltimore, MD USA
Source
COMPUTER VISION, ECCV 2022, PT XXV | 2022, Vol. 13685
Keywords
Video classification; ImageNet pre-training; 3D convolution networks
DOI
10.1007/978-3-031-19806-9_39
CLC Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Image pre-training, the current de facto paradigm for a wide range of visual tasks, is generally less favored in the field of video recognition, where the common strategy is instead to train spatiotemporal convolutional neural networks (CNNs) from scratch. Interestingly, however, a closer look at these from-scratch-trained CNNs reveals that certain 3D kernels exhibit much stronger appearance modeling ability than others, arguably suggesting that appearance information is already well disentangled during learning. Inspired by this observation, we hypothesize that the key to effectively leveraging image pre-training lies in decomposing the learning of spatial and temporal features, and in revisiting image pre-training as an appearance prior for initializing 3D kernels. In addition, we propose Spatial-Temporal Separable (STS) convolution, which explicitly splits the feature channels into spatial and temporal groups, to enable a more thorough decomposition of spatiotemporal features when fine-tuning 3D CNNs. Our experiments show that simply replacing 3D convolution with STS notably improves a wide range of 3D CNNs on both Kinetics-400 and Something-Something V2 without increasing parameters or computation. Moreover, this new training pipeline consistently achieves better video recognition results with significant speedup. For instance, we improve the top-1 accuracy of SlowFast on Kinetics-400 by 0.6% over the strong 256-epoch, 128-GPU baseline while fine-tuning for only 50 epochs on 4 GPUs. The code and models are available at https://github.com/UCSC-VLAA/Image-Pretraining-for-Video.
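Two ideas in the abstract lend themselves to a concrete illustration: (i) splitting feature channels into spatial and temporal groups (STS convolution) and (ii) using image-pre-trained 2D weights as an appearance prior for 3D kernels. The following PyTorch sketch is a minimal reading of the abstract only; the 50/50 split ratio, the 1 x k x k / k x 1 x 1 kernel shapes, and the initialization helper are illustrative assumptions, not the authors' exact design (the linked repository contains the reference implementation).

    import torch
    import torch.nn as nn

    class STSConv3d(nn.Module):
        """Channel-split convolution: a spatial group using 1 x k x k kernels
        and a temporal group using k x 1 x 1 kernels, concatenated back
        together. Illustrative sketch, not the paper's exact module."""

        def __init__(self, in_channels, out_channels, kernel_size=3, split=0.5):
            super().__init__()
            self.in_s = int(in_channels * split)  # channels routed to the spatial branch
            out_s = int(out_channels * split)
            pad = kernel_size // 2
            # Spatial branch: convolves within each frame (no temporal extent).
            self.spatial = nn.Conv3d(self.in_s, out_s,
                                     kernel_size=(1, kernel_size, kernel_size),
                                     padding=(0, pad, pad), bias=False)
            # Temporal branch: convolves across frames (no spatial extent).
            self.temporal = nn.Conv3d(in_channels - self.in_s, out_channels - out_s,
                                      kernel_size=(kernel_size, 1, 1),
                                      padding=(pad, 0, 0), bias=False)

        def forward(self, x):  # x: (N, C, T, H, W)
            xs, xt = x[:, :self.in_s], x[:, self.in_s:]
            return torch.cat([self.spatial(xs), self.temporal(xt)], dim=1)

    def load_appearance_prior(sts, conv2d_weight):
        """Copy an ImageNet-pre-trained 2D kernel of shape (out_s, in_s, k, k)
        into the spatial branch by inserting a singleton temporal dimension:
        one reading of 'image pre-training as an appearance prior for 3D
        kernels'. Hypothetical helper, not from the paper's codebase."""
        with torch.no_grad():
            sts.spatial.weight.copy_(conv2d_weight.unsqueeze(2))

Under these assumptions, STSConv3d(64, 64) maps a clip tensor of shape (2, 64, 8, 56, 56) to an output of the same shape, and the spatial branch's singleton temporal extent is what makes copying 2D ImageNet weights into it a direct reshape.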
Pages: 675-691
Number of pages: 17
Related Papers
50 items in total
  • [41] Fan, Zhiyun; Zhou, Shiyu; Xu, Bo. Two-Stage Pre-Training for Sequence to Sequence Speech Recognition. 2021 International Joint Conference on Neural Networks (IJCNN), 2021.
  • [42] Dong, Haoyu; Cheng, Zhoujun; He, Xinyi; Zhou, Mengyu; Zhou, Anda; Zhou, Fan; Liu, Ao; Han, Shi; Zhang, Dongmei. Table Pre-training: A Survey on Model Architectures, Pre-training Objectives, and Downstream Tasks. Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence (IJCAI 2022), 2022: 5426-5435.
  • [43] Oliveira, Marta; Wilming, Rick; Clark, Benedict; Budding, Celine; Eitel, Fabian; Ritter, Kerstin; Haufe, Stefan. Benchmarking the influence of pre-training on explanation performance in MR image classification. Frontiers in Artificial Intelligence, 2024, 7.
  • [44] Huang, Runhui; Long, Yanxin; Han, Jianhua; Xu, Hang; Liang, Xiwen; Xu, Chunjing; Liang, Xiaodan. NLIP: Noise-Robust Language-Image Pre-training. Thirty-Seventh AAAI Conference on Artificial Intelligence (AAAI 2023), Vol. 37, No. 1, 2023: 926-934.
  • [45] Lee, Janghyeon; Kim, Jongsuk; Shon, Hyounguk; Kim, Bumsoo; Kim, Seung Hwan; Lee, Honglak; Kim, Junmo. UniCLIP: Unified Framework for Contrastive Language-Image Pre-training. Advances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022.
  • [46] Han, Yufei; Chen, Haoyuan; Xu, Pin; Li, Yanyi; Li, Kuan; Yin, Jianping. Hybrid Pre-training Based on Masked Autoencoders for Medical Image Segmentation. Theoretical Computer Science (NCTCS 2022), 2022, 1693: 175-182.
  • [47] Wen, Zhoufutu; Zhao, Xinyu; Jin, Zhipeng; Yang, Yi; Jia, Wei; Chen, Xiaodong; Li, Shuanglong; Liu, Lin. Enhancing Dynamic Image Advertising with Vision-Language Pre-training. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2023), 2023: 3310-3314.
  • [48] Yang, Qiushi; Li, Wuyang; Li, Baopu; Yuan, Yixuan. MRM: Masked Relation Modeling for Medical Image Pre-Training with Genetics. 2023 IEEE/CVF International Conference on Computer Vision (ICCV 2023), 2023: 21395-21405.
  • [49] Rahal, Najoua; Vogtlin, Lars; Ingold, Rolf. Historical document image analysis using controlled data for pre-training. International Journal on Document Analysis and Recognition, 2023, 26(3): 241-254.
  • [50] Yang, Kaicheng; Deng, Jiankang; An, Xiang; Li, Jiawei; Feng, Ziyong; Guo, Jia; Yang, Jing; Liu, Tongliang. ALIP: Adaptive Language-Image Pre-training with Synthetic Caption. 2023 IEEE/CVF International Conference on Computer Vision (ICCV 2023), 2023: 2910-2919.