In Defense of Image Pre-Training for Spatiotemporal Recognition

Cited by: 0
Authors:
Li, Xianhang [1 ]
Wang, Huiyu [2 ]
Wei, Chen [2 ]
Mei, Jieru [2 ]
Yuille, Alan [2 ]
Zhou, Yuyin [1 ]
Xie, Cihang [1 ]
Affiliations:
[1] Univ Calif Santa Cruz, Santa Cruz, CA 95064 USA
[2] Johns Hopkins Univ, Baltimore, MD USA
Keywords: Video classification; ImageNet pre-training; 3D convolutional networks
DOI: 10.1007/978-3-031-19806-9_39
Chinese Library Classification: TP18 [Artificial Intelligence Theory]
Discipline Codes: 081104; 0812; 0835; 1405
Abstract:
Image pre-training, the current de facto paradigm for a wide range of visual tasks, is generally less favored in video recognition, where a common strategy is instead to train spatiotemporal convolutional neural networks (CNNs) from scratch. Interestingly, a closer look at these from-scratch-trained CNNs reveals that certain 3D kernels exhibit much stronger appearance-modeling ability than others, suggesting that appearance information is already well disentangled during learning. Inspired by this observation, we hypothesize that the key to effectively leveraging image pre-training lies in decomposing the learning of spatial and temporal features, and in revisiting image pre-training as an appearance prior for initializing 3D kernels. In addition, we propose Spatial-Temporal Separable (STS) convolution, which explicitly splits the feature channels into spatial and temporal groups, to enable a more thorough decomposition of spatiotemporal features when fine-tuning 3D CNNs. Our experiments show that simply replacing 3D convolution with STS notably improves a wide range of 3D CNNs on both Kinetics-400 and Something-Something V2, without increasing parameters or computation. Moreover, this new training pipeline consistently achieves better video-recognition results with a significant speedup. For instance, on Kinetics-400 we improve the top-1 accuracy of SlowFast by 0.6% over the strong 256-epoch, 128-GPU baseline, while fine-tuning for only 50 epochs on 4 GPUs. The code and models are available at https://github.com/UCSC-VLAA/Image-Pretraining-for-Video.
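The abstract describes STS convolution only at a high level, so the following is a minimal PyTorch sketch of one plausible reading: the input channels are split into a spatial group, convolved with a 1 x k x k kernel, and a temporal group, convolved with a k x 1 x 1 kernel, with the two outputs concatenated. The module name STSConv3d, the even channel split, and the kernel shapes are illustrative assumptions, not the authors' implementation; the exact design is in the repository linked above.

import torch
import torch.nn as nn

class STSConv3d(nn.Module):
    """Hypothetical Spatial-Temporal Separable convolution (sketch only)."""

    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1):
        super().__init__()
        # Assumption: an even split between spatial and temporal groups.
        self.c_spatial = in_channels // 2
        self.c_temporal = in_channels - self.c_spatial
        pad = kernel_size // 2
        # Spatial group: 1 x k x k kernel, i.e. appearance modeling only.
        self.spatial = nn.Conv3d(
            self.c_spatial, out_channels // 2,
            kernel_size=(1, kernel_size, kernel_size),
            stride=(1, stride, stride), padding=(0, pad, pad), bias=False)
        # Temporal group: k x 1 x 1 kernel, i.e. motion modeling only.
        self.temporal = nn.Conv3d(
            self.c_temporal, out_channels - out_channels // 2,
            kernel_size=(kernel_size, 1, 1),
            stride=(1, stride, stride), padding=(pad, 0, 0), bias=False)

    def forward(self, x):
        # x: (batch, channels, frames, height, width)
        x_sp, x_tm = torch.split(x, [self.c_spatial, self.c_temporal], dim=1)
        return torch.cat([self.spatial(x_sp), self.temporal(x_tm)], dim=1)

# Shape check on a Kinetics-style clip tensor: the output shape matches a
# standard 3x3x3 Conv3d, so STS can stand in for it without other changes.
x = torch.randn(2, 64, 8, 56, 56)
y = STSConv3d(64, 64)(x)
print(y.shape)  # torch.Size([2, 64, 8, 56, 56])

Note that each branch in this sketch convolves only a subset of the channels with a factored kernel, so it uses no more parameters or FLOPs than the full 3D convolution it would replace, consistent with the abstract's claim of no added parameters or computation.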
Pages: 675-691 (17 pages)