Fine-Grained Spatiotemporal Motion Alignment for Contrastive Video Representation Learning

Cited by: 2
Authors
Zhu, Minghao [1]
Lin, Xiao [1]
Dang, Ronghao [1]
Liu, Chengju [1]
Chen, Qijun [1]
Affiliations
[1] Tongji University, Shanghai, People's Republic of China
Funding
National Natural Science Foundation of China
Keywords
Self-supervised Learning; Action Recognition
DOI
10.1145/3581783.3611932
Chinese Library Classification number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
As the most essential property in a video, motion information is critical to a robust and generalized video representation. To inject motion dynamics, recent works have adopted frame difference as the source of motion information in video contrastive learning, considering the trade-off between quality and cost. However, existing works align motion features at the instance level, which suffers from spatial and temporal weak alignment across modalities. In this paper, we present a Fine-grained Motion Alignment (FIMA) framework, capable of introducing well-aligned and significant motion information. Specifically, we first develop a dense contrastive learning framework in the spatiotemporal domain to generate pixel-level motion supervision. Then, we design a motion decoder and a foreground sampling strategy to eliminate the weak alignments in terms of time and space. Moreover, a frame-level motion contrastive loss is presented to improve the temporal diversity of the motion features. Extensive experiments demonstrate that the representations learned by FIMA possess great motion-awareness capabilities and achieve state-of-the-art or competitive results on downstream tasks across UCF101, HMDB51, and Diving48 datasets. Code is available at https://github.com/ZMHH-H/FIMA.
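Two ingredients named in the abstract, frame difference as a motion signal and a contrastive loss that pairs features position by position, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `frame_difference` and `info_nce`, the temperature value, and the random stand-in features are all assumptions made for illustration, and the motion decoder and foreground sampling strategy are not modeled.

```python
import numpy as np


def frame_difference(clip):
    """Difference of adjacent frames: (T, H, W, C) -> (T-1, H, W, C)."""
    return clip[1:] - clip[:-1]


def info_nce(queries, keys, temperature=0.1):
    """InfoNCE loss where queries[i] and keys[i] form the positive pair.

    Every other row of `keys` acts as a negative for queries[i].
    The temperature of 0.1 is an illustrative choice, not a value from the paper.
    """
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = q @ k.T / temperature
    # Subtract the row-wise maximum for a numerically stable log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


rng = np.random.default_rng(0)

# A toy clip of 8 RGB frames; real inputs would be decoded video.
clip = rng.random((8, 16, 16, 3))
motion_input = frame_difference(clip)

# Hypothetical per-location features (32 spatiotemporal positions, 64 dims).
# In a real model these would come from RGB and motion encoders.
rgb_feats = rng.normal(size=(32, 64))
motion_aligned = rgb_feats + 0.01 * rng.normal(size=(32, 64))
# Shifting by one position simulates pairing each location with the wrong one.
motion_misaligned = np.roll(motion_aligned, shift=1, axis=0)

loss_aligned = info_nce(rgb_feats, motion_aligned)
loss_misaligned = info_nce(rgb_feats, motion_misaligned)
```

In this toy setup the loss is lower when each position is paired with its own motion feature than when the pairing is shifted, which is the intuition behind aligning features at a finer granularity than the whole clip.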
Pages: 4725-4736
Page count: 12