Contrastive Language-Video Learning Model Based on Spatio-Temporal Information Auxiliary Supervision

Cited: 0
Authors
Zhang, Bing-Bing [1 ,2 ]
Zhang, Jian-Xin [1 ]
Li, Pei-Hua [2 ]
Affiliations
[1] School of Computer Science and Engineering, Dalian Minzu University, Dalian 116650, Liaoning, China
[2] School of Information and Communication Engineering, Dalian University of Technology, Dalian 116033, Liaoning, China
Keywords
3D modeling; BASIC (programming language); Computer vision; Human computer interaction; Modeling languages; Semantics; Three dimensional computer graphics; Video recording; Visual languages
DOI
10.11897/SP.J.1016.2024.01769
Abstract
Video action recognition is one of the most active topics in computer vision and has attracted broad research attention in recent decades. Its core techniques are widely used in Internet video auditing, video surveillance, human-computer interaction, and other fields. The subject of a video is usually a person, and the complexity and variability of human action categories and environments in real life, together with the sheer volume of video data and the computing resources it demands, make the task highly challenging. In video surveillance, most existing systems merely record abnormal actions rather than recognizing them in real time, so they fall short of genuine intelligence; in Internet video auditing, considerable manual review is still required and human actions cannot be recognized in real time.

A video can be regarded as a sequence of images changing over time, and this special image data carries rich information. Recognizing actions from video requires not only the spatial information of each frame, but also the temporal reasoning information between frames and, more importantly, their joint spatio-temporal information. To this end, researchers have developed many network architectures for video action recognition, which fall into four categories: methods based on two-stream convolutional neural networks (CNNs), methods based on 3D CNNs, 2D CNNs equipped with spatio-temporal modeling modules, and visual Transformer-based networks. Transformer-based models that integrate the language and image modalities have made great progress in computer vision, with three representative works on image tasks: the Contrastive Language-Image Pre-training (CLIP) model, the A Large-scale ImaGe and Noisy-text embedding (ALIGN) model, and the Florence model. However, when these models are applied to video recognition, limitations remain: they neglect the rich spatio-temporal information in videos, and the text descriptions of video categories are so simple that their contextual descriptive ability is insufficient.

In this paper, we propose a contrastive language-video learning model based on spatio-temporal information auxiliary supervision. For the video encoder, we propose a category-token-based temporal weighted displacement module for temporal modeling, which allows temporal information to propagate through every level of the network from bottom to top; we further propose a spatio-temporal information auxiliary supervision module to fully exploit the rich spatio-temporal information embedded in the visual tokens. For the language encoder, we propose a prompt learning method based on large-scale pre-trained language models that extends action category texts into descriptions with rich contextual semantics. Experiments on four video action recognition datasets, namely mini-Kinetics-200, Kinetics-400, UCF101, and HMDB51, show that the proposed model outperforms or matches current state-of-the-art methods, with accuracy 2.5%, 0.3%, 0.6%, and 2.4% higher than the baseline, respectively. © 2024 Science Press. All rights reserved.
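The abstract gives no implementation details, so the following is only a minimal PyTorch-style sketch of how a category-token temporal weighted displacement (shift) module of the kind described above could look. The class name ClassTokenTemporalShift, the three learnable blending weights, the shift_ratio parameter, and the tensor layout are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn


class ClassTokenTemporalShift(nn.Module):
    """Illustrative sketch (assumed design, not the authors' code):
    blend each frame's class token with its temporal neighbours using
    learned weights and write the blend back into a fixed fraction of
    the channels, so temporal information can propagate through every
    transformer block from bottom to top."""

    def __init__(self, dim: int, shift_ratio: float = 0.25):
        super().__init__()
        self.shift_dim = int(dim * shift_ratio)
        # learnable blending weights for (previous, current, next) frames
        self.weights = nn.Parameter(torch.tensor([0.25, 0.5, 0.25]))

    def forward(self, tokens: torch.Tensor, num_frames: int) -> torch.Tensor:
        # tokens: (batch * num_frames, 1 + num_patches, dim); index 0 is the class token
        bt, _, d = tokens.shape
        b = bt // num_frames
        cls = tokens[:, 0].reshape(b, num_frames, d)           # (B, T, D)

        prev = torch.cat([cls[:, :1], cls[:, :-1]], dim=1)     # frame t-1 (replicate first)
        nxt = torch.cat([cls[:, 1:], cls[:, -1:]], dim=1)      # frame t+1 (replicate last)

        w = torch.softmax(self.weights, dim=0)
        mixed = w[0] * prev + w[1] * cls + w[2] * nxt          # weighted temporal mix

        shifted = cls.clone()
        shifted[..., : self.shift_dim] = mixed[..., : self.shift_dim]

        out = tokens.clone()
        out[:, 0] = shifted.reshape(bt, d)
        return out


# Usage with a ViT-B/16-like token layout: 2 clips of 8 frames each.
x = torch.randn(2 * 8, 1 + 196, 768)
module = ClassTokenTemporalShift(dim=768)
y = module(x, num_frames=8)
print(y.shape)  # torch.Size([16, 197, 768])
```

In a design like this, only a fraction of the class-token channels is overwritten, so the module adds almost no parameters and leaves most of the CLIP-initialized representation untouched; presumably one such module would precede each transformer block of the video encoder, while the language branch supplies the prompt-extended category embeddings for contrastive matching.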
Pages: 1769-1785