A Multitemporal Scale and Spatial-Temporal Transformer Network for Temporal Action Localization

Cited: 4
Authors
Gao, Zan [1 ,2 ]
Cui, Xinglei [1 ]
Zhuo, Tao [1 ]
Cheng, Zhiyong [1 ]
Liu, An-An [3 ]
Wang, Meng [4 ]
Chen, Shenyong [2 ]
Affiliations
[1] Qilu Univ Technol, Shandong Artificial Intelligence Inst, Shandong Acad Sci, Jinan 250014, Peoples R China
[2] Tianjin Univ Technol, Key Lab Comp Vis & Syst, Minist Educ, Tianjin 300384, Peoples R China
[3] Tianjin Univ, Sch Elect & Informat Engn, Tianjin 300072, Peoples R China
[4] Hefei Univ Technol, Sch Comp Sci & Informat Engn, Hefei 230009, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Transformers; Semantics; Feature extraction; Proposals; Location awareness; Convolution; Task analysis; Frame-level self-attention (FSA); multiple temporal scales; refined feature pyramids (RFPs); spatial-temporal transformer (STT); temporal action localization (TAL); ACTION RECOGNITION; GRANULARITY;
DOI
10.1109/THMS.2023.3266037
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Temporal action localization plays an important role in video analysis, which aims to localize and classify actions in untrimmed videos. Previous methods often predict actions on a feature space of a single temporal scale. However, the temporal features of a low-level scale lack sufficient semantics for action classification, while a high-level scale cannot provide the rich details of the action boundaries. In addition, the long-range dependencies of video frames are often ignored. To address these issues, a novel multitemporal-scale spatial-temporal transformer (MSST) network is proposed for temporal action localization, which predicts actions on a feature space of multiple temporal scales. Specifically, we first use refined feature pyramids of different scales to pass semantics from high-level scales to low-level scales. Second, to establish the long temporal scale of the entire video, we use a spatial-temporal transformer encoder to capture the long-range dependencies of video frames. Then, the refined features with long-range dependencies are fed into a classifier for coarse action prediction. Finally, to further improve the prediction accuracy, we propose a frame-level self-attention module to refine the classification and boundaries of each action instance. Most importantly, these three modules are jointly explored in a unified framework, and MSST has an anchor-free and end-to-end architecture. Extensive experiments show that the proposed method can outperform state-of-the-art approaches on the THUMOS14 dataset and achieve comparable performance on the ActivityNet1.3 dataset. Compared with A2Net (TIP20, Avg{0.3:0.7}), Sub-Action (CSVT2022, Avg{0.1:0.5}), and AFSD (CVPR21, Avg{0.3:0.7}) on the THUMOS14 dataset, the proposed method can achieve improvements of 12.6%, 17.4%, and 2.2%, respectively.
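To make the abstract's frame-level self-attention idea concrete, here is a minimal NumPy sketch of scaled dot-product self-attention over per-frame features, which is how a transformer encoder captures long-range dependencies across video frames. The function name, random projection weights, and dimensions are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def frame_self_attention(frames, seed=0):
    """Scaled dot-product self-attention across video frames.

    frames: (T, D) array of per-frame features.
    Returns a (T, D) array in which each frame's feature is a
    weighted mixture of all frames (long-range dependencies).
    Projection weights are random placeholders for illustration.
    """
    T, D = frames.shape
    rng = np.random.default_rng(seed)
    Wq, Wk, Wv = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))
    Q, K, V = frames @ Wq, frames @ Wk, frames @ Wv
    scores = Q @ K.T / np.sqrt(D)                # (T, T) frame-to-frame affinities
    scores -= scores.max(axis=1, keepdims=True)  # softmax numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)      # each row sums to 1
    return attn @ V                              # (T, D) refined features

# Example: 8 frames with 16-dimensional features
refined = frame_self_attention(np.ones((8, 16)))
print(refined.shape)  # (8, 16)
```

In the paper's setting such a module would refine the coarse per-frame predictions; here it only illustrates the attention mechanism itself.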
Pages: 569-580
Page count: 12
Related Papers
50 records in total
  • [1] Spatial-temporal Graph Transformer Network for Spatial-temporal Forecasting
    Dao, Minh-Son
    Zetsu, Koji
    Hoang, Duy-Tang
    Proceedings - 2024 IEEE International Conference on Big Data, BigData 2024, 2024, : 1276 - 1281
  • [2] Fast Spatial-Temporal Transformer Network
    Escher, Rafael Molossi
    de Bem, Rodrigo Andrade
    Jorge Drews Jr, Paulo Lilles
    2021 34TH SIBGRAPI CONFERENCE ON GRAPHICS, PATTERNS AND IMAGES (SIBGRAPI 2021), 2021, : 65 - 72
  • [3] Spatial-temporal graph transformer network for skeleton-based temporal action segmentation
    Tian, Xiaoyan
    Jin, Ye
    Zhang, Zhao
    Liu, Peng
    Tang, Xianglong
    MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 83 (15) : 44273 - 44297
  • [4] Spatial-temporal graph transformer network for skeleton-based temporal action segmentation
    Xiaoyan Tian
    Ye Jin
    Zhao Zhang
    Peng Liu
    Xianglong Tang
    Multimedia Tools and Applications, 2024, 83 : 44273 - 44297
  • [5] Spatial-Temporal Transformer Network for Continuous Action Recognition in Industrial Assembly
    Huang, Jianfeng
    Liu, Xiang
    Hu, Huan
    Tang, Shanghua
    Li, Chenyang
    Zhao, Shaoan
    Lin, Yimin
    Wang, Kai
    Liu, Zhaoxiang
    Lian, Shiguo
    ADVANCED INTELLIGENT COMPUTING TECHNOLOGY AND APPLICATIONS, PT X, ICIC 2024, 2024, 14871 : 114 - 130
  • [6] STAN: Spatial-Temporal Awareness Network for Temporal Action Detection
    Liu, Minghao
    Liu, Haiyi
    Zhao, Sirui
    Ma, Fei
    Li, Minglei
    Dai, Zonghong
    Wang, Hao
    Xu, Tong
    Chen, Enhong
    PROCEEDINGS OF THE 6TH INTERNATIONAL WORKSHOP ON MULTIMEDIA CONTENT ANALYSIS IN SPORTS, MMSPORTS 2023, 2023, : 161 - 165
  • [7] TEST: Temporal-spatial separated transformer for temporal action localization
    Wan, Herun
    Luo, Minnan
    Li, Zhihui
    Wang, Yang
    NEUROCOMPUTING, 2025, 614
  • [8] Multi-Scale Spatial-Temporal Transformer: A Novel Framework for Spatial-Temporal Edge Data Prediction
    Ming, Junhao
    Zhang, Dongmei
    Han, Wei
    APPLIED SCIENCES-BASEL, 2023, 13 (17):
  • [9] Graph Spatial-Temporal Transformer Network for Traffic Prediction
    Zhao, Zhenzhen
    Shen, Guojiang
    Wang, Lei
    Kong, Xiangjie
    BIG DATA RESEARCH, 2024, 36
  • [10] Hierarchy Spatial-Temporal Transformer for Action Recognition in Short Videos
    Cai, Guoyong
    Cai, Yumeng
    FUZZY SYSTEMS AND DATA MINING VI, 2020, 331 : 760 - 774