Video-based spatio-temporal scene graph generation with efficient self-supervision tasks

Authors
Lianggangxu Chen
Yiqing Cai
Changhong Lu
Changbo Wang
Gaoqi He
Affiliations
[1] Chongqing Institute of East China Normal University, Chongqing Key Laboratory of Precision Optics
[2] East China Normal University, School of Computer Science and Technology
[3] East China Normal University, School of Mathematical Sciences
Keywords
Spatio-temporal scene graphs generation; Self-supervision; Local relation-aware attention
Abstract
Spatio-temporal scene graph generation (STSGG) aims to extract a sequence of graph-based semantic representations from video for high-level visual tasks. Existing works often fail to exploit strong temporal correlations and local feature details, and therefore cannot distinguish dynamic relations (e.g., drinking) from static relations (e.g., holding). Furthermore, because of severe long-tailed bias, predictions suffer from inaccurate classification of tail predicates. To address these issues, a slowfast local-aware attention (SFLA) network is proposed for temporal modeling in STSGG. First, a two-branch network extracts static and dynamic relation features separately. Second, a local relation-aware attention (LRA) module is proposed to assign higher importance to the crucial elements in a local relationship. Third, three novel self-supervised prediction tasks are proposed: spatial location, human attention state, and distance variation. These tasks are trained jointly with the main model to alleviate the long-tailed bias problem and enhance feature discrimination. Systematic experiments show that the method achieves state-of-the-art performance on the recently proposed Action Genome (AG) dataset and the popular ImageNet Video dataset.
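
The abstract names distance variation as one of the self-supervised tasks but does not say how its training labels are built. As a rough illustration only, the sketch below assumes a label can be derived from how the distance between subject and object bounding-box centers changes between two frames. The function names, the `(x1, y1, x2, y2)` box format, the three label strings, and the `tolerance` parameter are all hypothetical and are not taken from the paper.

```python
import math


def box_center(box):
    """Return the center (x, y) of a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def center_distance(subject_box, object_box):
    """Euclidean distance between the centers of two boxes."""
    sx, sy = box_center(subject_box)
    ox, oy = box_center(object_box)
    return math.hypot(sx - ox, sy - oy)


def distance_variation_label(prev_pair, next_pair, tolerance=1.0):
    """Hypothetical pseudo-label for a (subject, object) pair across two frames.

    prev_pair and next_pair are (subject_box, object_box) tuples.
    Changes smaller than `tolerance` pixels count as "unchanged";
    this threshold is an assumption, not a value from the paper.
    """
    delta = center_distance(*next_pair) - center_distance(*prev_pair)
    if delta > tolerance:
        return "farther"
    if delta < -tolerance:
        return "closer"
    return "unchanged"


# Example: a cup moving toward a person between two frames.
person = (0, 0, 10, 10)
cup_before = (20, 0, 30, 10)
cup_after = (12, 0, 22, 10)
label = distance_variation_label((person, cup_before), (person, cup_after))
```

Labels of this kind come for free from detected boxes, which is the general appeal of a self-supervised auxiliary task: it adds a training signal without extra annotation. How the paper actually defines and weights its three tasks is not stated in the abstract.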
Pages: 38947 - 38966
Page count: 19