Rethinking vision transformer through human-object interaction detection

Cited by: 3
Authors
Cheng, Yamin [1]
Zhao, Zitian [1]
Wang, Zhi [1]
Duan, Hancong [1]
Affiliations
[1] University of Electronic Science and Technology of China, School of Computer Science and Engineering, Chengdu, People's Republic of China
Keywords
Human-object interaction; Vision transformer; NETWORK
DOI
10.1016/j.engappai.2023.106123
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Recent works have shown that Vision Transformer (ViT) models can achieve comparable or even superior performance on image- and region-level recognition tasks, i.e., image recognition and object detection. However, can a Vision Transformer perform region-level relationship reasoning with minimal information about the spatial geometry formation of input images? To answer this question, we propose the Region-level Relationship Reasoning Vision Transformer (R3ViT), a family of human-object interaction detection models based on the vanilla Vision Transformer with the fewest possible revisions, common region priors, and inductive biases of the objective task. Specifically, we first divide the input images into several local patches, replace the specialized [CLS] token in vanilla ViT with extra relationship semantics carrier tokens in an entanglement-, pair-, or triplet-wise manner, and calculate both the representations and their relevance. We assign each extra token individual supervision and compute the training loss in a dense manner. We find that a vision transformer adjusted only by this paradigm can already reason about region-level visual relationships; e.g., R3ViT achieves strong performance on the challenging human-object interaction detection benchmark. We also discuss the impacts of adjustment schemes and model scaling strategies for Vision Transformer through R3ViT. Numerically, extensive experiments on several benchmarks demonstrate that our proposed framework outperforms most existing methods, achieving 28.91 mAP on HICO-DET and 56.8 mAP on V-COCO.
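The token layout described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (patchify, build_sequence, dense_loss), the single-channel image, the zero-initialised placeholder tokens, their position at the end of the sequence, and the plain averaging of per-token losses are all assumptions made for illustration.

```python
from typing import List

Image = List[List[float]]  # single-channel H x W image, kept small for brevity


def patchify(image: Image, patch: int) -> List[List[float]]:
    """Split an image into non-overlapping patch x patch tiles, each flattened."""
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image size must be divisible by patch size")
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            tile = [image[r][c]
                    for r in range(top, top + patch)
                    for c in range(left, left + patch)]
            patches.append(tile)
    return patches


def build_sequence(patches: List[List[float]], num_rel_tokens: int) -> List[List[float]]:
    """Build the input sequence with extra relationship tokens and no [CLS] token."""
    dim = len(patches[0])
    # Assumption: zero placeholders stand in for learned token embeddings.
    rel_tokens = [[0.0] * dim for _ in range(num_rel_tokens)]
    return patches + rel_tokens


def dense_loss(per_token_losses: List[float]) -> float:
    """Assumption: the dense loss is the mean of one loss per relationship token."""
    return sum(per_token_losses) / len(per_token_losses)


# A 4 x 4 toy image split into 2 x 2 patches, plus three relationship tokens.
image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
patches = patchify(image, patch=2)
sequence = build_sequence(patches, num_rel_tokens=3)
```

In this sketch each relationship token would receive its own supervision signal after the transformer layers, which is what dense_loss stands in for; the actual loss terms and token embeddings are not specified in the abstract.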
Pages: 9