Rethinking vision transformer through human-object interaction detection

Citations: 3
Authors
Cheng, Yamin [1 ]
Zhao, Zitian [1 ]
Wang, Zhi [1 ]
Duan, Hancong [1 ]
Affiliations
[1] Univ Elect Sci & Technol China, Sch Comp Sci & Engn, Chengdu, Peoples R China
Keywords
Human-object interaction; Vision transformer; Network
DOI
10.1016/j.engappai.2023.106123
Chinese Library Classification
TP [Automation technology; computer technology]
Discipline classification code
0812
Abstract
Recent work has shown that Vision Transformer (ViT) models can achieve comparable or even superior performance on image- and region-level recognition tasks, i.e., image recognition and object detection. However, can a Vision Transformer perform region-level relationship reasoning with minimal information about the spatial geometry of its input images? To answer this question, we propose the Region-level Relationship Reasoning Vision Transformer (R3ViT), a family of human-object interaction detection models built on the vanilla Vision Transformer with the fewest possible revisions, common region priors, and the inductive biases of the target task. Specifically, we first divide the input image into local patches, replace the specialized [CLS] token of the vanilla ViT with extra relationship-semantics carrier tokens in an entanglement-, pair-, or triplet-wise manner, and compute both their representations and their relevance. We assign each extra token its own supervision signal and compute the training loss densely over all of them. We find that a Vision Transformer adjusted by this simple paradigm can already reason about region-level visual relationships; for example, R3ViT achieves strong performance on challenging human-object interaction detection benchmarks. We also use R3ViT to discuss the impact of adjustment schemes and model-scaling strategies on Vision Transformers. Extensive experiments on several benchmarks demonstrate that our proposed framework outperforms most existing methods, achieving 28.91 mAP on HICO-DET and 56.8 mAP on V-COCO.
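The core adjustment described in the abstract, replacing the single [CLS] token with K relation-carrier tokens that attend alongside the patch tokens and each receive their own supervision, can be sketched as follows. This is a minimal illustrative mock-up, not the authors' implementation: all sizes, the single-head attention, the 3-class head, and the random targets are assumptions for demonstration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): 32x32 image, 8x8 patches, 16-dim embeddings.
IMG, P, D, K = 32, 8, 16, 4          # K extra relation-carrier tokens replace [CLS]
N = (IMG // P) ** 2                  # number of image patches

def patchify(img):
    """Split an IMG x IMG grayscale image into N flattened P x P patches."""
    patches = img.reshape(IMG // P, P, IMG // P, P).transpose(0, 2, 1, 3)
    return patches.reshape(N, P * P)

W_embed = rng.normal(scale=0.02, size=(P * P, D))   # patch-embedding projection
rel_tokens = rng.normal(scale=0.02, size=(K, D))    # learnable relation-carrier tokens

def self_attention(x):
    """Single-head scaled dot-product self-attention (projection weights omitted)."""
    scores = x @ x.T / np.sqrt(D)
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ x

# Relation tokens are prepended to the patch tokens, as [CLS] would be.
img = rng.normal(size=(IMG, IMG))
tokens = np.concatenate([rel_tokens, patchify(img) @ W_embed], axis=0)  # (K+N, D)
out = self_attention(tokens)

# Dense supervision: every relation token gets its own prediction and target,
# and the training loss averages the per-token cross-entropy over all K tokens.
W_head = rng.normal(scale=0.02, size=(D, 3))        # e.g. 3 interaction classes (assumed)
targets = rng.integers(0, 3, size=K)                # placeholder per-token labels
logits = out[:K] @ W_head                           # predict from relation tokens only
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)
loss = -np.log(probs[np.arange(K), targets]).mean()
print(tokens.shape, logits.shape, loss > 0)
```

The key design point the sketch reflects is that, unlike the single [CLS] token pooled for one image-level label, each of the K carrier tokens is tied to its own interaction hypothesis and loss term, so gradients reach every token directly.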
Pages: 9
Related Papers
50 records
  • [1] Agglomerative Transformer for Human-Object Interaction Detection
    Tu, Danyang
    Sun, Wei
    Zhai, Guangtao
    Shen, Wei
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 21557 - 21567
  • [2] Enhanced Transformer Interaction Components for Human-Object Interaction Detection
    Zhang, JinHui
    Zhao, Yuxiao
    Zhang, Xian
    Wang, Xiang
    Zhao, Yuxuan
    Wang, Peng
    Hu, Jian
    ACM SYMPOSIUM ON SPATIAL USER INTERACTION, SUI 2023, 2023,
  • [3] Human-Object Interaction Detection via Disentangled Transformer
    Zhou, Desen
    Liu, Zhichao
    Wang, Jian
    Wang, Leshan
    Hu, Tao
    Ding, Errui
    Wang, Jingdong
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 19546 - 19555
  • [4] Human-Object Interaction Detection with Ratio-Transformer
    Wang, Tianlang
    Lu, Tao
    Fang, Wenhua
    Zhang, Yanduo
    SYMMETRY-BASEL, 2022, 14 (08):
  • [5] Mask-Guided Transformer for Human-Object Interaction Detection
    Ying, Daocheng
    Yang, Hua
    Sun, Jun
    2022 IEEE INTERNATIONAL CONFERENCE ON VISUAL COMMUNICATIONS AND IMAGE PROCESSING (VCIP), 2022,
  • [6] Learning Human-Object Interaction Detection via Deformable Transformer
    Cai, Shuang
    Ma, Shiwei
    Gu, Dongzhou
    2021 INTERNATIONAL CONFERENCE ON IMAGE, VIDEO PROCESSING, AND ARTIFICIAL INTELLIGENCE, 2021, 12076
  • [7] Iwin: Human-Object Interaction Detection via Transformer with Irregular Windows
    Tu, Danyang
    Min, Xiongkuo
    Duan, Huiyu
    Guo, Guodong
    Zhai, Guangtao
    Shen, Wei
    COMPUTER VISION - ECCV 2022, PT IV, 2022, 13664 : 87 - 103
  • [8] Compositional Learning in Transformer-Based Human-Object Interaction Detection
    Zhuang, Zikun
    Qian, Ruihao
    Xie, Chi
    Liang, Shuang
    2023 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, ICME, 2023, : 1038 - 1043
  • [9] Pairwise CNN-Transformer Features for Human-Object Interaction Detection
    Quan, Hutuo
    Lai, Huicheng
    Gao, Guxue
    Ma, Jun
    Li, Junkai
    Chen, Dongji
    ENTROPY, 2024, 26 (03)
  • [10] Human-object interaction detection based on disentangled axial attention transformer
    Xia, Limin
    Xiao, Qiyue
    MACHINE VISION AND APPLICATIONS, 2024, 35 (04)