Rethinking vision transformer through human-object interaction detection

Cited by: 3
Authors
Cheng, Yamin [1 ]
Zhao, Zitian [1 ]
Wang, Zhi [1 ]
Duan, Hancong [1 ]
Affiliations
[1] Univ Elect Sci & Technol China, Sch Comp Sci & Engn, Chengdu, Peoples R China
Keywords
Human-object interaction; Vision transformer; Network
DOI
10.1016/j.engappai.2023.106123
Chinese Library Classification (CLC)
TP [Automation and computer technology];
Discipline classification code
0812;
Abstract
Recent works have shown that Vision Transformer (ViT) models can achieve comparable or even superior performance on image- and region-level recognition tasks such as image recognition and object detection. However, can a Vision Transformer perform region-level relationship reasoning with minimal information about the spatial geometry of input images? To answer this question, we propose the Region-level Relationship Reasoning Vision Transformer (R3ViT), a family of human-object interaction detection models built on the vanilla Vision Transformer with the fewest possible revisions, common region priors, and inductive biases of the target task. Specifically, we first divide the input images into local patches, replace the specialized [CLS] token of vanilla ViT with extra relationship-semantics carrier tokens in an entanglement-/pair-/triplet-wise manner, and compute both the representations and their relevance. Each extra token is assigned its own supervision signal, and the training loss is computed in a dense manner. We find that a Vision Transformer adjusted only by this paradigm can already reason about region-level visual relationships; for example, R3ViT achieves strong performance on the challenging human-object interaction detection benchmarks. We also discuss the impact of adjustment schemes and model scaling strategies for the Vision Transformer through R3ViT. Extensive experiments on several benchmarks demonstrate that the proposed framework outperforms most existing methods, achieving 28.91 mAP on HICO-DET and 56.8 mAP on V-COCO.
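The token scheme described in the abstract (several relationship-carrier tokens replacing the single [CLS] token, each with its own supervision) can be sketched roughly as follows. This is a minimal, hypothetical PyTorch sketch assuming a standard ViT encoder; the class name R3ViTSketch, the number of carrier tokens, and the verb/object/box prediction heads are illustrative assumptions based only on the abstract, not the authors' released implementation.

```python
# Hypothetical sketch of the carrier-token idea from the abstract:
# a plain ViT encoder whose single [CLS] token is replaced by K learnable
# "relationship carrier" tokens, each receiving its own supervision heads.
# All names and head choices here are illustrative assumptions.
import torch
import torch.nn as nn


class R3ViTSketch(nn.Module):
    def __init__(self, img_size=224, patch_size=16, dim=768, depth=12,
                 heads=12, num_carriers=64, num_verbs=117, num_objects=80):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # Patch embedding: split the image into local patches and project them.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        # K carrier tokens replace the single [CLS] token of vanilla ViT.
        self.carriers = nn.Parameter(torch.randn(1, num_carriers, dim) * 0.02)
        self.pos_embed = nn.Parameter(torch.randn(1, num_patches + num_carriers, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        # Each carrier token is densely supervised through its own heads
        # (here: verb class, object class, and human/object boxes, as an assumption).
        self.verb_head = nn.Linear(dim, num_verbs)
        self.obj_head = nn.Linear(dim, num_objects + 1)   # +1 for "no object"
        self.box_head = nn.Linear(dim, 8)                 # human box + object box

    def forward(self, images):
        b = images.size(0)
        patches = self.patch_embed(images).flatten(2).transpose(1, 2)  # (B, N, dim)
        tokens = torch.cat([self.carriers.expand(b, -1, -1), patches], dim=1)
        tokens = self.encoder(tokens + self.pos_embed)
        carriers = tokens[:, : self.carriers.size(1)]                  # (B, K, dim)
        return {
            "verbs": self.verb_head(carriers),
            "objects": self.obj_head(carriers),
            "boxes": self.box_head(carriers).sigmoid(),
        }


if __name__ == "__main__":
    model = R3ViTSketch()
    out = model(torch.randn(2, 3, 224, 224))
    print({k: v.shape for k, v in out.items()})
```

During training, each carrier's predictions would presumably be matched against ground-truth human-object-verb triplets (for example via bipartite matching) so that every token receives an individual loss; that matching and loss step is omitted from the sketch.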
Pages: 9