Rethinking vision transformer through human-object interaction detection

Cited by: 3
Authors
Cheng, Yamin [1]
Zhao, Zitian [1]
Wang, Zhi [1]
Duan, Hancong [1]
Affiliations
[1] Univ Elect Sci & Technol China, Sch Comp Sci & Engn, Chengdu, Peoples R China
Keywords
Human-object interaction; Vision transformer; Network
DOI
10.1016/j.engappai.2023.106123
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Recent works have shown that Vision Transformer (ViT) models can achieve comparable or even superior performance on image- and region-level recognition tasks, i.e., image recognition and object detection. However, can a Vision Transformer perform region-level relationship reasoning with minimal information about the spatial geometry of the input images? To answer this question, we propose the Region-level Relationship Reasoning Vision Transformer (R3ViT), a family of human-object interaction detection models based on the vanilla Vision Transformer with as few revisions, common region priors, and inductive biases of the target task as possible. Specifically, we first divide the input images into several local patches, replace the specialized [CLS] token in the vanilla ViT with extra relationship semantics carrier tokens in an entanglement-, pair-, or triplet-wise manner, and compute both the representations and their relevance. We assign each extra token individual supervision and compute the training loss in a dense manner. We find that a Vision Transformer adjusted only by this paradigm can already reason about region-level visual relationships, e.g., R3ViT achieves strong performance on the challenging human-object interaction detection benchmark. We also discuss the impact of adjustment schemes and model scaling strategies for the Vision Transformer through R3ViT. Extensive experiments on several benchmarks show that the proposed framework outperforms most existing methods, achieving 28.91 mAP on HICO-DET and 56.8 mAP on V-COCO.
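The token layout described in the abstract (image patches plus extra relationship tokens in place of a single [CLS] token) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `patchify` and `build_sequence`, the patch size, the number of relationship tokens, and the random initialisation are all assumptions made for illustration, and a real model would learn the extra tokens and pass the sequence through transformer layers.

```python
import random


def patchify(image, patch):
    """Split an H x W image (a list of rows) into flattened patch x patch tokens."""
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0, "image must tile evenly"
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append([
                image[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ])
    return tokens


def build_sequence(image, patch, num_rel_tokens, seed=0):
    """Prepend hypothetical relationship tokens to the patch tokens.

    Assumption: the extra tokens stand in for the single [CLS] token and are
    randomly initialised here; in training they would be learnable parameters,
    each with its own supervision signal.
    """
    rng = random.Random(seed)
    patch_tokens = patchify(image, patch)
    dim = patch * patch
    rel_tokens = [
        [rng.gauss(0.0, 0.02) for _ in range(dim)]
        for _ in range(num_rel_tokens)
    ]
    return rel_tokens + patch_tokens


# Toy 8 x 8 single-channel image whose pixel value encodes its position.
image = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
sequence = build_sequence(image, patch=4, num_rel_tokens=3)
print(len(sequence), len(sequence[0]))
```

With a 4 x 4 patch size the toy image yields four patch tokens, so the sequence holds the three assumed relationship tokens followed by the four patch tokens, each of dimension 16.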
Pages: 9
Related papers
50 in total
  • [21] Distance Matters in Human-Object Interaction Detection
    Wang, Guangzhi
    Guo, Yangyang
    Wong, Yongkang
    Kankanhalli, Mohan
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 4546 - 4554
  • [22] Distillation Using Oracle Queries for Transformer-based Human-Object Interaction Detection
    Qu, Xian
    Ding, Changxing
    Li, Xingao
    Zhong, Xubin
    Tao, Dacheng
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 19536 - 19545
  • [23] Human-object interaction detection with missing objects
    Kogashi, Kaen
    Wu, Yang
    Nobuhara, Shohei
    Nishino, Ko
    IMAGE AND VISION COMPUTING, 2021, 113
  • [24] Diagnosing Rarity in Human-object Interaction Detection
    Kilickaya, Mert
    Smeulders, Arnold
    2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS (CVPRW 2020), 2020, : 3956 - 3960
  • [25] Human-Object Interaction Detection with Missing Objects
    Kogashi, Kaen
    Wu, Yang
    Nobuhara, Shohei
    Nishino, Ko
    PROCEEDINGS OF 17TH INTERNATIONAL CONFERENCE ON MACHINE VISION APPLICATIONS (MVA 2021), 2021
  • [26] Parallel Queries for Human-Object Interaction Detection
    Chen, Junwen
    Yanai, Keiji
    PROCEEDINGS OF THE 4TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA IN ASIA, MMASIA 2022, 2022
  • [27] Lifelong Learning for Human-Object Interaction Detection
    Sun, Bo
    Lu, Sixu
    He, Jun
    Yu, Lejun
    2022 IEEE 10TH INTERNATIONAL CONFERENCE ON INFORMATION, COMMUNICATION AND NETWORKS (ICICN 2022), 2022, : 582 - 587
  • [28] You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection
    Fang, Yuxin
    Liao, Bencheng
    Wang, Xinggang
    Fang, Jiemin
    Qi, Jiyang
    Wu, Rui
    Niu, Jianwei
    Liu, Wenyu
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021
  • [29] MSTR: Multi-Scale Transformer for End-to-End Human-Object Interaction Detection
    Kim, Bumsoo
    Mun, Jonghwan
    On, Kyoung-Woon
    Shin, Minchul
    Lee, Junhyun
    Kim, Eun-Sol
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 19556 - 19565
  • [30] Discovering Syntactic Interaction Clues for Human-Object Interaction Detection
    Lu, Jinguo
    Ren, Weihong
    Jiang, Weibo
    Chen, Xi'ai
    Wang, Qiang
    Han, Zhi
    Liu, Honghai
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024, : 28212 - 28222