Rethinking vision transformer through human-object interaction detection

Cited by: 3
Authors
Cheng, Yamin [1 ]
Zhao, Zitian [1 ]
Wang, Zhi [1 ]
Duan, Hancong [1 ]
Affiliations
[1] Univ Elect Sci & Technol China, Sch Comp Sci & Engn, Chengdu, Peoples R China
Keywords
Human-object interaction; Vision transformer; Network
DOI
10.1016/j.engappai.2023.106123
Chinese Library Classification (CLC)
TP [Automation and computer technology];
Discipline classification code
0812;
Abstract
Recent works have shown that Vision Transformer (ViT) models can achieve comparable or even superior performance on image- and region-level recognition tasks such as image recognition and object detection. However, can a Vision Transformer perform region-level relationship reasoning with minimal information about the spatial geometry of input images? To answer this question, we propose the Region-level Relationship Reasoning Vision Transformer (R3ViT), a family of human-object interaction detection models built on the vanilla Vision Transformer with the fewest possible revisions, common region priors, and inductive biases of the target task. Specifically, we first divide the input images into local patches, replace the specialized [CLS] token of vanilla ViT with extra relationship-semantics carrier tokens in an entanglement-/pair-/triplet-wise manner, and compute both the representations and their relevance. Each extra token is assigned its own supervision signal, and the training loss is computed in a dense manner. We find that a Vision Transformer adjusted only by this paradigm can already reason about region-level visual relationships; for example, R3ViT achieves strong performance on the challenging human-object interaction detection benchmarks. We also discuss the impact of adjustment schemes and model scaling strategies for the Vision Transformer through R3ViT. Extensive experiments on several benchmarks demonstrate that the proposed framework outperforms most existing methods, achieving 28.91 mAP on HICO-DET and 56.8 mAP on V-COCO.
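The token scheme described in the abstract (several relationship-carrier tokens replacing the single [CLS] token, each with its own supervision) can be sketched roughly as follows. This is a minimal, hypothetical PyTorch sketch assuming a standard ViT encoder; the class name R3ViTSketch, the number of carrier tokens, and the verb/object/box prediction heads are illustrative assumptions based only on the abstract, not the authors' released implementation.

```python
# Hypothetical sketch of the carrier-token idea from the abstract:
# a plain ViT encoder whose single [CLS] token is replaced by K learnable
# "relationship carrier" tokens, each receiving its own supervision heads.
# All names and head choices here are illustrative assumptions.
import torch
import torch.nn as nn


class R3ViTSketch(nn.Module):
    def __init__(self, img_size=224, patch_size=16, dim=768, depth=12,
                 heads=12, num_carriers=64, num_verbs=117, num_objects=80):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # Patch embedding: split the image into local patches and project them.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        # K carrier tokens replace the single [CLS] token of vanilla ViT.
        self.carriers = nn.Parameter(torch.randn(1, num_carriers, dim) * 0.02)
        self.pos_embed = nn.Parameter(torch.randn(1, num_patches + num_carriers, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        # Each carrier token is densely supervised through its own heads
        # (here: verb class, object class, and human/object boxes, as an assumption).
        self.verb_head = nn.Linear(dim, num_verbs)
        self.obj_head = nn.Linear(dim, num_objects + 1)   # +1 for "no object"
        self.box_head = nn.Linear(dim, 8)                 # human box + object box

    def forward(self, images):
        b = images.size(0)
        patches = self.patch_embed(images).flatten(2).transpose(1, 2)  # (B, N, dim)
        tokens = torch.cat([self.carriers.expand(b, -1, -1), patches], dim=1)
        tokens = self.encoder(tokens + self.pos_embed)
        carriers = tokens[:, : self.carriers.size(1)]                  # (B, K, dim)
        return {
            "verbs": self.verb_head(carriers),
            "objects": self.obj_head(carriers),
            "boxes": self.box_head(carriers).sigmoid(),
        }


if __name__ == "__main__":
    model = R3ViTSketch()
    out = model(torch.randn(2, 3, 224, 224))
    print({k: v.shape for k, v in out.items()})
```

During training, each carrier's predictions would presumably be matched against ground-truth human-object-verb triplets (for example via bipartite matching) so that every token receives an individual loss; that matching and loss step is omitted from the sketch.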
Pages: 9