Rethinking vision transformer through human-object interaction detection

Cited by: 3

Authors:

Cheng, Yamin [1]

Zhao, Zitian [1]

Wang, Zhi [1]

Duan, Hancong [1]

Affiliations:

[1] Univ Elect Sci & Technol China, Sch Comp Sci & Engn, Chengdu, Peoples R China

Source:

ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE | 2023 / Vol. 122

Keywords:

Human-object interaction; Vision transformer; NETWORK;

DOI:

10.1016/j.engappai.2023.106123

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Discipline classification code:

0812;

Abstract:

Recent works have shown that Vision Transformer (ViT) models can achieve comparable or even superior performance on image- and region-level recognition tasks, i.e., image recognition and object detection. However, can Vision Transformer perform region-level relationship reasoning with minimal information about the spatial geometry formation of input images? To answer this question, we propose the Region-level Relationship Reasoning Vision Transformer (R3ViT), a family of human-object interaction detection models based on the vanilla Vision Transformer with the fewest possible revisions, common region priors, and inductive biases of the objective task. Specifically, we first divide the input images into several local patches, replace the specialized [CLS] token in vanilla ViT with extra relationship semantics carrier tokens in an entanglement-/pair-/triplet-wise manner, and calculate both their representations and their relevance. We assign each extra token individual supervision and compute the training loss in a dense manner. We find that the vision transformer, simply adjusted by this paradigm, can already reason about region-level visual relationships, e.g., R3ViT achieves strong performance on the challenging human-object interaction detection benchmark. We also discuss the impacts of adjustment schemes and model scaling strategies for Vision Transformer through R3ViT. Extensive experiments on several benchmarks demonstrate that our proposed framework outperforms most existing methods, achieving 28.91 mAP on HICO-DET and 56.8 mAP on V-COCO.
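The input construction described in the abstract (split the image into local patches, then replace the single [CLS] token with several extra relationship tokens) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `patchify` and `build_token_sequence`, the patch size, the embedding width, the number of relationship tokens, and the random stand-ins for learned weights are all assumptions chosen for illustration. The transformer encoder, the relevance computation, and the per-token supervision are omitted.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into non-overlapping, flattened patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)  # (rows, cols, patch, patch, C)
    return grid.reshape(-1, patch * patch * c)

def build_token_sequence(image, patch, dim, num_rel_tokens, rng):
    """Prepend several relationship tokens (instead of one [CLS]) to patch embeddings."""
    patches = patchify(image, patch)
    # Random stand-in for a learned linear patch embedding (assumption).
    proj = rng.standard_normal((patches.shape[1], dim)) * 0.02
    # Random stand-in for learned relationship tokens (assumption).
    rel_tokens = rng.standard_normal((num_rel_tokens, dim)) * 0.02
    return np.concatenate([rel_tokens, patches @ proj], axis=0)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))
seq = build_token_sequence(image, patch=16, dim=64, num_rel_tokens=10, rng=rng)
print(seq.shape)  # (206, 64): 10 relationship tokens + 14 * 14 patch tokens
```

In a full model, each of the first `num_rel_tokens` output positions would presumably feed its own prediction head, which is one way to read the abstract's "individual supervision" per extra token.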


Pages: 9

50 entries in total

[31] Improved Human-Object Interaction Detection Through On-the-Fly Stacked Generalization
Lee, Geonu
Yun, Kimin
Cho, Jungchan
IEEE ACCESS, 2021, 9 : 34251 - 34263
[32] Learning Human-Object Interaction Detection using Interaction Points
Wang, Tiancai
Yang, Tong
Danelljan, Martin
Khan, Fahad Shahbaz
Zhang, Xiangyu
Sun, Jian
2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2020 : 4115 - 4124
[33] A Survey of Human-Object Interaction Detection With Deep Learning
Han, Geng
Zhao, Jiachen
Zhang, Lele
Deng, Fang
IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTATIONAL INTELLIGENCE, 2025, 9 (01) : 3 - 26
[34] Relational Context Learning for Human-Object Interaction Detection
Kim, Sanghyun
Jung, Deunsol
Cho, Minsu
2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023 : 2925 - 2934
[35] Neural-Logic Human-Object Interaction Detection
Li, Liulei
Wei, Jianan
Wang, Wenguan
Yang, Yi
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023
[36] Structured LSTM for Human-Object Interaction Detection and Anticipation
Anh Minh Truong
Yoshitaka, Atsuo
2017 14TH IEEE INTERNATIONAL CONFERENCE ON ADVANCED VIDEO AND SIGNAL BASED SURVEILLANCE (AVSS), 2017
[37] Deep Contextual Attention for Human-Object Interaction Detection
Wang, Tiancai
Anwer, Rao Muhammad
Khan, Muhammad Haris
Khan, Fahad Shahbaz
Pang, Yanwei
Shao, Ling
Laaksonen, Jorma
2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019 : 5693 - 5701
[38] Spatial-Net for Human-Object Interaction Detection
Mansour, Ahmed E.
Mohammed, Ammar
Elsayed, Hussein Abd El Atty
Elramly, Salwa
IEEE ACCESS, 2022, 10 : 88920 - 88931
[39] Parallel disentangling network for human-object interaction detection
Cheng, Yamin
Duan, Hancong
Wang, Chen
Chen, Zhijun
PATTERN RECOGNITION, 2024, 146
[40] Human-Object Interaction Detection Based on Star Graph
Cai, Shuang
Ma, Shiwei
Gu, Dongzhou
Wang, Chang
INTERNATIONAL JOURNAL OF PATTERN RECOGNITION AND ARTIFICIAL INTELLIGENCE, 2022, 36 (09)
