Discovering Syntactic Interaction Clues for Human-Object Interaction Detection

Cited by: 2

Authors:

Lu, Jinguo [1]

Ren, Weihong [1,2]

Jiang, Weibo [1]

Chen, Xi'ai [2,3]

Wang, Qiang [4,5]

Han, Zhi [2,3]

Liu, Honghai [1]

Affiliations:

[1] Harbin Inst Technol, Shenzhen, Peoples R China

[2] Shenyang Univ, Shenyang, Peoples R China

[3] Chinese Acad Sci, Shenyang Inst Automat, State Key Lab Robot, Shenyang, Peoples R China

[4] Chinese Acad Sci, Inst Robot, Beijing, Peoples R China

[5] Chinese Acad Sci, Inst Intelligent Mfg, Beijing, Peoples R China

Source:

2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2024

Funding:

National Natural Science Foundation of China

Keywords:

DOI:

10.1109/CVPR52733.2024.02665

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Recently, Vision-Language Models (VLMs) have greatly advanced Human-Object Interaction (HOI) detection. Existing VLM-based HOI detectors typically adopt a handcrafted template (e.g., "a photo of a person [action] a/an [object]") to acquire text knowledge through the VLM text encoder. However, such approaches encode action-specific text prompts only at the vocabulary level, and may suffer from learning ambiguity because they do not explore fine-grained clues from the perspective of interaction context. In this paper, we propose a novel method that uses a VLM to discover Syntactic Interaction Clues for HOI detection (SICHOI). Specifically, we first investigate which elements are essential to an interaction context, and then establish a syntactic interaction bank at three levels: spatial relationship, action-oriented posture, and situational condition. Further, to align visual features with the syntactic interaction bank, we adopt a multi-view extractor that jointly aggregates visual features at the instance, interaction, and image levels. In addition, we introduce a dual cross-attention decoder that performs context propagation between text knowledge and visual features, thereby enhancing HOI detection. Experimental results demonstrate that the proposed method achieves state-of-the-art performance on HICO-DET and V-COCO.
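
As a rough illustration of the contrast the abstract draws, the sketch below builds the handcrafted template prompt and then a set of three prompts, one per level of the syntactic interaction bank. Only the template wording and the three level names come from the abstract. The function names, the `InteractionClues` class, the way clues are appended to the template, and the example clue sentences are all hypothetical, and the paper's actual bank construction may differ.

```python
from dataclasses import dataclass


def article(noun: str) -> str:
    """Choose 'an' before a vowel letter, otherwise 'a' (a spelling heuristic)."""
    return "an" if noun[:1].lower() in "aeiou" else "a"


def template_prompt(action: str, obj: str) -> str:
    """Handcrafted template quoted in the abstract."""
    return f"a photo of a person {action} {article(obj)} {obj}"


@dataclass
class InteractionClues:
    """Hypothetical container for the three clue levels named in the abstract."""
    spatial_relationship: str
    action_oriented_posture: str
    situational_condition: str


def bank_prompts(action: str, obj: str, clues: InteractionClues) -> dict:
    """Return one prompt per level; appending the clue is an assumed format."""
    base = template_prompt(action, obj)
    return {
        "spatial relationship": f"{base}, {clues.spatial_relationship}",
        "action-oriented posture": f"{base}, {clues.action_oriented_posture}",
        "situational condition": f"{base}, {clues.situational_condition}",
    }


# Illustrative clue sentences, invented for this example only.
clues = InteractionClues(
    spatial_relationship="the person is above the bicycle",
    action_oriented_posture="the hands grip the handlebars",
    situational_condition="on a road outdoors",
)
prompts = bank_prompts("riding", "bicycle", clues)
for level, text in prompts.items():
    print(f"{level}: {text}")
```

In a VLM-based pipeline, each of these strings would be passed to the text encoder; that step is omitted here because the abstract does not specify the encoder interface.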


Pages: 28212-28222

Page count: 11
