End-to-End Zero-Shot HOI Detection via Vision and Language Knowledge Distillation

Cited by: 0
Authors
Wu, Mingrui [1 ,2 ]
Gu, Jiaxin [3 ]
Shen, Yunhang [2 ]
Lin, Mingbao [2 ]
Chen, Chao [2 ]
Sun, Xiaoshuai [1 ,4 ,5 ]
Affiliations
[1] Xiamen Univ, Sch Informat, MAC Lab, Xiamen, Peoples R China
[2] Tencent, Youtu Lab, Shenzhen, Peoples R China
[3] VIS Baidu Inc, Beijing, Peoples R China
[4] Xiamen Univ, Inst Artificial Intelligence, Xiamen, Peoples R China
[5] Xiamen Univ, Fujian Engn Res Ctr Trusted Artificial Intelligen, Xiamen, Peoples R China
Source
THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 3 | 2023
Funding
National Natural Science Foundation of China;
Keywords
DOI
None available
Chinese Library Classification (CLC) number
TP18 [Theory of Artificial Intelligence];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Most existing Human-Object Interaction (HOI) detection methods rely heavily on full annotations with predefined HOI categories, which are limited in diversity and costly to scale further. We aim to advance zero-shot HOI detection so that both seen and unseen HOIs are detected simultaneously. The fundamental challenges are to discover potential human-object pairs and to identify novel HOI categories. To overcome these challenges, we propose a novel End-to-end zero-shot HOI Detection (EoID) framework via vision-language knowledge distillation. We first design an Interactive Score module combined with a Two-stage Bipartite Matching algorithm to distinguish interactive human-object pairs in an action-agnostic manner. Then we transfer the distribution of action probabilities from the pretrained vision-language teacher, together with the seen ground truth, to the HOI model to attain zero-shot HOI classification. Extensive experiments on the HICO-Det dataset demonstrate that our model discovers potential interactive pairs and enables the recognition of unseen HOIs. Finally, our EoID outperforms previous state-of-the-art methods under various zero-shot settings. Moreover, our method generalizes to large-scale object detection data, further scaling up the action set. The source code is available at: https://github.com/mrwu-mac/EoID.
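The distillation objective described in the abstract can be sketched as follows. This is a hypothetical, simplified illustration, not the authors' released implementation (see the linked repository for that): function names, tensor shapes, the temperature, and the mixing weight alpha are assumptions. The idea shown is that the student's per-pair action distribution is pulled toward the frozen vision-language teacher's action probabilities via a KL term, while seen action classes also receive standard supervision from ground-truth labels.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_probs, seen_labels, seen_mask,
                      temperature=2.0, alpha=0.5):
    # Hypothetical sketch of vision-language knowledge distillation for
    # zero-shot HOI action classification. Shapes (assumed):
    #   student_logits: [N, A] action logits for N human-object pairs
    #   teacher_probs:  [N, A] action probabilities from the frozen VL teacher
    #   seen_labels:    [N, A] multi-hot ground-truth labels (zeros for unseen actions)
    #   seen_mask:      [A]    1 for seen action classes, 0 for unseen ones

    # Distillation term: match the teacher's distribution over all actions
    # (seen and unseen) with a temperature-scaled KL divergence.
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    kl = F.kl_div(log_p_student, teacher_probs, reduction="batchmean") * temperature ** 2

    # Supervised term: binary cross-entropy on seen action classes only,
    # masked so that unseen classes contribute no gradient here.
    bce = F.binary_cross_entropy_with_logits(
        student_logits,
        seen_labels.float(),
        weight=seen_mask.float().expand_as(student_logits),
        reduction="mean",
    )

    return alpha * kl + (1.0 - alpha) * bce

In this sketch, the teacher probabilities would come from scoring cropped human-object regions against action text prompts with a CLIP-like model; that step is omitted here for brevity.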
Pages: 2839 - 2846
Page count: 8
Related papers
50 records in total
  • [41] Boosting End-to-end Multi-Object Tracking and Person Search via Knowledge Distillation
    Zhang, Wei
    He, Lingxiao
    Cheng, Peng
    Liao, Xingyu
    Liu, Wu
    Li, Qi
    Sun, Zhenan
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 1192 - 1201
  • [42] Sequence-Level Knowledge Distillation for Class-Incremental End-to-End Spoken Language Understanding
    Cappellazzo, Umberto
    Yang, Muqiao
    Falavigna, Daniele
    Brutti, Alessio
    INTERSPEECH 2023, 2023, : 2953 - 2957
  • [43] Zero-shot test time adaptation via knowledge distillation for personalized speech denoising and dereverberation
    Kim, Sunwoo
    Athi, Mrudula
    Shi, Guangji
    Kim, Minje
    Kristjansson, Trausti
JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA, 2024, 155 (02): 1353 - 1367
  • [44] Source and Target Bidirectional Knowledge Distillation for End-to-end Speech Translation
    Inaguma, Hirofumi
    Kawahara, Tatsuya
    Watanabe, Shinji
    2021 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL-HLT 2021), 2021, : 1872 - 1881
  • [45] TutorNet: Towards Flexible Knowledge Distillation for End-to-End Speech Recognition
    Yoon, Ji Won
    Lee, Hyeonseung
    Kim, Hyung Yong
    Cho, Won Ik
    Kim, Nam Soo
IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2021, 29: 1626 - 1638
  • [46] A Lightweight Framework With Knowledge Distillation for Zero-Shot Mars Scene Classification
    Tan, Xiaomeng
    Xi, Bobo
    Xu, Haitao
    Li, Jiaojiao
    Li, Yunsong
    Xue, Changbin
    Chanussot, Jocelyn
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2024, 62
  • [47] Zero-Shot Hashing via Transferring Supervised Knowledge
    Yang, Yang
    Luo, Yadan
    Chen, Weilun
    Shen, Fumin
    Shao, Jie
    Shen, Heng Tao
    MM'16: PROCEEDINGS OF THE 2016 ACM MULTIMEDIA CONFERENCE, 2016, : 1286 - 1295
  • [48] Zero-shot Learning via Recurrent Knowledge Transfer
    Zhao, Bo
    Sun, Xinwei
    Hong, Xiaopeng
    Yao, Yuan
    Wang, Yizhou
    2019 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2019, : 1308 - 1317
  • [49] END-TO-END VOICE CONVERSION VIA CROSS-MODAL KNOWLEDGE DISTILLATION FOR DYSARTHRIC SPEECH RECONSTRUCTION
    Wang, Disong
    Yu, Jianwei
    Wu, Xixin
    Liu, Songxiang
Sun, Lifa
    Liu, Xunying
    Meng, Helen
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 7744 - 7748
  • [50] Multi-domain Knowledge Distillation via Uncertainty-Matching for End-to-End ASR Models
    Kim, Ho-Gyeong
    Lee, Min-Joong
    Lee, Hoshik
    Kang, Tae Gyoon
    Lee, Jihyun
    Yang, Eunho
    Hwang, Sung Ju
    INTERSPEECH 2021, 2021, : 2531 - 2535