End-to-End Zero-Shot HOI Detection via Vision and Language Knowledge Distillation

Cited by: 0
Authors
Wu, Mingrui [1 ,2 ]
Gu, Jiaxin [3 ]
Shen, Yunhang [2 ]
Lin, Mingbao [2 ]
Chen, Chao [2 ]
Sun, Xiaoshuai [1 ,4 ,5 ]
Affiliations
[1] Xiamen Univ, Sch Informat, MAC Lab, Xiamen, Peoples R China
[2] Tencent, Youtu Lab, Shenzhen, Peoples R China
[3] VIS Baidu Inc, Beijing, Peoples R China
[4] Xiamen Univ, Inst Artificial Intelligence, Xiamen, Peoples R China
[5] Xiamen Univ, Fujian Engn Res Ctr Trusted Artificial Intelligen, Xiamen, Peoples R China
Source
THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 3 | 2023
Funding
National Natural Science Foundation of China;
Keywords
DOI
None available
Chinese Library Classification (CLC) number
TP18 [Theory of Artificial Intelligence];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Most existing Human-Object Interaction (HOI) detection methods rely heavily on full annotations with predefined HOI categories, which are limited in diversity and costly to scale further. We aim to advance zero-shot HOI detection so that both seen and unseen HOIs are detected simultaneously. The fundamental challenges are to discover potential human-object pairs and to identify novel HOI categories. To overcome these challenges, we propose a novel End-to-end zero-shot HOI Detection (EoID) framework via vision-language knowledge distillation. We first design an Interactive Score module combined with a Two-stage Bipartite Matching algorithm to distinguish interactive human-object pairs in an action-agnostic manner. Then we transfer the distribution of action probabilities from the pretrained vision-language teacher, together with the seen ground truth, to the HOI model to attain zero-shot HOI classification. Extensive experiments on the HICO-Det dataset demonstrate that our model discovers potential interactive pairs and enables the recognition of unseen HOIs. Finally, our EoID outperforms previous state-of-the-art methods under various zero-shot settings. Moreover, our method generalizes to large-scale object detection data, further scaling up the action set. The source code is available at: https://github.com/mrwu-mac/EoID.
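The distillation objective described in the abstract can be sketched as follows. This is a hypothetical, simplified illustration, not the authors' released implementation (see the linked repository for that): function names, tensor shapes, the temperature, and the mixing weight alpha are assumptions. The idea shown is that the student's per-pair action distribution is pulled toward the frozen vision-language teacher's action probabilities via a KL term, while seen action classes also receive standard supervision from ground-truth labels.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_probs, seen_labels, seen_mask,
                      temperature=2.0, alpha=0.5):
    # Hypothetical sketch of vision-language knowledge distillation for
    # zero-shot HOI action classification. Shapes (assumed):
    #   student_logits: [N, A] action logits for N human-object pairs
    #   teacher_probs:  [N, A] action probabilities from the frozen VL teacher
    #   seen_labels:    [N, A] multi-hot ground-truth labels (zeros for unseen actions)
    #   seen_mask:      [A]    1 for seen action classes, 0 for unseen ones

    # Distillation term: match the teacher's distribution over all actions
    # (seen and unseen) with a temperature-scaled KL divergence.
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    kl = F.kl_div(log_p_student, teacher_probs, reduction="batchmean") * temperature ** 2

    # Supervised term: binary cross-entropy on seen action classes only,
    # masked so that unseen classes contribute no gradient here.
    bce = F.binary_cross_entropy_with_logits(
        student_logits,
        seen_labels.float(),
        weight=seen_mask.float().expand_as(student_logits),
        reduction="mean",
    )

    return alpha * kl + (1.0 - alpha) * bce

In this sketch, the teacher probabilities would come from scoring cropped human-object regions against action text prompts with a CLIP-like model; that step is omitted here for brevity.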
Pages: 2839 - 2846
Page count: 8
Related papers
50 records in total
  • [41] Boosting End-to-end Multi-Object Tracking and Person Search via Knowledge Distillation
    Zhang, Wei
    He, Lingxiao
    Cheng, Peng
    Liao, Xingyu
    Liu, Wu
    Li, Qi
    Sun, Zhenan
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 1192 - 1201
  • [42] Sequence-Level Knowledge Distillation for Class-Incremental End-to-End Spoken Language Understanding
    Cappellazzo, Umberto
    Yang, Muqiao
    Falavigna, Daniele
    Brutti, Alessio
    INTERSPEECH 2023, 2023, : 2953 - 2957
  • [43] Zero-shot test time adaptation via knowledge distillation for personalized speech denoising and dereverberation
    Kim, Sunwoo
    Athi, Mrudula
    Shi, Guangji
    Kim, Minje
    Kristjansson, Trausti
JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA, 2024, 155 (02): 1353 - 1367
  • [44] Source and Target Bidirectional Knowledge Distillation for End-to-end Speech Translation
    Inaguma, Hirofumi
    Kawahara, Tatsuya
    Watanabe, Shinji
    2021 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL-HLT 2021), 2021, : 1872 - 1881
  • [45] TutorNet: Towards Flexible Knowledge Distillation for End-to-End Speech Recognition
    Yoon, Ji Won
    Lee, Hyeonseung
    Kim, Hyung Yong
    Cho, Won Ik
    Kim, Nam Soo
IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2021, 29: 1626 - 1638
  • [46] A Lightweight Framework With Knowledge Distillation for Zero-Shot Mars Scene Classification
    Tan, Xiaomeng
    Xi, Bobo
    Xu, Haitao
    Li, Jiaojiao
    Li, Yunsong
    Xue, Changbin
    Chanussot, Jocelyn
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2024, 62
  • [47] Zero-Shot Hashing via Transferring Supervised Knowledge
    Yang, Yang
    Luo, Yadan
    Chen, Weilun
    Shen, Fumin
    Shao, Jie
    Shen, Heng Tao
    MM'16: PROCEEDINGS OF THE 2016 ACM MULTIMEDIA CONFERENCE, 2016, : 1286 - 1295
  • [48] Zero-shot Learning via Recurrent Knowledge Transfer
    Zhao, Bo
    Sun, Xinwei
    Hong, Xiaopeng
    Yao, Yuan
    Wang, Yizhou
    2019 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2019, : 1308 - 1317
  • [49] END-TO-END VOICE CONVERSION VIA CROSS-MODAL KNOWLEDGE DISTILLATION FOR DYSARTHRIC SPEECH RECONSTRUCTION
    Wang, Disong
    Yu, Jianwei
    Wu, Xixin
    Liu, Songxiang
Sun, Lifa
    Liu, Xunying
    Meng, Helen
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 7744 - 7748
  • [50] Multi-domain Knowledge Distillation via Uncertainty-Matching for End-to-End ASR Models
    Kim, Ho-Gyeong
    Lee, Min-Joong
    Lee, Hoshik
    Kang, Tae Gyoon
    Lee, Jihyun
    Yang, Eunho
    Hwang, Sung Ju
    INTERSPEECH 2021, 2021, : 2531 - 2535