Video Visual Relation Detection via Multi-modal Feature Fusion

Cited by: 32

Authors:

Sun, Xu [1,2]

Ren, Tongwei [1,2]

Zi, Yuan [1]

Wu, Gangshan [1]

Affiliations:

[1] Nanjing Univ, State Key Lab Novel Software Technol, Nanjing, Peoples R China

[2] Nanjing Univ, Shenzhen Res Inst, Shenzhen, Peoples R China

Source:

PROCEEDINGS OF THE 27TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA (MM'19) | 2019

Funding:

National Science Foundation (United States);

Keywords:

Video visual relation detection; object trajectory detection; relation prediction;

DOI:

10.1145/3343031.3356076

Chinese Library Classification (CLC) number:

TP39 [Computer applications];

Discipline classification code:

081203 ; 0835 ;

Abstract:

Video visual relation detection is a meaningful research problem that aims to build a bridge between dynamic vision and language. In this paper, we propose a novel video visual relation detection method with multi-modal feature fusion. First, we densely detect objects in each frame with the state-of-the-art video object detection model, flow-guided feature aggregation (FGFA), and generate object trajectories by linking the temporally independent detections with Seq-NMS and a KCF tracker. Next, we break the relation candidates, i.e., co-occurring object trajectory pairs, into short-term segments and predict relations from spatial-temporal features and language context features. Finally, we greedily associate the short-term relation segments into complete relation instances. The experimental results show that our proposed method outperforms other methods by a large margin, which also earned us first place in the visual relation detection task of the Video Relation Understanding Challenge (VRU), ACMMM 2019.
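The final step of the abstract, greedy association of short-term relation segments, can be illustrated with a minimal sketch. The abstract does not give the merging rule, so everything here is an assumption made for illustration: the `Segment` fields, the `associate` function, the `max_gap` parameter, the rule of merging segments that share the same trajectory pair and predicate when they touch in time, and the use of the mean score for a merged instance.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """A short-term relation prediction (all field names are hypothetical)."""
    subj_tid: int    # subject trajectory id
    predicate: str   # predicted relation label
    obj_tid: int     # object trajectory id
    start: int       # first frame, inclusive
    end: int         # last frame, exclusive
    score: float     # prediction confidence


def associate(segments: List[Segment], max_gap: int = 0) -> List[Segment]:
    """Greedily merge temporally adjacent segments with the same triplet.

    Assumed rule: within one (subject, predicate, object) group, segments are
    scanned in temporal order and merged while the gap to the running
    instance is at most `max_gap` frames.
    """
    groups = defaultdict(list)
    for seg in segments:
        groups[(seg.subj_tid, seg.predicate, seg.obj_tid)].append(seg)

    instances: List[Segment] = []
    for (subj, pred, obj), segs in groups.items():
        segs = sorted(segs, key=lambda s: s.start)
        start, end, scores = segs[0].start, segs[0].end, [segs[0].score]
        for seg in segs[1:]:
            if seg.start - end <= max_gap:
                # Close enough in time: extend the running instance.
                end = max(end, seg.end)
                scores.append(seg.score)
            else:
                # Gap too large: emit the running instance, start a new one.
                instances.append(
                    Segment(subj, pred, obj, start, end, sum(scores) / len(scores))
                )
                start, end, scores = seg.start, seg.end, [seg.score]
        instances.append(
            Segment(subj, pred, obj, start, end, sum(scores) / len(scores))
        )

    # Rank the merged relation instances by confidence, highest first.
    instances.sort(key=lambda s: s.score, reverse=True)
    return instances


if __name__ == "__main__":
    demo = [
        Segment(0, "chase", 1, 0, 30, 0.8),
        Segment(0, "chase", 1, 30, 60, 0.6),
        Segment(0, "chase", 1, 90, 120, 0.9),
        Segment(0, "ride", 2, 0, 30, 0.5),
    ]
    for inst in associate(demo):
        print(inst)
```

In this sketch the two touching "chase" segments become one instance, while the later "chase" segment stays separate because of the 30-frame gap. Other score aggregations or overlap criteria would be equally plausible readings of the abstract.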


Pages: 2657-2661

Page count: 5

