Video Visual Relation Detection via Multi-modal Feature Fusion

Cited by: 32
Authors
Sun, Xu [1, 2]
Ren, Tongwei [1, 2]
Zi, Yuan [1 ]
Wu, Gangshan [1 ]
Affiliations
[1] Nanjing Univ, State Key Lab Novel Software Technol, Nanjing, Peoples R China
[2] Nanjing Univ, Shenzhen Res Inst, Shenzhen, Peoples R China
Funding
National Science Foundation (United States);
Keywords
Video visual relation detection; object trajectory detection; relation prediction;
DOI
10.1145/3343031.3356076
Chinese Library Classification number
TP39 [Computer applications];
Discipline classification code
081203 ; 0835 ;
Abstract
Video visual relation detection is a meaningful research problem that aims to build a bridge between dynamic vision and language. In this paper, we propose a novel video visual relation detection method with multi-modal feature fusion. First, we densely detect objects in each frame with the state-of-the-art video object detection model, flow-guided feature aggregation (FGFA), and generate object trajectories by linking the temporally independent objects with Seq-NMS and a KCF tracker. Next, we break the relation candidates, i.e., co-occurring object trajectory pairs, into short-term segments and predict relations with spatial-temporal features and language context features. Finally, we greedily associate the short-term relation segments into complete relation instances. The experimental results show that our proposed method outperforms other methods by a large margin, which also earned us first place in the visual relation detection task of the Video Relation Understanding Challenge (VRU), ACM MM 2019.
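The final step of the abstract, greedily associating short-term relation segments into complete relation instances, can be illustrated with a minimal sketch. The abstract does not state the merging criterion, so everything below is an assumption made for illustration: the `RelationSegment` type, the `pair_id` field, the rule that segments merge when they share a triplet and trajectory pair and overlap or touch in time, and the use of the mean score for the merged instance. It is not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RelationSegment:
    # All fields are hypothetical names chosen for this sketch.
    triplet: Tuple[str, str, str]  # (subject, predicate, object)
    pair_id: int                   # id of the object trajectory pair
    start: int                     # first frame, inclusive
    end: int                       # last frame, exclusive
    score: float                   # prediction confidence


def greedy_associate(segments: List[RelationSegment]) -> List[RelationSegment]:
    """Greedily merge short-term segments into longer relation instances.

    Assumed rule: take the highest-scoring unused segment as a seed, then
    absorb every unused segment with the same triplet and trajectory pair
    whose frame range overlaps or touches the growing instance.
    """
    pending = sorted(segments, key=lambda s: s.score, reverse=True)
    used = [False] * len(pending)
    merged: List[RelationSegment] = []

    for i, seed in enumerate(pending):
        if used[i]:
            continue
        used[i] = True
        start, end, scores = seed.start, seed.end, [seed.score]

        grew = True
        while grew:  # repeat until no further segment can be attached
            grew = False
            for j, cand in enumerate(pending):
                if used[j]:
                    continue
                if cand.triplet != seed.triplet or cand.pair_id != seed.pair_id:
                    continue
                if cand.start <= end and cand.end >= start:
                    start, end = min(start, cand.start), max(end, cand.end)
                    scores.append(cand.score)
                    used[j] = True
                    grew = True

        # Assumed scoring: the instance score is the mean of its segments.
        merged.append(
            RelationSegment(seed.triplet, seed.pair_id, start, end,
                            sum(scores) / len(scores))
        )
    return merged
```

With this rule, three overlapping 30-frame segments of the same triplet on the same trajectory pair collapse into one instance spanning their union, while a segment separated by a temporal gap stays a separate instance.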
Pages: 2657 - 2661
Page count: 5
Related papers
50 in total
  • [1] Online video visual relation detection with hierarchical multi-modal fusion
    He, Yuxuan
    Gan, Ming-Gang
    Ma, Qianzhao
    MULTIMEDIA TOOLS AND APPLICATIONS, 2024, 83 (24) : 65707 - 65727
  • [2] Enhancing multi-modal fusion in visual dialog via sample debiasing and feature interaction
    Lu, Chenyu
    Yin, Jun
    Yang, Hao
    Sun, Shiliang
    INFORMATION FUSION, 2024, 107
  • [3] Multi-modal fusion for video understanding
    Hoogs, A
    Mundy, J
    Cross, G
    30TH APPLIED IMAGERY PATTERN RECOGNITION WORKSHOP, PROCEEDINGS: ANALYSIS AND UNDERSTANDING OF TIME VARYING IMAGERY, 2001, : 103 - 108
  • [4] Multi-level and Multi-modal Target Detection Based on Feature Fusion
    Cheng T.
    Sun L.
    Hou D.
    Shi Q.
    Zhang J.
    Chen J.
    Huang H.
    Qiche Gongcheng/Automotive Engineering, 2021, 43 (11) : 1602 - 1610
  • [5] A Novel Deep Multi-Modal Feature Fusion Method for Celebrity Video Identification
    Chen, Jianrong
    Yang, Li
    Xu, Yuanyuan
    Huo, Jing
    Shi, Yinghuan
    Gao, Yang
    PROCEEDINGS OF THE 27TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA (MM'19), 2019, : 2535 - 2538
  • [6] Citrus Huanglongbing Detection Based on Multi-Modal Feature Fusion Learning
    Yang, Dongzi
    Wang, Fengcheng
    Hu, Yuqi
    Lan, Yubin
    Deng, Xiaoling
    FRONTIERS IN PLANT SCIENCE, 2021, 12
  • [7] MULTI-MODAL FEATURE FUSION NETWORK FOR GHOST IMAGING OBJECT DETECTION
    Hu, Nan
    Ma, Huimin
    Le, Chao
    Shao, Xuehui
    2018 25TH IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2018, : 351 - 355
  • [8] Video Relation Detection with Trajectory-aware Multi-modal Features
    Xie, Wentao
    Ren, Guanghui
    Liu, Si
    MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, 2020, : 4590 - 4594
  • [9] Visual-guided hierarchical iterative fusion for multi-modal video action
    Zhang, Bingbing
    Zhang, Ying
    Zhang, Jianxin
    Sun, Qiule
    Wang, Rong
    Zhang, Qiang
    PATTERN RECOGNITION LETTERS, 2024, 186 : 213 - 220
  • [10] Learning Visual Emotion Distributions via Multi-Modal Features Fusion
    Zhao, Sicheng
    Ding, Guiguang
    Gao, Yue
    Han, Jungong
    PROCEEDINGS OF THE 2017 ACM MULTIMEDIA CONFERENCE (MM'17), 2017, : 369 - 377