Video Visual Relation Detection via Multi-modal Feature Fusion

Cited by: 32
Authors
Sun, Xu [1, 2]
Ren, Tongwei [1, 2]
Zi, Yuan [1 ]
Wu, Gangshan [1 ]
Affiliations
[1] Nanjing Univ, State Key Lab Novel Software Technol, Nanjing, Peoples R China
[2] Nanjing Univ, Shenzhen Res Inst, Shenzhen, Peoples R China
Funding
National Science Foundation (United States)
Keywords
Video visual relation detection; object trajectory detection; relation prediction;
DOI
10.1145/3343031.3356076
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
Video visual relation detection is a meaningful research problem that aims to build a bridge between dynamic vision and language. In this paper, we propose a novel video visual relation detection method with multi-modal feature fusion. First, we densely detect objects in each frame with the state-of-the-art video object detection model, flow-guided feature aggregation (FGFA), and generate object trajectories by linking the temporally independent detections with Seq-NMS and a KCF tracker. Next, we break the relation candidates, i.e., co-occurring object trajectory pairs, into short-term segments and predict relations from spatial-temporal features and language context features. Finally, we greedily associate the short-term relation segments into complete relation instances. The experimental results show that our proposed method outperforms other methods by a large margin, which also earned us first place in the visual relation detection task of the Video Relation Understanding Challenge (VRU), ACM MM 2019.
Pages: 2657-2661
Page count: 5
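
The abstract's final step, greedily associating short-term relation segments into complete relation instances, is not specified in detail in this record. The following is only a minimal sketch of one plausible reading: segments are visited from highest to lowest score, and each unvisited segment absorbs every segment with the same (subject, predicate, object) triplet that touches or overlaps it in time. The `Segment` class, the `greedy_associate` function, the overlap rule, and the mean-score aggregation are all illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Segment:
    """A short-term relation prediction (hypothetical structure)."""
    triplet: Tuple[str, str, str]  # (subject trajectory id, predicate, object trajectory id)
    start: int                     # first frame, inclusive
    end: int                       # last frame, exclusive
    score: float                   # prediction confidence


def greedy_associate(segments: List[Segment]) -> List[Segment]:
    """Merge same-triplet segments that touch or overlap, highest score first.

    Assumption: the merged instance's score is the mean of its members' scores.
    """
    pending = sorted(segments, key=lambda s: s.score, reverse=True)
    used = [False] * len(pending)
    merged: List[Segment] = []

    for i, seed in enumerate(pending):
        if used[i]:
            continue
        used[i] = True
        start, end, scores = seed.start, seed.end, [seed.score]

        # Keep growing the instance until no remaining segment can be attached.
        grew = True
        while grew:
            grew = False
            for j, cand in enumerate(pending):
                if used[j] or cand.triplet != seed.triplet:
                    continue
                # Touching or overlapping in time with the current instance.
                if cand.start <= end and cand.end >= start:
                    start, end = min(start, cand.start), max(end, cand.end)
                    scores.append(cand.score)
                    used[j] = True
                    grew = True

        merged.append(Segment(seed.triplet, start, end, sum(scores) / len(scores)))

    return merged
```

In this sketch, two overlapping segments for the same trajectory pair and predicate collapse into one longer instance, while a segment separated by a temporal gap stays a separate instance. A real system would likely also need a rule for resolving conflicting predicates on the same trajectory pair, which is left out here.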