Video Visual Relation Detection via Multi-modal Feature Fusion

Cited by: 32
Authors
Sun, Xu [1, 2]
Ren, Tongwei [1, 2]
Zi, Yuan [1 ]
Wu, Gangshan [1 ]
Affiliations
[1] Nanjing Univ, State Key Lab Novel Software Technol, Nanjing, Peoples R China
[2] Nanjing Univ, Shenzhen Res Inst, Shenzhen, Peoples R China
Funding
National Science Foundation (USA);
Keywords
Video visual relation detection; object trajectory detection; relation prediction;
DOI
10.1145/3343031.3356076
Chinese Library Classification (CLC) number
TP39 [Computer applications];
Discipline classification code
081203; 0835;
Abstract
Video visual relation detection is a meaningful research problem that aims to build a bridge between dynamic vision and language. In this paper, we propose a novel video visual relation detection method with multi-modal feature fusion. First, we densely detect objects in each frame with the state-of-the-art video object detection model, flow-guided feature aggregation (FGFA), and generate object trajectories by linking the temporally independent detections with Seq-NMS and the KCF tracker. Next, we break the relation candidates, i.e., co-occurring object trajectory pairs, into short-term segments and predict relations with spatio-temporal features and language context features. Finally, we greedily associate the short-term relation segments into complete relation instances. The experimental results show that our proposed method outperforms other methods by a large margin, which also earned us first place in the visual relation detection task of the Video Relation Understanding Challenge (VRU), ACM MM 2019.
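The final step of the abstract, associating short-term relation segments into complete relation instances, can be illustrated with a minimal sketch. It is not the authors' implementation: the `Segment` type, the `associate` function, the rule that segments with the same (subject, predicate, object) triplet are merged when they overlap or touch in time, and the averaging of scores are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import List, NamedTuple


class Segment(NamedTuple):
    """Hypothetical short-term relation prediction for one trajectory pair."""
    subject_id: int   # id of the subject trajectory
    predicate: str    # predicted relation label
    object_id: int    # id of the object trajectory
    start: int        # first frame (inclusive)
    end: int          # last frame (exclusive)
    score: float      # confidence of the prediction


def associate(segments: List[Segment]) -> List[Segment]:
    """Greedily merge same-triplet segments that overlap or touch in time."""
    groups = defaultdict(list)
    for seg in segments:
        groups[(seg.subject_id, seg.predicate, seg.object_id)].append(seg)

    instances = []
    for (subj, pred, obj), segs in groups.items():
        segs.sort(key=lambda s: s.start)
        start, end, scores = segs[0].start, segs[0].end, [segs[0].score]
        for seg in segs[1:]:
            if seg.start <= end:
                # Overlapping or adjacent: extend the current instance.
                end = max(end, seg.end)
                scores.append(seg.score)
            else:
                # Temporal gap: close the current instance, start a new one.
                instances.append(
                    Segment(subj, pred, obj, start, end, sum(scores) / len(scores)))
                start, end, scores = seg.start, seg.end, [seg.score]
        instances.append(
            Segment(subj, pred, obj, start, end, sum(scores) / len(scores)))

    # Rank merged instances by their (assumed) averaged confidence.
    return sorted(instances, key=lambda s: s.score, reverse=True)


example = [
    Segment(0, "ride", 1, 0, 30, 0.75),
    Segment(0, "ride", 1, 15, 45, 0.5),
    Segment(0, "ride", 1, 60, 90, 0.5),
    Segment(0, "next_to", 1, 0, 30, 0.25),
]
merged = associate(example)
```

In this example the two overlapping "ride" segments merge into one instance spanning frames 0 to 45, while the later "ride" segment stays separate because of the temporal gap.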
Pages: 2657-2661
Page count: 5