Local-Global Multi-Modal Distillation for Weakly-Supervised Temporal Video Grounding

Cited by: 0
Authors
Bao, Peijun [1 ]
Xia, Yong [2 ]
Yang, Wenhan [3 ]
Ng, Boon Poh [1 ]
Er, Meng Hwa [1 ]
Kot, Alex C. [1 ]
Affiliations
[1] Nanyang Technol Univ, Singapore, Singapore
[2] Northwestern Polytech Univ, Xian, Peoples R China
[3] Peng Cheng Lab, Shenzhen, Peoples R China
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This paper is the first to leverage multi-modal videos for weakly-supervised temporal video grounding. Because labeling video moments is labor-intensive and subjective, weakly-supervised approaches have gained increasing attention in recent years. However, these approaches inherently compromise performance due to inadequate supervision. To tackle this challenge, we exploit complementary information extracted from multi-modal videos (e.g., RGB frames, optical flow), which naturally introduces richer supervision into the weakly-supervised setting. Our motivation is that by integrating the different modalities of a video, the model learns from synergistic supervision and thereby attains superior generalization. However, processing multiple modalities inevitably incurs additional computational overhead, and becomes inapplicable if a particular modality is unavailable. To solve this issue, we adopt a novel route: we build a multi-modal distillation algorithm that capitalizes on multi-modal knowledge as supervision during training, while still operating on a single-modal input at inference. As such, we can exploit the complementary nature of multiple modalities without undermining applicability in practical scenarios. Specifically, we first propose a cross-modal mutual learning framework and train a teacher model that learns collaboratively from the multi-modal videos. We then identify two kinds of knowledge in the teacher model, i.e., temporal boundaries and a semantic activation map, and devise a local-global distillation algorithm to transfer this knowledge to a single-modal student model at both local and global levels. Extensive experiments on large-scale datasets demonstrate that our method achieves state-of-the-art performance both with and without multi-modal inputs.
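The abstract names two kinds of distilled knowledge: temporal boundaries (local) and a semantic activation map (global). The paper's actual losses are not given here, so the following is only a minimal sketch of how such a two-level distillation objective could be combined; all function names, the L1 boundary term, the KL activation term, and the weight `alpha` are illustrative assumptions, not the authors' formulation.

```python
import math

def softmax(scores):
    # Normalize raw per-snippet activation scores into a distribution
    # (assumed form of the "semantic activation map").
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def local_boundary_loss(teacher_bounds, student_bounds):
    # Local-level term (assumption): L1 distance between the teacher's
    # and the student's predicted (start, end) moment boundaries.
    (ts, te), (ss, se) = teacher_bounds, student_bounds
    return abs(ts - ss) + abs(te - se)

def global_activation_loss(teacher_scores, student_scores):
    # Global-level term (assumption): KL divergence from the teacher's
    # activation map to the student's, so the student mimics which
    # snippets the multi-modal teacher deems query-relevant.
    p = softmax(teacher_scores)
    q = softmax(student_scores)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def distillation_loss(teacher_bounds, student_bounds,
                      teacher_scores, student_scores, alpha=0.5):
    # Weighted sum of the local and global terms; alpha is a
    # hypothetical trade-off hyperparameter.
    return (alpha * local_boundary_loss(teacher_bounds, student_bounds)
            + (1 - alpha) * global_activation_loss(teacher_scores,
                                                   student_scores))
```

When the student exactly matches the teacher, both terms vanish, so the loss is zero; any disagreement in boundaries or in the activation map increases it. In the paper's setting the teacher would consume multi-modal inputs while the student sees only one modality, so only the student is needed at inference time.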
Pages: 738 - 746 (9 pages)