Local-Global Multi-Modal Distillation for Weakly-Supervised Temporal Video Grounding

Cited by: 0
Authors
Bao, Peijun [1 ]
Xia, Yong [2 ]
Yang, Wenhan [3 ]
Ng, Boon Poh [1 ]
Er, Meng Hwa [1 ]
Kot, Alex C. [1 ]
Affiliations
[1] Nanyang Technol Univ, Singapore, Singapore
[2] Northwestern Polytech Univ, Xian, Peoples R China
[3] Peng Cheng Lab, Shenzhen, Peoples R China
Keywords
DOI: Not available
Chinese Library Classification: TP18 [Artificial Intelligence Theory]
Subject Classification Codes: 081104; 0812; 0835; 1405
Abstract
This paper is the first to leverage multi-modal videos for weakly-supervised temporal video grounding. Because labeling video moments is labor-intensive and subjective, weakly-supervised approaches have gained increasing attention in recent years. However, these approaches inherently risk degraded performance due to inadequate supervision. To tackle this challenge, we are the first to exploit complementary information extracted from multiple video modalities (e.g., RGB frames and optical flow), which naturally introduces richer supervision in the weakly-supervised setting. Our motivation is that by integrating different modalities of the videos, the model learns from synergistic supervision and can thereby attain better generalization. However, processing multiple modalities inevitably adds computational overhead, and the model becomes inapplicable if a particular modality is unavailable. To solve this issue, we adopt a novel route: we build a multi-modal distillation algorithm that capitalizes on multi-modal knowledge as supervision during training, while the resulting model still works with only single-modal input at inference. In this way, we benefit from the complementary nature of multiple modalities without undermining applicability in practical scenarios. Specifically, we first propose a cross-modal mutual learning framework and train a teacher model that learns collaboratively from the multi-modal videos. We then identify two types of knowledge in the teacher model, namely temporal boundaries and semantic activation maps, and devise a local-global distillation algorithm that transfers this knowledge to a single-modal student model at both local and global levels. Extensive experiments on large-scale datasets demonstrate that our method achieves state-of-the-art performance with and without multi-modal inputs.
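To ground the distillation objective the abstract describes, here is a minimal PyTorch-style sketch of a combined local-global distillation loss, reconstructed from the abstract alone. The function name, the smooth-L1 choice for the local (boundary) term, the KL-divergence choice for the global (activation-map) term, and the weights alpha/beta are all illustrative assumptions, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def local_global_distill_loss(
    student_boundaries,  # (B, 2) student's predicted (start, end), normalized to [0, 1]
    teacher_boundaries,  # (B, 2) pseudo temporal boundaries from the multi-modal teacher
    student_activation,  # (B, T) student's semantic activation logits over T snippets
    teacher_activation,  # (B, T) teacher's semantic activation logits
    alpha=1.0,           # hypothetical weight of the local (boundary) term
    beta=1.0,            # hypothetical weight of the global (activation-map) term
):
    # The teacher only supplies targets; detach so gradients update the student alone.
    teacher_boundaries = teacher_boundaries.detach()
    teacher_activation = teacher_activation.detach()

    # Local level: pull the student's temporal boundaries toward the
    # teacher's pseudo boundaries (smooth-L1 as a stand-in regression loss).
    local_loss = F.smooth_l1_loss(student_boundaries, teacher_boundaries)

    # Global level: align the student's semantic activation map with the
    # teacher's over the whole video via KL divergence on the temporal axis.
    global_loss = F.kl_div(
        F.log_softmax(student_activation, dim=-1),
        F.softmax(teacher_activation, dim=-1),
        reduction="batchmean",
    )
    return alpha * local_loss + beta * global_loss

# Usage on dummy data: a batch of 4 videos with 64 temporal snippets each.
loss = local_global_distill_loss(
    torch.rand(4, 2), torch.rand(4, 2),
    torch.randn(4, 64), torch.randn(4, 64),
)
```

This only illustrates the two kinds of transferred knowledge named in the abstract (boundaries and activation maps); the paper's actual losses, teacher training, and weighting may differ.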
Pages: 738-746 (9 pages)