Local-Global Multi-Modal Distillation for Weakly-Supervised Temporal Video Grounding

Cited by: 0
Authors
Bao, Peijun [1 ]
Xia, Yong [2 ]
Yang, Wenhan [3 ]
Ng, Boon Poh [1 ]
Er, Meng Hwa [1 ]
Kot, Alex C. [1 ]
Affiliations
[1] Nanyang Technol Univ, Singapore, Singapore
[2] Northwestern Polytech Univ, Xian, Peoples R China
[3] Peng Cheng Lab, Shenzhen, Peoples R China
Keywords
DOI
Not available
CLC number
TP18 [Artificial intelligence theory]
Discipline codes
081104 ; 0812 ; 0835 ; 1405
Abstract
This paper is the first to leverage multi-modal videos for weakly-supervised temporal video grounding. Because labeling video moments is labor-intensive and subjective, weakly-supervised approaches have gained increasing attention in recent years. However, these approaches inherently compromise performance due to inadequate supervision. To tackle this challenge, we are the first to exploit complementary information extracted from multi-modal videos (e.g., RGB frames and optical flows), which naturally introduces richer supervision in the weakly-supervised setting. Our motivation is that by integrating different modalities of the videos, the model learns from synergistic supervision and thereby attains superior generalization capability. However, processing multiple modalities inevitably introduces additional computational overhead and becomes inapplicable if a particular modality is inaccessible. To solve this issue, we adopt a novel route: we build a multi-modal distillation algorithm that capitalizes on multi-modal knowledge as supervision for training, while still working with only single-modal input during inference. As such, we can exploit the supplementary nature of multiple modalities without undermining applicability in practical scenarios. Specifically, we first propose a cross-modal mutual learning framework and train a sophisticated teacher model that learns collaboratively from the multi-modal videos. We then identify two types of knowledge from the teacher model, i.e., temporal boundaries and the semantic activation map, and devise a local-global distillation algorithm that transfers this knowledge to a student model with single-modal input at both local and global levels. Extensive experiments on large-scale datasets demonstrate that our method achieves state-of-the-art performance with and without multi-modal inputs.
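The local-global distillation objective described in the abstract can be sketched minimally as follows. The teacher predictions come from the multi-modal (RGB + optical-flow) model and the student sees a single modality; the function names, the smooth-L1 and KL loss choices, and the weights `alpha`/`beta` are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def local_boundary_loss(teacher_bounds, student_bounds):
    # "Local" knowledge: temporal boundaries (start, end) predicted by the
    # teacher and regressed by the student (smooth L1 used here as a stand-in).
    diff = np.abs(np.asarray(teacher_bounds) - np.asarray(student_bounds))
    return float(np.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5).mean())

def global_activation_loss(teacher_logits, student_logits):
    # "Global" knowledge: the semantic activation map over all video clips,
    # matched with a KL divergence between teacher and student distributions.
    p, q = softmax(teacher_logits), softmax(student_logits)
    return float(np.sum(p * (np.log(p + 1e-8) - np.log(q + 1e-8))))

def distillation_loss(t_bounds, s_bounds, t_map, s_map, alpha=1.0, beta=1.0):
    # Combined objective: transfer both levels of teacher knowledge.
    return (alpha * local_boundary_loss(t_bounds, s_bounds)
            + beta * global_activation_loss(t_map, s_map))
```

When teacher and student agree exactly, both terms vanish; any disagreement in either the predicted boundaries or the activation map yields a positive loss, which is the gradient signal the single-modal student trains on.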
Pages: 738 - 746
Page count: 9
Related papers
50 records in total
  • [1] Weakly-Supervised Spatio-Temporal Video Grounding with Variational Cross-Modal Alignment
    Jin, Yang
    Mu, Yadong
    COMPUTER VISION - ECCV 2024, PT XLVIII, 2025, 15106 : 412 - 429
  • [2] Multi-modal Grouping Network for Weakly-Supervised Audio-Visual Video Parsing
    Mo, Shentong
    Tian, Yapeng
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022,
  • [3] Rethinking Weakly-Supervised Video Temporal Grounding From a Game Perspective
    Fang, Xiang
    Xiong, Zeyu
    Fang, Wanlong
    Qu, Xiaoye
    Chen, Chen
    Dong, Jianfeng
    Tang, Keke
    Zhou, Pan
    Cheng, Yu
    Liu, Daizong
    COMPUTER VISION - ECCV 2024, PT XLV, 2025, 15103 : 290 - 311
  • [4] Weakly-Supervised Video Object Grounding by Exploring Spatio-Temporal Contexts
    Yang, Xun
    Liu, Xueliang
    Jian, Meng
    Gao, Xinjian
    Wang, Meng
    MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, 2020, : 1939 - 1947
  • [5] Weakly-Supervised Video Object Grounding via Learning Uni-Modal Associations
    Wang, Wei
    Gao, Junyu
    Xu, Changsheng
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 6329 - 6340
  • [6] Iterative Proposal Refinement for Weakly-Supervised Video Grounding
    School of Electronic and Computer Engineering, Peking University, China
    (remaining affiliations unknown)
    PROCEEDINGS OF THE IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, : 6524 - 6534
  • [7] Flow Graph to Video Grounding for Weakly-Supervised Multi-step Localization
    Dvornik, Nikita
    Hadji, Isma
    Pham, Hai
    Bhatt, Dhaivat
    Martinez, Brais
    Fazly, Afsaneh
    Jepson, Allan D.
    COMPUTER VISION - ECCV 2022, PT XXXV, 2022, 13695 : 319 - 335
  • [8] WINNER: Weakly-supervised hIerarchical decompositioN and aligNment for spatio-tEmporal video gRounding
    Li, Mengze
    Wang, Han
    Zhang, Wenqiao
    Miao, Jiaxu
    Zhao, Zhou
    Zhang, Shengyu
    Ji, Wei
    Wu, Fei
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 23090 - 23099
  • [9] End-to-end Multi-modal Video Temporal Grounding
    Chen, Yi-Wen
    Tsai, Yi-Hsuan
    Yang, Ming-Hsuan
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [10] A weakly-supervised deep domain adaptation method for multi-modal sensor data
    Mihailescu, Radu-Casian
    2021 IEEE GLOBAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND INTERNET OF THINGS (GCAIOT), 2021, : 45 - 50