Local-Global Multi-Modal Distillation for Weakly-Supervised Temporal Video Grounding

Times Cited: 0
Authors
Bao, Peijun [1 ]
Xia, Yong [2 ]
Yang, Wenhan [3 ]
Ng, Boon Poh [1 ]
Er, Meng Hwa [1 ]
Kot, Alex C. [1 ]
Affiliations
[1] Nanyang Technol Univ, Singapore, Singapore
[2] Northwestern Polytech Univ, Xian, Peoples R China
[3] Peng Cheng Lab, Shenzhen, Peoples R China
Keywords
DOI
Not available
CLC Number
TP18 [Theory of Artificial Intelligence]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
This paper is the first to leverage multi-modal videos for weakly-supervised temporal video grounding. Because labeling video moments is labor-intensive and subjective, weakly-supervised approaches have gained increasing attention in recent years. However, these approaches inherently compromise performance due to inadequate supervision. To tackle this challenge, we exploit complementary information extracted from multi-modal videos (e.g., RGB frames, optical flow), which naturally introduces richer supervision in the weakly-supervised setting. Our motivation is that by integrating different modalities of the videos, the model learns from synergistic supervision and thereby attains superior generalization capability. However, processing multiple modalities inevitably introduces additional computational overhead, and becomes inapplicable if a particular modality is unavailable. To solve this issue, we adopt a novel route: a multi-modal distillation algorithm that capitalizes on multi-modal knowledge as supervision during training, while still operating on only a single-modal input at inference. As such, we benefit from the complementary nature of multiple modalities without undermining applicability in practical scenarios. Specifically, we first propose a cross-modal mutual learning framework and train a sophisticated teacher model that learns collaboratively from the multi-modal videos. We then identify two kinds of knowledge in the teacher model, namely temporal boundaries and the semantic activation map, and devise a local-global distillation algorithm to transfer this knowledge to a single-modal student model at both local and global levels. Extensive experiments on large-scale datasets demonstrate that our method achieves state-of-the-art performance with or without multi-modal inputs.
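The abstract describes distilling two kinds of teacher knowledge into a single-modal student: temporal boundaries (a local signal) and a semantic activation map (a global signal). The sketch below illustrates one plausible form such an objective could take — an L1 term on predicted boundaries plus a temperature-scaled KL term on activation maps, in the style of standard knowledge distillation. All function names, the loss weighting, and the specific loss choices here are assumptions for illustration, not the paper's actual formulation.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def local_boundary_loss(student_bounds, teacher_bounds):
    # Local-level distillation (assumed L1): the student regresses
    # toward the teacher's predicted (start, end) moment boundaries.
    return float(np.abs(student_bounds - teacher_bounds).mean())

def global_activation_loss(student_logits, teacher_logits, tau=2.0):
    # Global-level distillation (assumed KL): the student's semantic
    # activation map matches the teacher's softened distribution.
    p = softmax(teacher_logits / tau)
    q = softmax(student_logits / tau)
    return float((p * (np.log(p) - np.log(q))).sum() * tau * tau)

def distillation_loss(student_bounds, teacher_bounds,
                      student_map, teacher_map,
                      w_local=1.0, w_global=1.0):
    # Combined local-global objective with hypothetical weights.
    return (w_local * local_boundary_loss(student_bounds, teacher_bounds)
            + w_global * global_activation_loss(student_map, teacher_map))
```

When the student reproduces the teacher exactly, both terms vanish; any boundary or activation-map disagreement yields a positive loss, giving the student a training signal even though it sees only one modality at inference time.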
Pages: 738-746 (9 pages)