Local-Global Multi-Modal Distillation for Weakly-Supervised Temporal Video Grounding

Cited by: 0
Authors
Bao, Peijun [1 ]
Xia, Yong [2 ]
Yang, Wenhan [3 ]
Ng, Boon Poh [1 ]
Er, Meng Hwa [1 ]
Kot, Alex C. [1 ]
Affiliations
[1] Nanyang Technol Univ, Singapore, Singapore
[2] Northwestern Polytech Univ, Xian, Peoples R China
[3] Peng Cheng Lab, Shenzhen, Peoples R China
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
This paper is the first to leverage multi-modal videos for weakly-supervised temporal video grounding. Because labeling video moments is labor-intensive and subjective, weakly-supervised approaches have gained increasing attention in recent years. However, these approaches inherently compromise performance due to inadequate supervision. To tackle this challenge, we are the first to exploit complementary information from multi-modal videos (e.g., RGB frames and optical flow), which naturally introduces richer supervision into the weakly-supervised setting. Our motivation is that by integrating the different modalities of a video, the model learns from synergistic supervision and thereby attains superior generalization. However, processing multiple modalities inevitably incurs additional computational overhead and becomes inapplicable if a particular modality is unavailable. To solve this issue, we adopt a novel route: a multi-modal distillation algorithm that capitalizes on multi-modal knowledge as supervision during training, while requiring only a single-modal input at inference. We thus benefit from the complementary nature of multiple modalities without undermining applicability in practical scenarios. Specifically, we first propose a cross-modal mutual learning framework and train a sophisticated teacher model that learns collaboratively from the multi-modal videos. We then identify two kinds of knowledge in the teacher model, namely temporal boundaries and a semantic activation map, and devise a local-global distillation algorithm that transfers this knowledge to a single-modal student model at both the local and global levels. Extensive experiments on large-scale datasets demonstrate that our method achieves state-of-the-art performance with and without multi-modal inputs.
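The abstract describes the method only at a high level; the Python (PyTorch) sketch below illustrates one plausible reading of the local-global distillation objective it mentions, not the authors' implementation. All function, tensor, and weight names here are assumptions for illustration: the teacher's temporal boundaries are distilled at the local level with a regression loss, and its semantic activation map is distilled at the global level with a softened KL divergence.

# Minimal sketch (assumed, not the paper's code) of a local-global distillation
# objective: a multi-modal teacher supplies temporal boundaries (local knowledge)
# and a semantic activation map (global knowledge) to supervise a single-modal
# (e.g., RGB-only) student.
import torch
import torch.nn.functional as F


def local_boundary_distillation(student_bounds, teacher_bounds):
    # Local level: regress the student's predicted (start, end) boundaries toward
    # the teacher's pseudo boundaries. Shapes: (batch, 2), normalized to [0, 1].
    return F.smooth_l1_loss(student_bounds, teacher_bounds.detach())


def global_activation_distillation(student_map, teacher_map, tau=2.0):
    # Global level: align the student's frame-level semantic activation map with
    # the teacher's via a temperature-softened KL divergence over the temporal
    # axis. Shapes: (batch, num_frames).
    log_p_student = F.log_softmax(student_map / tau, dim=-1)
    p_teacher = F.softmax(teacher_map.detach() / tau, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * tau * tau


def distillation_loss(student_out, teacher_out, w_local=1.0, w_global=1.0):
    # Combine both levels; `student_out` / `teacher_out` are dicts holding the
    # 'bounds' and 'activation' tensors produced by the respective models.
    loss_local = local_boundary_distillation(student_out["bounds"], teacher_out["bounds"])
    loss_global = global_activation_distillation(student_out["activation"], teacher_out["activation"])
    return w_local * loss_local + w_global * loss_global


if __name__ == "__main__":
    # Random tensors stand in for model outputs in this usage example.
    teacher_out = {"bounds": torch.rand(4, 2), "activation": torch.randn(4, 128)}
    student_out = {"bounds": torch.rand(4, 2), "activation": torch.randn(4, 128)}
    print(distillation_loss(student_out, teacher_out))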
Pages: 738 - 746
Number of pages: 9
Related Papers
50 records in total
  • [31] Leaky Gated Cross-Attention for Weakly Supervised Multi-Modal Temporal Action Localization
    Lee, Jun-Tae
    Yun, Sungrack
    Jain, Mihir
    2022 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV 2022), 2022, : 817 - 826
  • [32] Semi-supervised Grounding Alignment for Multi-modal Feature Learning
    Chou, Shih-Han
    Fan, Zicong
    Little, James J.
    Sigal, Leonid
    2022 19TH CONFERENCE ON ROBOTS AND VISION (CRV 2022), 2022, : 48 - 57
  • [33] Weakly Supervised Local-Global Relation Network for Facial Expression Recognition
    Zhang, Haifeng
    Su, Wen
    Yu, Jun
    Wang, Zengfu
    PROCEEDINGS OF THE TWENTY-NINTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2020, : 1040 - 1046
  • [34] Coupling Global Context and Local Contents for Weakly-Supervised Semantic Segmentation
    Wang, Chunyan
    Zhang, Dong
    Zhang, Liyan
    Tang, Jinhui
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2024, 35 (10) : 13483 - 13495
  • [35] Weakly-supervised semantic segmentation with superpixel guided local and global consistency
    Yi, Sheng
    Ma, Huimin
    Wang, Xiang
    Hu, Tianyu
    Li, Xi
    Wang, Yu
    PATTERN RECOGNITION, 2022, 124
  • [36] Not All Frames Are Equal: Weakly-Supervised Video Grounding with Contextual Similarity and Visual Clustering Losses
    Shi, Jing
    Xu, Jia
    Gong, Boqing
    Xu, Chenliang
    2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 10436 - 10444
  • [38] Coupling Global Context and Local Contents for Weakly-Supervised Semantic Segmentation
    Wang, Chunyan
    Zhang, Dong
    Zhang, Liyan
    Tang, Jinhui
    arXiv, 2023,
  • [39] Weakly-supervised video anomaly detection via temporal resolution feature learning
    Peng, Shengjun
    Cai, Yiheng
    Yao, Zijun
    Tan, Meiling
    APPLIED INTELLIGENCE, 2023, 53 (24) : 30607 - 30625
  • [40] Weakly Supervised Local-Global Attention Network for Facial Expression Recognition
    Zhang, Haifeng
    Su, Wen
    Wang, Zengfu
    IEEE ACCESS, 2020, 8 (08): : 37976 - 37987