CLIP-It! Language-Guided Video Summarization

Cited by: 0
Authors
Narasimhan, Medhini [1]
Rohrbach, Anna [1]
Darrell, Trevor [1]
Affiliations
[1] Univ Calif Berkeley, Berkeley, CA 94720 USA
DOI
Not available
CLC classification
TP18 [Artificial Intelligence Theory]
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
A generic video summary is an abridged version of a video that conveys the whole story and features the most important scenes. Yet the importance of scenes in a video is often subjective, and users should have the option of customizing the summary by using natural language to specify what is important to them. Further, existing models for fully automatic generic summarization have not exploited available language models, which can serve as an effective prior for saliency. This work introduces CLIP-It, a single framework for addressing both generic and query-focused video summarization, which are typically approached separately in the literature. We propose a language-guided multimodal transformer that learns to score frames in a video based on their importance relative to one another and their correlation with a user-defined query (for query-focused summarization) or an automatically generated dense video caption (for generic video summarization). Our model can be extended to the unsupervised setting by training without ground-truth supervision. We outperform baselines and prior work by a significant margin on both standard video summarization datasets (TVSum and SumMe) and a query-focused video summarization dataset (QFVS). In particular, we achieve large improvements in the transfer setting, attesting to our method's strong generalization capabilities.
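The core idea the abstract describes (scoring frames by their correlation with a language query and keeping the highest-scoring ones) can be illustrated with a minimal sketch. This is not the paper's actual multimodal transformer: it uses plain cosine similarity, and random vectors stand in for CLIP image/text embeddings; the function names are illustrative, not from the authors' code.

```python
import numpy as np

def score_frames(frame_feats: np.ndarray, query_feat: np.ndarray) -> np.ndarray:
    """Score each frame by cosine similarity to a language-query embedding.

    frame_feats: (num_frames, dim) frame embeddings (e.g. from a CLIP image encoder)
    query_feat:  (dim,) embedding of the user query or generated caption
    Returns a (num_frames,) array of relevance scores in [-1, 1].
    """
    frames = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    query = query_feat / np.linalg.norm(query_feat)
    return frames @ query

def select_summary(scores: np.ndarray, budget: int) -> np.ndarray:
    """Return indices of the top-`budget` scoring frames, in temporal order."""
    top = np.argsort(scores)[-budget:]
    return np.sort(top)

# Toy example: random vectors stand in for real CLIP features.
rng = np.random.default_rng(0)
feats = rng.normal(size=(10, 512))
query = rng.normal(size=512)
scores = score_frames(feats, query)
summary = select_summary(scores, budget=3)
```

The actual model additionally lets frames attend to one another (scoring "importance relative to one another"), which simple per-frame similarity does not capture.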
Pages: 13
Related papers
50 records in total; items [41]-[50] shown
  • [41] Language-Guided Transformer for Federated Multi-Label Classification
    Liu, I-Jieh
    Lin, Ci-Siang
    Yang, Fu-En
    Wang, Yu-Chiang Frank
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 12, 2024, : 13882 - 13890
  • [42] LucIE: Language-guided local image editing for fashion images
    Wen, Huanglu
    You, Shaodi
    Fu, Ying
    COMPUTATIONAL VISUAL MEDIA, 2025, 11 (01): : 179 - 194
  • [43] Language-Guided Progressive Attention for Visual Grounding in Remote Sensing Images
    Li, Ke
    Wang, Di
    Xu, Haojie
    Zhong, Haodi
    Wang, Cong
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2024, 62 : 1 - 1
  • [44] LANDMARK: language-guided representation enhancement framework for scene graph generation
    Chang, Xiaoguang
    Wang, Teng
    Cai, Shaowei
    Sun, Changyin
    Applied Intelligence, 2023, 53 : 26126 - 26138
  • [45] LPN: Language-Guided Prototypical Network for Few-Shot Classification
    Cheng, Kaihui
    Yang, Chule
    Liu, Xiao
    Guan, Naiyang
    Wang, Zhiyuan
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2025, 35 (01) : 632 - 642
  • [46] Language-Guided Face Animation by Recurrent StyleGAN-Based Generator
    Hang, Tiankai
    Yang, Huan
    Liu, Bei
    Fu, Jianlong
    Geng, Xin
    Guo, Baining
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 9216 - 9227
  • [47] DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting
    Rao, Yongming
    Zhao, Wenliang
    Chen, Guangyi
    Tang, Yansong
    Zhu, Zheng
    Huang, Guan
    Zhou, Jie
    Lu, Jiwen
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 18061 - 18070
  • [48] A language-guided cross-modal semantic fusion retrieval method
    Zhu, Ligu
    Zhou, Fei
    Wang, Suping
    Shi, Lei
    Kou, Feifei
    Li, Zeyu
    Zhou, Pengpeng
    SIGNAL PROCESSING, 2025, 234
  • [49] Language-guided Semantic Mapping and Mobile Manipulation in Partially Observable Environments
    Patki, Siddharth
    Fahnestock, Ethan
    Howard, Thomas M.
    Walter, Matthew R.
    CONFERENCE ON ROBOT LEARNING, VOL 100, 2019, 100
  • [50] LASO: Language-guided Affordance Segmentation on 3D Object
    Li, Yicong
    Zhao, Na
    Xiao, Junbin
    Feng, Chun
    Wang, Xiang
    Chua, Tat-seng
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024, : 14251 - 14260