CLIP-It! Language-Guided Video Summarization

Authors
Narasimhan, Medhini [1]
Rohrbach, Anna [1]
Darrell, Trevor [1]
Affiliations
[1] Univ Calif Berkeley, Berkeley, CA 94720 USA
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
A generic video summary is an abridged version of a video that conveys the whole story and features the most important scenes. Yet the importance of scenes in a video is often subjective, and users should have the option of customizing the summary by using natural language to specify what is important to them. Further, existing models for fully automatic generic summarization have not exploited available language models, which can serve as an effective prior for saliency. This work introduces CLIP-It, a single framework for addressing both generic and query-focused video summarization, typically approached separately in the literature. We propose a language-guided multimodal transformer that learns to score frames in a video based on their importance relative to one another and their correlation with a user-defined query (for query-focused summarization) or an automatically generated dense video caption (for generic video summarization). Our model can be extended to the unsupervised setting by training without ground-truth supervision. We outperform baselines and prior work by a significant margin on both standard video summarization datasets (TVSum and SumMe) and a query-focused video summarization dataset (QFVS). Particularly, we achieve large improvements in the transfer setting, attesting to our method's strong generalization capabilities.
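The frame-scoring idea in the abstract can be illustrated with a toy sketch. This is not the CLIP-It model: the abstract describes a learned multimodal transformer, whereas the code below swaps in plain cosine similarity between hand-made frame vectors and a query vector, followed by a softmax so that each score is relative to the other frames. The function names (`cosine`, `score_frames`, `summarize`), the two-dimensional embeddings, and the top-k selection rule are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def score_frames(frame_embs, text_emb):
    """Softmax over frame-to-text similarities, so scores are relative across frames.

    Assumption: a stand-in for a learned scorer, not the paper's transformer.
    Expects at least one frame embedding.
    """
    sims = [cosine(f, text_emb) for f in frame_embs]
    peak = max(sims)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]


def summarize(frame_embs, text_emb, k):
    """Return the indices of the k highest-scoring frames, in temporal order."""
    scores = score_frames(frame_embs, text_emb)
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


# Hypothetical embeddings: four frames and one query in a shared 2-D space.
frames = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0], [0.9, 0.1]]
query = [1.0, 0.0]
selected = summarize(frames, query, k=2)
print(selected)
```

In this sketch the text vector would stand for a user query in the query-focused setting, or for an embedding of an automatically generated caption in the generic setting; how those embeddings are produced is outside what the abstract specifies.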
Pages: 13