A Reinforcement Learning-Based Automatic Video Editing Method Using Pre-trained Vision-Language Model

Cited by: 0
Authors:
Hu, Panwen [1 ]
Xiao, Nan [1 ]
Li, Feifei [1 ]
Chen, Yongquan [2 ]
Huang, Rui [1 ]
Affiliations:
[1] Chinese Univ Hong Kong, SSE, Shenzhen, Peoples R China
[2] Chinese Univ Hong Kong, AIRS, Shenzhen, Peoples R China
Keywords:
video editing; video representation; reinforcement learning; broadcast; capture; film
DOI: 10.1145/3581783.3611878
CLC number: TP18 [Theory of Artificial Intelligence]
Discipline codes: 081104; 0812; 0835; 1405
Abstract
In this era of video, automatic video editing techniques are attracting increasing attention from industry and academia, since they can reduce workloads and lower the skill requirements for human editors. Existing automatic editing systems are mainly scene- or event-specific (e.g., soccer-game broadcasting), whereas automatic systems for general editing (e.g., movie or vlog editing, which spans diverse scenes and events) have rarely been studied, and converting an event-driven editing method to the general setting is nontrivial. In this paper, we propose a two-stage scheme for general editing. First, unlike previous works that extract scene-specific features, we leverage a pre-trained Vision-Language Model (VLM) to extract editing-relevant representations as the editing context. Moreover, to close the gap between professional-looking videos and automatic productions generated with simple guidelines, we propose a Reinforcement Learning (RL)-based editing framework to formulate the editing problem and train a virtual editor to make better sequential editing decisions. Finally, we evaluate the proposed method on a more general editing task with a real movie dataset. Experimental results demonstrate the effectiveness and benefits of the proposed context representation and the learning ability of our RL-based editing framework.
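The two-stage scheme the abstract describes (VLM-derived editing context feeding an RL-trained "virtual editor" that makes sequential shot decisions) can be sketched at toy scale. This is a minimal, hedged illustration, not the paper's implementation: random vectors stand in for the pre-trained VLM shot embeddings, a linear softmax policy plays the virtual editor, and REINFORCE with a constant baseline trains it against a made-up reward (preferring edits that open with a particular "establishing" shot). All names and the reward are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for editing-context embeddings of candidate shots. In the paper's
# setting these would come from a pre-trained Vision-Language Model; here they
# are random vectors.
n_shots, dim = 6, 8
shot_feats = rng.normal(size=(n_shots, dim))

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def rollout(w, length=3):
    """Sample an edit (a sequence of distinct shots) from the linear policy,
    collecting the REINFORCE gradient d/dw log pi(a|s) at each step."""
    chosen, grads, available = [], [], list(range(n_shots))
    for _ in range(length):
        feats = shot_feats[available]
        p = softmax(feats @ w)
        i = rng.choice(len(available), p=p)
        grads.append(feats[i] - p @ feats)  # grad of log-softmax w.r.t. w
        chosen.append(available.pop(i))
    return chosen, grads

def toy_reward(seq):
    """Illustrative stand-in for an editing-quality reward: +1 if the edit
    opens with shot 0 (pretend it is an establishing shot)."""
    return 1.0 if seq[0] == 0 else 0.0

# REINFORCE: nudge the virtual editor toward higher-reward edit sequences.
w, lr = np.zeros(dim), 0.5
for _ in range(300):
    seq, grads = rollout(w)
    advantage = toy_reward(seq) - 0.5  # constant baseline reduces variance
    for g in grads:
        w += lr * advantage * g
```

The design point mirrors the abstract's argument: the policy never sees raw pixels, only the (here fake) context embeddings, and the sequential-decision structure is what the RL formulation buys over fixed editing guidelines.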
Pages: 6441-6450 (10 pages)