A Reinforcement Learning-Based Automatic Video Editing Method Using Pre-trained Vision-Language Model

Cited by: 0
Authors
Hu, Panwen [1 ]
Xiao, Nan [1 ]
Li, Feifei [1 ]
Chen, Yongquan [2 ]
Huang, Rui [1 ]
Affiliations
[1] Chinese Univ Hong Kong, SSE, Shenzhen, Peoples R China
[2] Chinese Univ Hong Kong, AIRS, Shenzhen, Peoples R China
Keywords
video editing; video representation; reinforcement learning; BROADCAST; CAPTURE; FILM;
DOI
10.1145/3581783.3611878
CLC classification
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
In this era of video, automatic video editing techniques are attracting increasing attention from industry and academia because they reduce workloads and lower the skill requirements for human editors. Existing automatic editing systems are mainly scene- or event-specific, e.g., soccer game broadcasting, whereas automatic systems for general editing, e.g., movie or vlog editing covering various scenes and events, have rarely been studied, and converting an event-driven editing method to a general scene is nontrivial. In this paper, we propose a two-stage scheme for general editing. First, unlike previous works that extract scene-specific features, we leverage a pre-trained Vision-Language Model (VLM) to extract editing-relevant representations as the editing context. Moreover, to close the gap between professional-looking videos and automatic productions generated with simple guidelines, we propose a Reinforcement Learning (RL)-based editing framework that formulates the editing problem and trains a virtual editor to make better sequential editing decisions. Finally, we evaluate the proposed method on a more general editing task with a real movie dataset. Experimental results demonstrate the effectiveness and benefits of the proposed context representation and the learning ability of our RL-based editing framework.
Pages: 6441-6450
Page count: 10
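The abstract outlines a two-stage pipeline: a frozen pre-trained VLM embeds candidate shots into an editing context, and an RL-trained virtual editor selects shots sequentially. The snippet below is a minimal illustrative sketch of that idea, not the paper's implementation: it assumes CLIP (via Hugging Face transformers) as the VLM, a plain REINFORCE update as the RL algorithm, and a hypothetical `EditingEnv` that supplies candidate shots and an editing reward.

```python
# Sketch of the two-stage idea described in the abstract (illustrative only).
# Assumptions not taken from the paper: CLIP as the VLM backbone, a bilinear
# scoring policy, REINFORCE as the RL update, and the EditingEnv interface.
import torch
import torch.nn as nn
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def embed_shots(frames):
    """Embed one representative frame per candidate shot with the frozen VLM."""
    inputs = processor(images=frames, return_tensors="pt")
    with torch.no_grad():
        return clip.get_image_features(**inputs)  # (num_shots, 512)


class VirtualEditor(nn.Module):
    """Policy over candidate shots, conditioned on the current editing context."""

    def __init__(self, dim=512):
        super().__init__()
        self.score = nn.Bilinear(dim, dim, 1)  # context-candidate affinity

    def forward(self, context, candidates):
        # context: (dim,), candidates: (num_shots, dim) -> logits over shots
        ctx = context.expand_as(candidates)
        return self.score(ctx, candidates).squeeze(-1)


def reinforce_step(editor, optimizer, env):
    """One episode of REINFORCE: pick the next shot at each step, then update.

    `env` is a hypothetical EditingEnv whose reset()/step() return the current
    editing context, the candidate-shot embeddings, a scalar reward, and a
    done flag; its reward design is where the paper's editing objective lives.
    """
    context, candidates = env.reset()
    log_probs, rewards = [], []
    done = False
    while not done:
        logits = editor(context, candidates)
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        (context, candidates), reward, done = env.step(action.item())
        rewards.append(reward)
    # Undiscounted return-to-go for each decision step.
    returns = torch.tensor(rewards).flip(0).cumsum(0).flip(0)
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The sketch only shows how frozen VLM embeddings can serve as the state for a sequential shot-selection policy; the paper's actual state, action space, and reward follow the authors' own editing formulation.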