Shotluck Holmes: A Family of Efficient Small-Scale Large Language Vision Models for Video Captioning and Summarization

Authors
Luo, Richard [1]
Peng, Austin [1]
Vasudev, Adithya [1]
Jain, Rishabh [1]
Affiliations
[1] Georgia Inst Technol, Atlanta, GA 30332 USA
Keywords
Deep Learning; Multimodal Models; Large Language Models; Machine Learning; Natural Language Processing; Vision; Vision-Language Models
DOI
10.1145/3689091.3690086
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Video is an increasingly prominent and information-dense medium, yet it poses substantial challenges for language models. A typical video consists of a sequence of shorter segments, or shots, that collectively form a coherent narrative. Each shot is analogous to a word in a sentence, where multiple data streams of information (such as visual and auditory data) must be processed simultaneously. Comprehending the entire video requires not only understanding the visual-audio information of each shot but also linking the ideas across shots to generate a larger, all-encompassing story. Despite significant progress in the field, current works often overlook videos' more granular shot-by-shot semantic information. In this project, we propose Shotluck Holmes, a family of efficient large language vision models (LLVMs) for video summarization and captioning. By leveraging better pretraining and data collection strategies, we extend the abilities of existing small LLVMs from understanding a single picture to understanding a sequence of frames. Specifically, we show that Shotluck Holmes outperforms state-of-the-art results on the Shot2Story video captioning and summarization tasks with significantly smaller and more computationally efficient models.
Pages: 7-11
Number of pages: 5