Shotluck Holmes: A Family of Efficient Small-Scale Large Language Vision Models for Video Captioning and Summarization

Authors
Luo, Richard [1]
Peng, Austin [1]
Vasudev, Adithya [1]
Jain, Rishabh [1]
Affiliations
[1] Georgia Inst Technol, Atlanta, GA 30332 USA
Keywords
Deep Learning; Multimodal Models; Large Language Models; Machine Learning; Natural Language Processing; Vision; Vision-Language Models
DOI
10.1145/3689091.3690086
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Video is an increasingly prominent and information-dense medium, yet it poses substantial challenges for language models. A typical video consists of a sequence of shorter segments, or shots, that collectively form a coherent narrative. Each shot is analogous to a word in a sentence where multiple streams of information (such as visual and auditory data) must be processed simultaneously. Comprehending the entire video requires not only understanding the visual-audio information of each shot but also linking ideas across shots to form a larger, all-encompassing story. Despite significant progress in the field, current work often overlooks the more granular shot-by-shot semantic information in videos. In this project, we propose Shotluck Holmes, a family of efficient large language vision models (LLVMs) to boost video summarization and captioning. By leveraging better pretraining and data collection strategies, we extend the abilities of existing small LLVMs from understanding a picture to understanding a sequence of frames. Specifically, we show that Shotluck Holmes outperforms state-of-the-art results on the Shot2Story video captioning and summarization tasks with significantly smaller and more computationally efficient models.
Pages: 7-11
Number of pages: 5