Shotluck Holmes: A Family of Efficient Small-Scale Large Language Vision Models for Video Captioning and Summarization

Cited by: 0
Authors
Luo, Richard [1 ]
Peng, Austin [1 ]
Vasudev, Adithya [1 ]
Jain, Rishabh [1 ]
Affiliations
[1] Georgia Institute of Technology, Atlanta, GA 30332, USA
Keywords
Deep Learning; Multimodal Models; Large Language Models; Machine Learning; Natural Language Processing; Vision; Vision-Language Models
DOI
10.1145/3689091.3690086
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Video is an increasingly prominent and information-dense medium, yet it poses substantial challenges for language models. A typical video consists of a sequence of shorter segments, or shots, that collectively form a coherent narrative. Each shot is analogous to a word in a sentence, but one in which multiple data streams (such as visual and auditory information) must be processed simultaneously. Comprehending the entire video therefore requires not only understanding the visual-audio content of each shot but also linking the ideas across shots into a larger, all-encompassing story. Despite significant progress in the field, current works often overlook this more granular, shot-by-shot semantic information. In this project, we propose Shotluck Holmes, a family of efficient large language vision models (LLVMs) for video summarization and captioning. By leveraging better pretraining and data collection strategies, we extend the abilities of existing small LLVMs from understanding a single picture to understanding a sequence of frames. Specifically, we show that Shotluck Holmes outperforms state-of-the-art results on the Shot2Story video captioning and summarization task with significantly smaller and more computationally efficient models.
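The abstract describes extending a single-image LLVM to operate over a shot's sequence of frames. As a rough illustration only (this is not the authors' released implementation), the sketch below shows one common pattern for doing so: uniformly sample frames from a shot, encode each frame into patch tokens with a shared vision encoder, project those tokens into the language model's embedding space, and flatten them into one visual-token sequence that is prepended to the text prompt. The names ToyFrameEncoder and ShotEncoder and the dimensions embed_dim and llm_dim are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of shot-level visual tokenization (not the authors'
# code): sample T frames from a shot, encode each into patch tokens with a
# shared encoder, project into the LLM embedding space, and flatten into a
# single visual-token sequence for the language model.
import torch
import torch.nn as nn


class ToyFrameEncoder(nn.Module):
    """Stand-in for a pretrained ViT: maps each 224x224 frame to 16 patch tokens."""

    def __init__(self, embed_dim: int = 64):
        super().__init__()
        # 224 / 56 = 4, so each frame yields a 4x4 grid of patch tokens.
        self.proj = nn.Conv2d(3, embed_dim, kernel_size=56, stride=56)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        patches = self.proj(frames)                # (T, D, 4, 4)
        return patches.flatten(2).transpose(1, 2)  # (T, 16, D)


class ShotEncoder(nn.Module):
    """Turn one shot (T sampled frames) into a flat sequence of visual tokens."""

    def __init__(self, vision_encoder: nn.Module, embed_dim: int, llm_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder            # shared across all frames
        self.projector = nn.Linear(embed_dim, llm_dim)  # visual -> LLM token space

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (T, C, H, W), e.g. uniformly sampled from one shot
        tokens = self.vision_encoder(frames)  # (T, N, embed_dim)
        tokens = self.projector(tokens)       # (T, N, llm_dim)
        # Concatenate per-frame tokens in time order; the resulting sequence
        # would be prepended to the text-prompt embeddings of the LLM.
        return tokens.flatten(0, 1)           # (T * N, llm_dim)


if __name__ == "__main__":
    shot = torch.randn(8, 3, 224, 224)  # 8 frames sampled from a single shot
    encoder = ShotEncoder(ToyFrameEncoder(embed_dim=64), embed_dim=64, llm_dim=128)
    print(encoder(shot).shape)  # torch.Size([128, 128]): 8 frames x 16 tokens each
```

In a real pipeline of this kind, ToyFrameEncoder would be replaced by a pretrained vision tower and the projector and language model would be fine-tuned jointly, as is common in LLaVA-style systems; the paper itself should be consulted for Shotluck Holmes's actual architecture and training recipe.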
Pages: 7-11
Page count: 5