VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video Paragraph Captioning

Cited by: 0
Authors
Yamazaki, Kashu [1 ]
Vo, Khoa [1 ]
Truong, Quang Sang [1 ]
Raj, Bhiksha [2 ,3 ]
Le, Ngan [1 ]
Affiliations
[1] Univ Arkansas, AICV Lab, Fayetteville, AR 72701 USA
[2] Carnegie Mellon Univ, Pittsburgh, PA USA
[3] Mohammed Bin Zayed Univ AI, Abu Dhabi, U Arab Emirates
Source
THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 3, 2023
Funding
U.S. National Science Foundation (NSF)
Keywords
DOI
Not available
CLC Number (Chinese Library Classification)
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Video paragraph captioning aims to generate a coherent multi-sentence description of an untrimmed video containing multiple temporal event locations. Following the human perception process, in which a scene is effectively understood by decomposing it into visual (e.g., human, animal) and non-visual components (e.g., actions, relations) under the mutual influence of vision and language, we first propose a visual-linguistic (VL) feature. In the proposed VL feature, the scene is modeled by three modalities: (i) a global visual environment; (ii) local visual main agents; and (iii) linguistic scene elements. We then introduce an autoregressive Transformer-in-Transformer (TinT) to simultaneously capture the semantic coherence of intra- and inter-event content within a video. Finally, we present a new VL contrastive loss function that ensures the learned embedding features are consistent with the caption semantics. Comprehensive experiments and extensive ablation studies on the ActivityNet Captions and YouCookII datasets show that the proposed Visual-Linguistic Transformer-in-Transformer (VLTinT) outperforms previous state-of-the-art methods in both accuracy and diversity. The source code is made publicly available at: https://github.com/UARK-AICV/VLTinT.
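Two components described in the abstract lend themselves to a concrete illustration: the tri-modal VL feature and the VL contrastive loss. The following PyTorch-style code is a minimal sketch of both, assuming a simple projection-and-concatenation fusion over the three modalities and a symmetric InfoNCE-style contrastive formulation; all module names, feature dimensions, and pooling choices here are illustrative assumptions, not the authors' implementation (see the GitHub repository above for the official code).

# Minimal sketch of a tri-modal VL feature fusion and an InfoNCE-style
# VL contrastive loss, loosely following the abstract's description.
# All names, dimensions, and design choices are assumptions made for
# illustration; the official code is at github.com/UARK-AICV/VLTinT.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VLFeature(nn.Module):
    def __init__(self, d_env=2048, d_agent=2048, d_lang=768, d_model=512):
        super().__init__()
        # One projection per modality: (i) global visual environment,
        # (ii) local visual main agents, (iii) linguistic scene elements.
        self.proj_env = nn.Linear(d_env, d_model)
        self.proj_agent = nn.Linear(d_agent, d_model)
        self.proj_lang = nn.Linear(d_lang, d_model)
        self.fuse = nn.Linear(3 * d_model, d_model)

    def forward(self, env, agents, lang):
        # env:    (B, T, d_env)      global environment feature per frame
        # agents: (B, T, N, d_agent) per-frame agent features (mean-pooled)
        # lang:   (B, T, d_lang)     linguistic scene-element feature
        a = self.proj_agent(agents).mean(dim=2)
        x = torch.cat([self.proj_env(env), a, self.proj_lang(lang)], dim=-1)
        return self.fuse(x)  # (B, T, d_model) fused VL feature

def vl_contrastive_loss(video_emb, caption_emb, temperature=0.07):
    # Symmetric InfoNCE over a batch: matched (video, caption) pairs are
    # positives; all other pairings in the batch serve as negatives.
    v = F.normalize(video_emb, dim=-1)    # (B, d)
    c = F.normalize(caption_emb, dim=-1)  # (B, d)
    logits = v @ c.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

In a training loop of this kind, pooled event-level video embeddings and sentence embeddings of their ground-truth captions would be passed to vl_contrastive_loss so that matched video-caption pairs are pulled together in the shared embedding space.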
Pages: 3081 - 3090
Page count: 10
Related Papers
50 records in total (entries [41]-[50] shown below)
  • [41] Li, Ping; Zhang, Pan; Wang, Tao; Xiao, Huaxin. Time-frequency recurrent transformer with diversity constraint for dense video captioning. INFORMATION PROCESSING & MANAGEMENT, 2023, 60 (02).
  • [42] Man, Xin; Ouyang, Deqiang; Li, Xiangpeng; Song, Jingkuan; Shao, Jie. Scenario-Aware Recurrent Transformer for Goal-Directed Video Captioning. ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2022, 18 (04).
  • [43] Li, Liang; Gao, Xingyu; Deng, Jincan; Tu, Yunbin; Zha, Zheng-Jun; Huang, Qingming. Long Short-Term Relation Transformer With Global Gating for Video Captioning. IEEE TRANSACTIONS ON IMAGE PROCESSING, 2022, 31: 2726 - 2738.
  • [44] Nguyen, Van-Quang; Suganuma, Masanori; Okatani, Takayuki. GRIT: Faster and Better Image Captioning Transformer Using Dual Visual Features. COMPUTER VISION, ECCV 2022, PT XXXVI, 2022, 13696: 167 - 184.
  • [45] Chen, Lizhi; Li, Kesen. Dual-adaptive interactive transformer with textual and visual context for image captioning. EXPERT SYSTEMS WITH APPLICATIONS, 2024, 243.
  • [46] Tian, Xiaoyan; Jin, Ye; Zhang, Zhao; Liu, Peng; Tang, Xianglong. Video summarization with temporal-channel visual transformer. PATTERN RECOGNITION, 2025, 165.
  • [47] Artham, Sainithin; Shaikh, Soharab Hossain. A Neural ODE and Transformer-based Model for Temporal Understanding and Dense Video Captioning. MULTIMEDIA TOOLS AND APPLICATIONS, 2024, 83 (23): 64037 - 64056.
  • [48] Zheng, Lihuan; Xu, Wanru; Miao, Zhenjiang; Qiu, Xinxiu; Gong, Shanshan. RESTHT: relation-enhanced spatial-temporal hierarchical transformer for video captioning. VISUAL COMPUTER, 2025, 41 (01): 591 - 604.
  • [49] Ran, Yuting; Fang, Bin; Chen, Lei; Wei, Xuekai; Xian, Weizhi; Zhou, Mingliang. End-to-End Dual-Stream Transformer with a Parallel Encoder for Video Captioning. JOURNAL OF CIRCUITS SYSTEMS AND COMPUTERS, 2024, 33 (04).
  • [50] Sun, Zhixin; Zhong, Xian; Chen, Shuqin; Liu, Wenxuan; Feng, Duxiu; Li, Lin. Modeling Context-Guided Visual and Linguistic Semantic Feature for Video Captioning. ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING, ICANN 2021, PT V, 2021, 12895: 677 - 689.