VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video Paragraph Captioning

Cited by: 0
Authors
Yamazaki, Kashu [1 ]
Vo, Khoa [1 ]
Truong, Quang Sang [1 ]
Raj, Bhiksha [2 ,3 ]
Le, Ngan [1 ]
Affiliations
[1] Univ Arkansas, AICV Lab, Fayetteville, AR 72701 USA
[2] Carnegie Mellon Univ, Pittsburgh, PA USA
[3] Mohammed Bin Zayed Univ AI, Abu Dhabi, U Arab Emirates
Source
THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 3, 2023
Funding
U.S. National Science Foundation (NSF)
Keywords
DOI
Not available
CLC Number (Chinese Library Classification)
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Video paragraph captioning aims to generate a coherent multi-sentence description of an untrimmed video containing multiple temporal event locations. Following the human perception process, in which a scene is effectively understood by decomposing it into visual (e.g., human, animal) and non-visual components (e.g., actions, relations) under the mutual influence of vision and language, we first propose a visual-linguistic (VL) feature. In the proposed VL feature, the scene is modeled by three modalities: (i) a global visual environment; (ii) local visual main agents; and (iii) linguistic scene elements. We then introduce an autoregressive Transformer-in-Transformer (TinT) to simultaneously capture the semantic coherence of intra- and inter-event content within a video. Finally, we present a new VL contrastive loss function that ensures the learned embedding features are consistent with the caption semantics. Comprehensive experiments and extensive ablation studies on the ActivityNet Captions and YouCookII datasets show that the proposed Visual-Linguistic Transformer-in-Transformer (VLTinT) outperforms previous state-of-the-art methods in both accuracy and diversity. The source code is made publicly available at: https://github.com/UARK-AICV/VLTinT.
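Two components described in the abstract lend themselves to a concrete illustration: the tri-modal VL feature and the VL contrastive loss. The following PyTorch-style code is a minimal sketch of both, assuming a simple projection-and-concatenation fusion over the three modalities and a symmetric InfoNCE-style contrastive formulation; all module names, feature dimensions, and pooling choices here are illustrative assumptions, not the authors' implementation (see the GitHub repository above for the official code).

# Minimal sketch of a tri-modal VL feature fusion and an InfoNCE-style
# VL contrastive loss, loosely following the abstract's description.
# All names, dimensions, and design choices are assumptions made for
# illustration; the official code is at github.com/UARK-AICV/VLTinT.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VLFeature(nn.Module):
    def __init__(self, d_env=2048, d_agent=2048, d_lang=768, d_model=512):
        super().__init__()
        # One projection per modality: (i) global visual environment,
        # (ii) local visual main agents, (iii) linguistic scene elements.
        self.proj_env = nn.Linear(d_env, d_model)
        self.proj_agent = nn.Linear(d_agent, d_model)
        self.proj_lang = nn.Linear(d_lang, d_model)
        self.fuse = nn.Linear(3 * d_model, d_model)

    def forward(self, env, agents, lang):
        # env:    (B, T, d_env)      global environment feature per frame
        # agents: (B, T, N, d_agent) per-frame agent features (mean-pooled)
        # lang:   (B, T, d_lang)     linguistic scene-element feature
        a = self.proj_agent(agents).mean(dim=2)
        x = torch.cat([self.proj_env(env), a, self.proj_lang(lang)], dim=-1)
        return self.fuse(x)  # (B, T, d_model) fused VL feature

def vl_contrastive_loss(video_emb, caption_emb, temperature=0.07):
    # Symmetric InfoNCE over a batch: matched (video, caption) pairs are
    # positives; all other pairings in the batch serve as negatives.
    v = F.normalize(video_emb, dim=-1)    # (B, d)
    c = F.normalize(caption_emb, dim=-1)  # (B, d)
    logits = v @ c.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

In a training loop of this kind, pooled event-level video embeddings and sentence embeddings of their ground-truth captions would be passed to vl_contrastive_loss so that matched video-caption pairs are pulled together in the shared embedding space.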
Pages: 3081 - 3090
Page count: 10
Related Papers
50 records in total (entries [41]-[50] shown below)
  • [41] Li, Ping; Zhang, Pan; Wang, Tao; Xiao, Huaxin. Time-frequency recurrent transformer with diversity constraint for dense video captioning. INFORMATION PROCESSING & MANAGEMENT, 2023, 60 (02).
  • [42] Man, Xin; Ouyang, Deqiang; Li, Xiangpeng; Song, Jingkuan; Shao, Jie. Scenario-Aware Recurrent Transformer for Goal-Directed Video Captioning. ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2022, 18 (04).
  • [43] Li, Liang; Gao, Xingyu; Deng, Jincan; Tu, Yunbin; Zha, Zheng-Jun; Huang, Qingming. Long Short-Term Relation Transformer With Global Gating for Video Captioning. IEEE TRANSACTIONS ON IMAGE PROCESSING, 2022, 31: 2726 - 2738.
  • [44] Nguyen, Van-Quang; Suganuma, Masanori; Okatani, Takayuki. GRIT: Faster and Better Image Captioning Transformer Using Dual Visual Features. COMPUTER VISION, ECCV 2022, PT XXXVI, 2022, 13696: 167 - 184.
  • [45] Chen, Lizhi; Li, Kesen. Dual-adaptive interactive transformer with textual and visual context for image captioning. EXPERT SYSTEMS WITH APPLICATIONS, 2024, 243.
  • [46] Tian, Xiaoyan; Jin, Ye; Zhang, Zhao; Liu, Peng; Tang, Xianglong. Video summarization with temporal-channel visual transformer. PATTERN RECOGNITION, 2025, 165.
  • [47] Artham, Sainithin; Shaikh, Soharab Hossain. A Neural ODE and Transformer-based Model for Temporal Understanding and Dense Video Captioning. MULTIMEDIA TOOLS AND APPLICATIONS, 2024, 83 (23): 64037 - 64056.
  • [48] Zheng, Lihuan; Xu, Wanru; Miao, Zhenjiang; Qiu, Xinxiu; Gong, Shanshan. RESTHT: relation-enhanced spatial-temporal hierarchical transformer for video captioning. VISUAL COMPUTER, 2025, 41 (01): 591 - 604.
  • [49] Ran, Yuting; Fang, Bin; Chen, Lei; Wei, Xuekai; Xian, Weizhi; Zhou, Mingliang. End-to-End Dual-Stream Transformer with a Parallel Encoder for Video Captioning. JOURNAL OF CIRCUITS SYSTEMS AND COMPUTERS, 2024, 33 (04).
  • [50] Sun, Zhixin; Zhong, Xian; Chen, Shuqin; Liu, Wenxuan; Feng, Duxiu; Li, Lin. Modeling Context-Guided Visual and Linguistic Semantic Feature for Video Captioning. ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING, ICANN 2021, PT V, 2021, 12895: 677 - 689.