VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video Paragraph Captioning

Cited: 0
Authors
Yamazaki, Kashu [1 ]
Vo, Khoa [1 ]
Truong, Quang Sang [1 ]
Raj, Bhiksha [2 ,3 ]
Le, Ngan [1 ]
Affiliations
[1] Univ Arkansas, AICV Lab, Fayetteville, AR 72701 USA
[2] Carnegie Mellon Univ, Pittsburgh, PA USA
[3] Mohammed Bin Zayed Univ AI, Abu Dhabi, U Arab Emirates
Source
THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 3 | 2023
Funding
US National Science Foundation;
Keywords
DOI
Not available
CLC Classification
TP18 (Artificial Intelligence Theory);
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Video Paragraph Captioning aims to generate a multi-sentence description of an untrimmed video with multiple temporal event locations in a coherent storytelling manner. Following the human perception process, where the scene is effectively understood by decomposing it into visual (e.g. human, animal) and non-visual components (e.g. action, relations) under the mutual influence of vision and language, we first propose a visual-linguistic (VL) feature. In the proposed VL feature, the scene is modeled by three modalities including (i) a global visual environment; (ii) local visual main agents; (iii) linguistic scene elements. We then introduce an autoregressive Transformer-in-Transformer (TinT) to simultaneously capture the semantic coherence of intra- and inter-event contents within a video. Finally, we present a new VL contrastive loss function to guarantee that the learnt embedding features are consistent with the caption semantics. Comprehensive experiments and extensive ablation studies on the ActivityNet Captions and YouCookII datasets show that the proposed Visual-Linguistic Transformer-in-Transformer (VLTinT) outperforms previous state-of-the-art methods in terms of accuracy and diversity. The source code is made publicly available at: https://github.com/UARK-AICV/VLTinT.
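The abstract does not spell out the form of the VL contrastive loss. A common formulation for aligning two embedding spaces in this way is a symmetric InfoNCE-style objective over matched video-caption pairs; the sketch below illustrates that generic idea in NumPy under that assumption, and is not the paper's exact implementation (the function name and temperature value are hypothetical).

```python
import numpy as np

def vl_contrastive_loss(visual_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style contrastive loss between L2-normalized
    visual and caption embeddings; row i of each matrix is a matched pair."""
    v = visual_emb / np.linalg.norm(visual_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature          # (B, B) pairwise similarities
    labels = np.arange(len(v))              # positives lie on the diagonal

    def ce(lg):
        # cross-entropy of each row against its diagonal positive
        lg = lg - lg.max(axis=1, keepdims=True)            # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[np.arange(len(lg)), labels].mean()

    # average the video->text and text->video directions
    return 0.5 * (ce(logits) + ce(logits.T))
```

With this objective, embeddings of matched video-caption pairs are pulled together while mismatched pairs in the batch are pushed apart, which is the usual mechanism for keeping learnt features consistent with caption semantics.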
Pages: 3081-3090 (10 pages)
Related Papers
50 items in total
  • [31] Interaction augmented transformer with decoupled decoding for video captioning
    Jin, Tao
    Zhao, Zhou
    Wang, Peng
    Yu, Jun
    Wu, Fei
    NEUROCOMPUTING, 2022, 492 : 496 - 507
  • [32] Bridging Video and Text: A Two-Step Polishing Transformer for Video Captioning
    Xu, Wanru
    Miao, Zhenjiang
    Yu, Jian
    Tian, Yi
    Wan, Lili
    Ji, Qiang
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2022, 32 (09) : 6293 - 6307
  • [33] Image Captioning using Visual Attention and Detection Transformer Model
    Eluri, Yaswanth
    Vinutha, N.
    Jeevika, M.
    Sree, Sai Bhavya N.
    Abhiram, G. Surya
    10TH INTERNATIONAL CONFERENCE ON ELECTRONICS, COMPUTING AND COMMUNICATION TECHNOLOGIES, CONECCT 2024, 2024,
  • [34] Dual-visual collaborative enhanced transformer for image captioning
    Mou, Zhenping
    Song, Tianqi
    Luo, Hong
    MULTIMEDIA SYSTEMS, 2025, 31 (02)
  • [35] LiVLR: A Lightweight Visual-Linguistic Reasoning Framework for Video Question Answering
    Jiang, Jingjing
    Liu, Ziyi
    Zheng, Nanning
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 5002 - 5013
  • [36] Label-attention transformer with geometrically coherent objects for image captioning
    Dubey, Shikha
    Olimov, Farrukh
    Rafique, Muhammad Aasim
    Kim, Joonmo
    Jeon, Moongu
    INFORMATION SCIENCES, 2023, 623 : 812 - 831
  • [37] Hierarchical Time-Aware Summarization with an Adaptive Transformer for Video Captioning
    Cardoso, Leonardo Vilela
    Guimaraes, Silvio Jamil Ferzoli
    do Patrocinio Jr, Zenilton Kleber Goncalves
    INTERNATIONAL JOURNAL OF SEMANTIC COMPUTING, 2023, 17 (04) : 569 - 592
  • [38] Multimodal Transformer With Multi-View Visual Representation for Image Captioning
    Yu, Jun
    Li, Jing
    Yu, Zhou
    Huang, Qingming
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2020, 30 (12) : 4467 - 4480
  • [39] Visual Rotated Position Encoding Transformer for Remote Sensing Image Captioning
    Liu, Anli
    Meng, Lingwu
    Xiao, Liang
    IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, 2024, 17 : 20026 - 20040
  • [40] TraVL: Transferring Pre-trained Visual-Linguistic Models for Cross-Lingual Image Captioning
    Zhang, Zhebin
    Lu, Peng
    Jiang, Dawei
    Chen, Gang
    WEB AND BIG DATA, PT II, APWEB-WAIM 2022, 2023, 13422 : 341 - 355