VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video Paragraph Captioning

Cited: 0
Authors
Yamazaki, Kashu [1 ]
Vo, Khoa [1 ]
Truong, Quang Sang [1 ]
Raj, Bhiksha [2 ,3 ]
Le, Ngan [1 ]
Affiliations
[1] Univ Arkansas, AICV Lab, Fayetteville, AR 72701 USA
[2] Carnegie Mellon Univ, Pittsburgh, PA USA
[3] Mohammed Bin Zayed Univ AI, Abu Dhabi, U Arab Emirates
Source
THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 3 | 2023
Funding
US National Science Foundation;
Keywords
DOI
Not available
CLC Classification
TP18 (Artificial Intelligence Theory);
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Video Paragraph Captioning aims to generate a multi-sentence description of an untrimmed video with multiple temporal event locations in a coherent storytelling manner. Following the human perception process, where the scene is effectively understood by decomposing it into visual (e.g. human, animal) and non-visual components (e.g. action, relations) under the mutual influence of vision and language, we first propose a visual-linguistic (VL) feature. In the proposed VL feature, the scene is modeled by three modalities including (i) a global visual environment; (ii) local visual main agents; (iii) linguistic scene elements. We then introduce an autoregressive Transformer-in-Transformer (TinT) to simultaneously capture the semantic coherence of intra- and inter-event contents within a video. Finally, we present a new VL contrastive loss function to guarantee that the learnt embedding features are consistent with the caption semantics. Comprehensive experiments and extensive ablation studies on the ActivityNet Captions and YouCookII datasets show that the proposed Visual-Linguistic Transformer-in-Transformer (VLTinT) outperforms previous state-of-the-art methods in terms of accuracy and diversity. The source code is made publicly available at: https://github.com/UARK-AICV/VLTinT.
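The abstract does not spell out the form of the VL contrastive loss. A common formulation for aligning two embedding spaces in this way is a symmetric InfoNCE-style objective over matched video-caption pairs; the sketch below illustrates that generic idea in NumPy under that assumption, and is not the paper's exact implementation (the function name and temperature value are hypothetical).

```python
import numpy as np

def vl_contrastive_loss(visual_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style contrastive loss between L2-normalized
    visual and caption embeddings; row i of each matrix is a matched pair."""
    v = visual_emb / np.linalg.norm(visual_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature          # (B, B) pairwise similarities
    labels = np.arange(len(v))              # positives lie on the diagonal

    def ce(lg):
        # cross-entropy of each row against its diagonal positive
        lg = lg - lg.max(axis=1, keepdims=True)            # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[np.arange(len(lg)), labels].mean()

    # average the video->text and text->video directions
    return 0.5 * (ce(logits) + ce(logits.T))
```

With this objective, embeddings of matched video-caption pairs are pulled together while mismatched pairs in the batch are pushed apart, which is the usual mechanism for keeping learnt features consistent with caption semantics.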
Pages: 3081-3090 (10 pages)
Related Papers
50 items in total
  • [31] Interaction augmented transformer with decoupled decoding for video captioning
    Jin, Tao
    Zhao, Zhou
    Wang, Peng
    Yu, Jun
    Wu, Fei
    NEUROCOMPUTING, 2022, 492 : 496 - 507
  • [32] Bridging Video and Text: A Two-Step Polishing Transformer for Video Captioning
    Xu, Wanru
    Miao, Zhenjiang
    Yu, Jian
    Tian, Yi
    Wan, Lili
    Ji, Qiang
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2022, 32 (09) : 6293 - 6307
  • [33] Image Captioning using Visual Attention and Detection Transformer Model
    Eluri, Yaswanth
    Vinutha, N.
    Jeevika, M.
    Sree, Sai Bhavya N.
    Abhiram, G. Surya
    10TH INTERNATIONAL CONFERENCE ON ELECTRONICS, COMPUTING AND COMMUNICATION TECHNOLOGIES, CONECCT 2024, 2024,
  • [34] Dual-visual collaborative enhanced transformer for image captioning
    Mou, Zhenping
    Song, Tianqi
    Luo, Hong
    MULTIMEDIA SYSTEMS, 2025, 31 (02)
  • [35] LiVLR: A Lightweight Visual-Linguistic Reasoning Framework for Video Question Answering
    Jiang, Jingjing
    Liu, Ziyi
    Zheng, Nanning
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 5002 - 5013
  • [36] Label-attention transformer with geometrically coherent objects for image captioning
    Dubey, Shikha
    Olimov, Farrukh
    Rafique, Muhammad Aasim
    Kim, Joonmo
    Jeon, Moongu
    INFORMATION SCIENCES, 2023, 623 : 812 - 831
  • [37] Hierarchical Time-Aware Summarization with an Adaptive Transformer for Video Captioning
    Cardoso, Leonardo Vilela
    Guimaraes, Silvio Jamil Ferzoli
    do Patrocinio Jr, Zenilton Kleber Goncalves
    INTERNATIONAL JOURNAL OF SEMANTIC COMPUTING, 2023, 17 (04) : 569 - 592
  • [38] Multimodal Transformer With Multi-View Visual Representation for Image Captioning
    Yu, Jun
    Li, Jing
    Yu, Zhou
    Huang, Qingming
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2020, 30 (12) : 4467 - 4480
  • [39] Visual Rotated Position Encoding Transformer for Remote Sensing Image Captioning
    Liu, Anli
    Meng, Lingwu
    Xiao, Liang
    IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, 2024, 17 : 20026 - 20040
  • [40] TraVL: Transferring Pre-trained Visual-Linguistic Models for Cross-Lingual Image Captioning
    Zhang, Zhebin
    Lu, Peng
    Jiang, Dawei
    Chen, Gang
    WEB AND BIG DATA, PT II, APWEB-WAIM 2022, 2023, 13422 : 341 - 355