Post-Attention Modulator for Dense Video Captioning

Cited: 1
Authors
Guo, Zixin [1 ]
Wang, Tzu-Jui Julius [1 ]
Laaksonen, Jorma [1 ]
Affiliations
[1] Aalto Univ, Sch Sci, Dept Comp Sci, Espoo, Finland
Funding
Academy of Finland
DOI
10.1109/ICPR56361.2022.9956260
CLC Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Dense video captioning (VC) aims at generating a paragraph-long description of the events in video segments. Borrowing from their success in language modeling, Transformer-based models for VC have also been shown effective in modeling cross-domain video-text representations with cross-attention (Xatt). Despite Xatt's effectiveness, the queries and outputs of attention, which come from different domains, tend to be weakly related. In this paper, we argue that this weak relatedness, or domain discrepancy, could impede a model from learning meaningful cross-domain representations. Hence, we propose a simple yet effective Post-Attention Modulator (PAM) that post-processes Xatt's outputs to narrow the discrepancy. Specifically, PAM modulates and enhances the average similarity between Xatt's queries and outputs. The modulated similarities are then utilized as a weighting basis to interpolate PAM's outputs. In our experiments, PAM was applied to two strong VC baselines, VTransformer and MART, with two different video features on the well-known VC benchmark datasets ActivityNet Captions and YouCookII. According to the results, the proposed PAM brings consistent improvements, e.g., up to 14.5% in CIDEr-D, as well as in the other metrics considered, BLEU and METEOR.
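The abstract sketches PAM's mechanism in prose: measure how related the cross-attention queries are to its outputs, modulate that average similarity, and use it as a weight to interpolate the final output. The snippet below is a minimal PyTorch sketch of that reading; the function name `post_attention_modulator`, the sigmoid gate, the `alpha_scale` temperature, and the choice to interpolate back toward the queries are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def post_attention_modulator(queries: torch.Tensor,
                             xatt_out: torch.Tensor,
                             alpha_scale: float = 1.0) -> torch.Tensor:
    """Hedged sketch of a Post-Attention Modulator (PAM).

    queries:  (batch, seq_len, dim) queries fed into cross-attention (Xatt).
    xatt_out: (batch, seq_len, dim) outputs of the same Xatt block.
    """
    # Cosine similarity between each query and its attended output,
    # averaged over the sequence -> one relatedness score per example.
    sim = F.cosine_similarity(queries, xatt_out, dim=-1)      # (batch, seq_len)
    avg_sim = sim.mean(dim=-1, keepdim=True).unsqueeze(-1)    # (batch, 1, 1)

    # "Modulate and enhance" the average similarity; a scaled sigmoid
    # gate is one plausible reading, not necessarily the paper's choice.
    gate = torch.sigmoid(alpha_scale * avg_sim)

    # Interpolate: strong query/output relatedness trusts Xatt's output;
    # weak relatedness falls back toward the query representation.
    return gate * xatt_out + (1.0 - gate) * queries


# Example: modulate the output of one cross-attention step.
q = torch.randn(2, 16, 512)   # text-side queries
v = torch.randn(2, 16, 512)   # Xatt outputs attending over video features
out = post_attention_modulator(q, v, alpha_scale=2.0)
print(out.shape)              # torch.Size([2, 16, 512])
```

Under this reading, examples whose queries and Xatt outputs are already well aligned keep the attention output nearly unchanged, while weakly related ones are pulled back toward the query representation, which is one way to narrow the domain discrepancy the abstract describes.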
Pages: 1536-1542
Page count: 7
Related Papers (50 total)
  • [31] Hierarchical convolutional neural networks with post-attention for speech emotion recognition
    Fan, Yonghong
    Huang, Heming
    Han, Henry
    NEUROCOMPUTING, 2025, 615
  • [32] Traffic Scenario Understanding and Video Captioning via Guidance Attention Captioning Network
    Liu, Chunsheng
    Zhang, Xiao
    Chang, Faliang
    Li, Shuang
    Hao, Penghui
    Lu, Yansha
    Wang, Yinhai
    IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, 2024, 25 (05) : 3615 - 3627
  • [33] Lightweight dense video captioning with cross-modal attention and knowledge-enhanced unbiased scene graph
    Han, Shixing
    Liu, Jin
    Zhang, Jinyingming
    Gong, Peizhu
    Zhang, Xiliang
    He, Huihua
    COMPLEX & INTELLIGENT SYSTEMS, 2023, 9 (05) : 4995 - 5012
  • [35] Multimodal attention-based transformer for video captioning
    Munusamy, Hemalatha
    Chandra Sekhar, C.
    APPLIED INTELLIGENCE, 2023, 53 : 23349 - 23368
  • [36] Correction to: Attention based video captioning framework for Hindi
    Singh, Alok
    Singh, Thoudam Doren
    Bandyopadhyay, Sivaji
    MULTIMEDIA SYSTEMS, 2023, 29 (1) : 453 - 453
  • [37] Saliency-Based Spatiotemporal Attention for Video Captioning
    Chen, Yangyu
    Zhang, Weigang
    Wang, Shuhui
    Li, Liang
    Huang, Qingming
    2018 IEEE FOURTH INTERNATIONAL CONFERENCE ON MULTIMEDIA BIG DATA (BIGMM), 2018
  • [38] Video captioning with stacked attention and semantic hard pull
    Rahman, Md Mushfiqur
    Abedin, Thasin
    Prottoy, Khondokar S. S.
    Moshruba, Ayana
    Siddiqui, Fazlul Hasan
    PEERJ COMPUTER SCIENCE, 2021, 7 : 1 - 18
  • [39] Dense Captioning Method Based on Multi-attention Structure
    Liu Q.-R.
    Li G.
    Zhao C.
    Gu G.-H.
    Zhao Y.
    Zidonghua Xuebao/Acta Automatica Sinica, 2022, 48 (10) : 2537 - 2548
  • [40] An attention based dual learning approach for video captioning
    Ji, Wanting
    Wang, Ruili
    Tian, Yan
    Wang, Xun
    APPLIED SOFT COMPUTING, 2022, 117