Post-Attention Modulator for Dense Video Captioning

Cited by: 1
Authors
Guo, Zixin [1 ]
Wang, Tzu-Jui Julius [1 ]
Laaksonen, Jorma [1 ]
Affiliations
[1] Aalto Univ, Sch Sci, Dept Comp Sci, Espoo, Finland
Funding
Academy of Finland;
Keywords
DOI
10.1109/ICPR56361.2022.9956260
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Dense video captioning (VC) aims at generating a paragraph-long description for events in video segments. Borrowing from the success in language modeling, Transformer-based models for VC have also been shown effective in modeling cross-domain video-text representations with cross-attention (Xatt). Despite Xatt's effectiveness, the queries and outputs of attention, which come from different domains, tend to be weakly related. In this paper, we argue that this weak relatedness, or domain discrepancy, can impede a model from learning meaningful cross-domain representations. Hence, we propose a simple yet effective Post-Attention Modulator (PAM) that post-processes Xatt's outputs to narrow the discrepancy. Specifically, PAM modulates and enhances the average similarity over Xatt's queries and outputs. The modulated similarities are then utilized as a weighting basis to interpolate PAM's outputs. In our experiments, PAM was applied to two strong VC baselines, VTransformer and MART, with two different video features on the well-known VC benchmark datasets ActivityNet Captions and YouCookII. According to the results, the proposed PAM brings consistent improvements of up to 14.5% in CIDEr-D, as well as in the other considered metrics, BLEU and METEOR.
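The abstract's description of PAM can be illustrated with a minimal sketch. Everything below is an assumption for illustration only: the function name, the choice of cosine similarity, the sigmoid modulation, and the `alpha` parameter are hypothetical, since the record does not give the paper's exact formulation. The sketch only shows the general idea of measuring query-output similarity and using the modulated value as an interpolation weight.

```python
import numpy as np

def post_attention_modulator(queries, xatt_out, alpha=1.0):
    """Hypothetical sketch of a Post-Attention Modulator (PAM).

    queries  : (T, d) cross-attention queries (text-side states)
    xatt_out : (T, d) cross-attention outputs (video-conditioned states)
    alpha    : assumed scaling hyperparameter for the modulation

    Measures the average query-output similarity, modulates it into
    a gate, and interpolates between queries and attention outputs.
    """
    # per-position cosine similarity between queries and Xatt outputs
    qn = queries / (np.linalg.norm(queries, axis=-1, keepdims=True) + 1e-8)
    on = xatt_out / (np.linalg.norm(xatt_out, axis=-1, keepdims=True) + 1e-8)
    sim = (qn * on).sum(axis=-1)                 # (T,)

    # modulate the average similarity into a gate in (0, 1)
    gate = 1.0 / (1.0 + np.exp(-alpha * sim.mean()))

    # use the gate as a weighting basis to interpolate the output,
    # pulling weakly related Xatt outputs back toward the query domain
    return gate * xatt_out + (1.0 - gate) * queries
```

In this reading, when queries and outputs are already well aligned the gate approaches 1 and the Xatt output passes through largely unchanged; when they are weakly related, the output is blended toward the query domain, narrowing the discrepancy the paper targets.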
Pages: 1536-1542
Page count: 7
Related Papers
50 records in total
  • [21] MULTIMODAL SEMANTIC ATTENTION NETWORK FOR VIDEO CAPTIONING
    Sun, Liang
    Li, Bing
    Yuan, Chunfeng
    Zha, Zhengjun
    Hu, Weiming
    2019 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2019, : 1300 - 1305
  • [22] Contextual Attention Network for Emotional Video Captioning
    Song, Peipei
    Guo, Dan
    Cheng, Jun
    Wang, Meng
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 1858 - 1867
  • [23] Attention based video captioning framework for Hindi
    Singh, Alok
    Singh, Thoudam Doren
    Bandyopadhyay, Sivaji
    MULTIMEDIA SYSTEMS, 2022, 28 (01) : 195 - 207
  • [24] Jointly Localizing and Describing Events for Dense Video Captioning
    Li, Yehao
    Yao, Ting
    Pan, Yingwei
    Chao, Hongyang
    Mei, Tao
    2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, : 7492 - 7500
  • [25] Step by Step: A Gradual Approach for Dense Video Captioning
    Choi, Wangyu
    Chen, Jiasi
    Yoon, Jongwon
    IEEE ACCESS, 2023, 11 : 51949 - 51959
  • [26] Dense Video Captioning With Early Linguistic Information Fusion
    Aafaq, Nayyer
    Mian, Ajmal
    Akhtar, Naveed
    Liu, Wei
    Shah, Mubarak
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 2309 - 2322
  • [27] Dense video captioning using unsupervised semantic information
    Estevam, Valter
    Laroca, Rayson
    Pedrini, Helio
    Menotti, David
    JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION, 2025, 107
  • [28] Position embedding fusion on transformer for dense video captioning
    Yang, Sixuan
    Tang, Pengjie
    Wang, Hanli
    Li, Qinyu
    DEVELOPMENTS OF ARTIFICIAL INTELLIGENCE TECHNOLOGIES IN COMPUTATION AND ROBOTICS, 2020, 12 : 792 - 799
  • [29] Parallel Pathway Dense Video Captioning With Deformable Transformer
    Choi, Wangyu
    Chen, Jiasi
    Yoon, Jongwon
    IEEE ACCESS, 2022, 10 : 129899 - 129910
  • [30] Leveraging auxiliary image descriptions for dense video captioning
    Boran, Emre
    Erdem, Aykut
    Ikizler-Cinbis, Nazli
    Erdem, Erkut
    Madhyastha, Pranava
    Specia, Lucia
    PATTERN RECOGNITION LETTERS, 2021, 146 : 70 - 76