Post-Attention Modulator for Dense Video Captioning

Cited by: 1
Authors
Guo, Zixin [1]
Wang, Tzu-Jui Julius [1]
Laaksonen, Jorma [1]
Affiliations
[1] Aalto Univ, Sch Sci, Dept Comp Sci, Espoo, Finland
Funding
Academy of Finland
DOI
10.1109/ICPR56361.2022.9956260
Chinese Library Classification
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Dense video captioning (VC) aims at generating a paragraph-long description for events in video segments. Borrowing from their success in language modeling, Transformer-based models for VC have also been shown to be effective in modeling cross-domain video-text representations with cross-attention (Xatt). Despite Xatt's effectiveness, the queries and outputs of attention, which come from different domains, tend to be weakly related. In this paper, we argue that this weak relatedness, or domain discrepancy, could hinder a model from learning meaningful cross-domain representations. Hence, we propose a simple yet effective Post-Attention Modulator (PAM) that post-processes Xatt's outputs to narrow the discrepancy. Specifically, PAM modulates and enhances the average similarity over Xatt's queries and outputs. The modulated similarities are then used as a weighting basis to interpolate PAM's outputs. In our experiments, PAM was applied to two strong VC baselines, VTransformer and MART, with two different video features on the well-known VC benchmark datasets ActivityNet Captions and YouCookII. According to the results, the proposed PAM brings consistent improvements, e.g., of up to 14.5% in CIDEr-D, as well as in the other metrics considered, BLEU and METEOR.
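The abstract does not give PAM's equations, so the following is only a rough illustrative sketch of the general idea of using query/output similarity as an interpolation weight, not the authors' method. It assumes cosine similarity per position, a sigmoid as the "modulation", and a blend between each query and its Xatt output. The function `post_attention_interpolate` and the parameters `scale` and `shift` are hypothetical names introduced here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def sigmoid(x):
    """Logistic function, used here as an assumed stand-in for the modulation step."""
    return 1.0 / (1.0 + math.exp(-x))


def post_attention_interpolate(queries, outputs, scale=1.0, shift=0.0):
    """Blend each query with its cross-attention output.

    Assumption: the gate is sigmoid(scale * cos(q, o) + shift), so positions
    whose output is more similar to the query lean more on the output.
    """
    blended = []
    for q, o in zip(queries, outputs):
        gate = sigmoid(scale * cosine(q, o) + shift)
        blended.append([gate * oi + (1.0 - gate) * qi for qi, oi in zip(q, o)])
    return blended


# Toy data: the first output matches its query, the second is orthogonal to it.
queries = [[1.0, 0.0], [0.0, 1.0]]
outputs = [[1.0, 0.0], [1.0, 0.0]]
result = post_attention_interpolate(queries, outputs)
```

In this toy case the orthogonal pair has similarity 0, so its gate is 0.5 and the result is an even mix of query and output. In a trained model the gating parameters would be learned rather than fixed, and the paper's actual formulation, which works with the average similarity, may differ.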
Pages: 1536-1542
Number of pages: 7