SBAT: Video Captioning with Sparse Boundary-Aware Transformer

Times cited: 0
Authors
Jin, Tao [1]
Huang, Siyu [2]
Chen, Ming [3]
Li, Yingming [1]
Zhang, Zhongfei [4]
Affiliations
[1] Zhejiang Univ, Coll Informat Sci & Elect Engn, Hangzhou, Peoples R China
[2] Baidu Res, Shanghai, Peoples R China
[3] Alibaba Grp, Hangzhou, Peoples R China
[4] Binghamton Univ, Dept Comp Sci, Binghamton, NY USA
Funding
National Key Research and Development Program of China
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
In this paper, we focus on how to apply the transformer architecture effectively to video captioning. The vanilla transformer was proposed for unimodal language generation tasks such as machine translation. Video captioning, however, is a multimodal learning problem, and video features are highly redundant across time steps. To address these issues, we propose a novel method called the sparse boundary-aware transformer (SBAT) to reduce the redundancy in the video representation. SBAT applies a boundary-aware pooling operation to the scores from multi-head attention and selects diverse features from different scenarios. SBAT also includes a local correlation scheme to compensate for the loss of local information caused by the sparse operation. Building on SBAT, we further propose an aligned cross-modal encoding scheme to strengthen multimodal interaction. Experimental results on two benchmark datasets show that SBAT outperforms state-of-the-art methods on most metrics.
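The abstract does not specify how boundaries are detected or how the attention scores are pooled, so the following is only one plausible reading, not the authors' implementation. It assumes that a scene boundary lies where adjacent frame features change most (largest cosine distance), max-pools the attention scores inside each boundary-delimited segment so that each query keeps one representative key per segment, and keeps a small window of keys around each query as a stand-in for the local correlation scheme. All function names and the parameters `num_boundaries` and `window` are hypothetical; learned projections, multiple heads and the aligned cross-modal encoder are omitted.

```python
import math
from typing import List

Vector = List[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """Return 1 - cosine similarity (0.0 when a and b point the same way)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat an all-zero frame as maximally different
    return 1.0 - dot / (norm_a * norm_b)


def find_boundaries(frames: List[Vector], num_boundaries: int) -> List[int]:
    """HYPOTHETICAL boundary detector: the indices t where frame t differs most
    from frame t-1 (the num_boundaries largest changes), in temporal order."""
    changes = [(cosine_distance(frames[t - 1], frames[t]), t) for t in range(1, len(frames))]
    changes.sort(reverse=True)
    return sorted(t for _, t in changes[:num_boundaries])


def segments_from_boundaries(length: int, boundaries: List[int]) -> List[range]:
    """Split positions 0..length-1 into contiguous segments, one per scene."""
    starts = [0] + [b for b in boundaries if 0 < b < length]
    ends = starts[1:] + [length]
    return [range(s, e) for s, e in zip(starts, ends)]


def sparse_attention_weights(scores: Vector, segments: List[range],
                             query_pos: int, window: int) -> Vector:
    """Assumed boundary-aware pooling: keep only the highest score in each segment.
    Assumed local correlation: also keep every key within `window` steps of the query.
    All other keys are masked out (weight 0) before the softmax."""
    keep = {max(seg, key=lambda t: scores[t]) for seg in segments}
    lo, hi = max(0, query_pos - window), min(len(scores), query_pos + window + 1)
    keep.update(range(lo, hi))
    peak = max(scores[t] for t in keep)  # subtract the max for numerical stability
    exps = [math.exp(scores[t] - peak) if t in keep else 0.0 for t in range(len(scores))]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_self_attention(frames: List[Vector], num_boundaries: int = 2,
                          window: int = 1) -> List[Vector]:
    """Single-head scaled dot-product self-attention over frame features, using
    the raw features as queries, keys and values (no learned projections)."""
    dim = len(frames[0])
    segments = segments_from_boundaries(len(frames), find_boundaries(frames, num_boundaries))
    outputs = []
    for i, query in enumerate(frames):
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in frames]
        weights = sparse_attention_weights(scores, segments, i, window)
        outputs.append([sum(w * frame[j] for w, frame in zip(weights, frames)) for j in range(dim)])
    return outputs


if __name__ == "__main__":
    # Three toy "scenes" of two frames each, with 2-D features.
    frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
    print(find_boundaries(frames, 2))  # [2, 4]
    print(sparse_self_attention(frames)[0])
```

Masking the non-selected keys before the softmax is what makes this attention sparse: each query attends to one representative frame per segment plus its immediate temporal neighbors, instead of spreading weight over many near-duplicate frames. The local window is there because pure per-segment pooling would otherwise discard the fine-grained context right around the query.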
Pages: 630 - 636
Page count: 7
Related papers
50 in total
  • [1] Hierarchical Boundary-Aware Neural Encoder for Video Captioning
    Baraldi, Lorenzo
    Grana, Costantino
    Cucchiara, Rita
    30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017: 3185 - 3194
  • [2] Video captioning with boundary-aware hierarchical language decoding and joint video prediction
    Shi, Xiangxi
    Cai, Jianfei
    Gu, Jiuxiang
    Joty, Shafiq
    NEUROCOMPUTING, 2020, 417 : 347 - 356
  • [3] Boundary-Aware Face Alignment with Enhanced HourglassNet and Transformer
    Li, Yingxin
    Niu, Dongmei
    Peng, Jingliang
    APSIPA TRANSACTIONS ON SIGNAL AND INFORMATION PROCESSING, 2023, 12 (01)
  • [4] VIDEO SEGMENTATION VIA BOUNDARY-AWARE FLOW
    Chen, Ding-Jie
    Chen, Hwann-Tzong
    Chang, Long-Wen
    2017 24TH IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2017: 3340 - 3344
  • [5] Faster Boundary-aware Transformer for Breast Cancer Segmentation
    Zhou, Xin
    Yin, Xiaoxia
    2023 15TH INTERNATIONAL CONFERENCE ON ADVANCED COMPUTATIONAL INTELLIGENCE, ICACI, 2023
  • [6] Lightweight Boundary-Aware Face Alignment with Compressed HourglassNet and Transformer
    Wang, Wenhui
    Li, Yingxin
    Li, Ziqiang
    Peng, Jingliang
    APSIPA TRANSACTIONS ON SIGNAL AND INFORMATION PROCESSING, 2023, 12 (01)
  • [7] A Boundary-aware Distillation Network for Compressed Video Semantic Segmentation
    Lu, Hongchao
    Deng, Zhidong
    2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2021: 5354 - 5359
  • [8] Boundary-Aware Noise-Resistant Video Moment Retrieval
    Yu, Fengzhen
    Gu, Xiaodong
    ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING-ICANN 2024, PT III, 2024, 15018 : 193 - 206
  • [9] Attentive Boundary-Aware Fusion for Defect Semantic Segmentation Using Transformer
    Yeung, Ching-Chi
    Lam, Kin-Man
    IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT, 2023, 72
  • [10] Pre-training-driven Multimodal Boundary-aware Vision Transformer
    Shi Z.-N.
    Chen H.-P.
    Zhang D.
    Shen X.-J.
    Ruan Jian Xue Bao/Journal of Software, 2023, 34 (05): 2051 - 2067