CapFormer: A Space-Time Video Description Model using Joint-Attention Transformer

Cited by: 0
Authors: Moussa, Mahamat [1]; Lim, Chern Hong [1]; Wong, KokSheik [1]
Affiliations: [1] Monash Univ Malaysia, Sch Informat Technol, Subang Jaya, Malaysia
DOI: 10.1109/APSIPAASC58517.2023.10317466
CLC Classification: TP18 [Artificial Intelligence Theory]
Discipline Codes: 081104; 0812; 0835; 1405
Abstract
Transformers are becoming popular in video understanding owing to the recent success of vision transformers. However, video transformers are still an emerging area and require various techniques to handle video tasks such as action detection, classification, and description. These techniques are needed because videos span both spatial and temporal dimensions, which demand different treatments to capture context effectively. The most critical component is the attention block, whose design strongly influences the results. Video description models in particular tend to be complex because of the integration required between visual and language contexts. In this work, we propose a simple yet efficient video description model that relies entirely on an attention mechanism. The model uses a joint-attention mechanism to learn the spatial and temporal context of video frames together with the description context. Surprisingly, this integration suggests a more straightforward way to achieve the same task handled by complex networks, and it is therefore efficient to train. To validate the design, we evaluated the proposed architecture on a large video description dataset (MSR-VTT) and compared its results with various prior works; it shows promising results over other designs in terms of the ROUGE metric.
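The joint-attention idea described in the abstract can be illustrated with a minimal sketch: flattened space-time patch tokens and caption tokens are concatenated and passed through a single self-attention block, so spatial, temporal, and language context interact in one operation. The PyTorch module below is an illustrative assumption, not the paper's implementation; the class name JointAttentionBlock and all dimensions are hypothetical.

# Minimal sketch of joint attention over video and text tokens (illustrative
# assumption; names and dimensions are not taken from the CapFormer paper).
import torch
import torch.nn as nn

class JointAttentionBlock(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, 4 * dim),
            nn.GELU(),
            nn.Linear(4 * dim, dim),
        )

    def forward(self, video_tokens, text_tokens):
        # video_tokens: (B, T*P, D) flattened space-time patch embeddings
        # text_tokens:  (B, L, D)   caption token embeddings
        x = torch.cat([video_tokens, text_tokens], dim=1)
        h = self.norm(x)
        attn_out, _ = self.attn(h, h, h)  # one joint attention over both modalities
        x = x + attn_out                  # residual connection
        x = x + self.mlp(x)               # feed-forward with residual
        n_vid = video_tokens.size(1)      # split streams back apart
        return x[:, :n_vid], x[:, n_vid:]

if __name__ == "__main__":
    B, T, P, L, D = 2, 8, 16, 12, 512  # batch, frames, patches/frame, caption length, width
    vid = torch.randn(B, T * P, D)
    txt = torch.randn(B, L, D)
    v, t = JointAttentionBlock(D)(vid, txt)
    print(v.shape, t.shape)  # torch.Size([2, 128, 512]) torch.Size([2, 12, 512])

Because every video token can attend to every caption token (and vice versa) in a single block, no separate cross-modal fusion network is needed, which is consistent with the abstract's claim of a simpler, attention-only design.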
Pages: 759-764 (6 pages)
Related Papers (50 total)
• [1] Bulat, Adrian; Perez-Rua, Juan-Manuel; Sudhakaran, Swathikiran; Martinez, Brais; Tzimiropoulos, Georgios. Space-time Mixing Attention for Video Transformer. Advances in Neural Information Processing Systems 34 (NeurIPS 2021), 2021.
• [2] Wang, D.; Canagarajah, N.; Bull, D. Space-time multiple description video coding. Visual Communications and Image Processing 2006, Pts. 1-2, 2006, Vol. 6077.
• [3] Xing, Fengchuang; Wang, Yuan-Gen; Wang, Hanpin; Li, Leida; Zhu, Guopu. StarVQA: Space-Time Attention for Video Quality Assessment. 2022 IEEE International Conference on Image Processing (ICIP), 2022: 2326-2330.
• [4] Voronin, V.; Frantc, V.; Marchuk, V.; Shrayfel, I.; Gapon, N.; Agaian, S. Video stabilization using space-time video completion. Mobile Multimedia/Image Processing, Security, and Applications 2016, 2016, Vol. 9869.
• [5] Bertasius, Gedas; Wang, Heng; Torresani, Lorenzo. Is Space-Time Attention All You Need for Video Understanding? International Conference on Machine Learning, Vol. 139, 2021.
• [6] Li, Yuanyuan; Huang, Yuan; Huang, Weijian; Yu, Junhao; Huang, Zheng. An Abstractive Summarization Model Based on Joint-Attention Mechanism and a Priori Knowledge. Applied Sciences, 2023, 13(7).
• [7] Bialynicka-Birula, Z.; Bialynicki-Birula, I. Space-time description of squeezing. Journal of the Optical Society of America B: Optical Physics, 1987, 4(10): 1621-1626.
• [8] Zheng, Minyan; Luo, Jianping. Space-Time Video Super-Resolution 3D Transformer. Multimedia Modeling (MMM 2023), Pt. II, 2023, 13834: 374-385.
• [9] Folse, H. J. Complementarity and space-time description. Bell's Theorem, Quantum Theory and Conceptions of the Universe, 1989, 37: 251-259.
• [10] Lin, S.; Stefanov, A.; Wang, Y. Joint source and space-time block coding for MIMO video communications. VTC2004-Fall: 2004 IEEE 60th Vehicular Technology Conference, 2004: 2508-2512.