CapFormer: A Space-Time Video Description Model using Joint-Attention Transformer

Cited by: 0
Authors: Moussa, Mahamat [1]; Lim, Chern Hong [1]; Wong, KokSheik [1]
Affiliations: [1] Monash Univ Malaysia, Sch Informat Technol, Subang Jaya, Malaysia
DOI: 10.1109/APSIPAASC58517.2023.10317466
CLC Classification: TP18 [Artificial Intelligence Theory]
Discipline Codes: 081104; 0812; 0835; 1405
Abstract
Transformers are becoming popular in video understanding owing to the recent success of vision transformers. However, video transformers are still an emerging area and require various techniques to handle video tasks such as action detection, classification, and description. These techniques are needed because videos span both spatial and temporal dimensions, which demand different treatments to capture context effectively. The most critical component is the attention block, whose design strongly influences the results. Video description models in particular tend to be complex because of the integration required between visual and language contexts. In this work, we propose a simple yet efficient video description model that relies entirely on an attention mechanism. The model uses a joint-attention mechanism to learn the spatial and temporal context of video frames together with the description context. Surprisingly, this integration suggests a more straightforward way to achieve the same task handled by complex networks, and it is therefore efficient to train. To validate the design, we evaluated the proposed architecture on a large video description dataset (MSR-VTT) and compared its results with various prior works; it shows promising results over other designs in terms of the ROUGE metric.
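The joint-attention idea described in the abstract can be illustrated with a minimal sketch: flattened space-time patch tokens and caption tokens are concatenated and passed through a single self-attention block, so spatial, temporal, and language context interact in one operation. The PyTorch module below is an illustrative assumption, not the paper's implementation; the class name JointAttentionBlock and all dimensions are hypothetical.

# Minimal sketch of joint attention over video and text tokens (illustrative
# assumption; names and dimensions are not taken from the CapFormer paper).
import torch
import torch.nn as nn

class JointAttentionBlock(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, 4 * dim),
            nn.GELU(),
            nn.Linear(4 * dim, dim),
        )

    def forward(self, video_tokens, text_tokens):
        # video_tokens: (B, T*P, D) flattened space-time patch embeddings
        # text_tokens:  (B, L, D)   caption token embeddings
        x = torch.cat([video_tokens, text_tokens], dim=1)
        h = self.norm(x)
        attn_out, _ = self.attn(h, h, h)  # one joint attention over both modalities
        x = x + attn_out                  # residual connection
        x = x + self.mlp(x)               # feed-forward with residual
        n_vid = video_tokens.size(1)      # split streams back apart
        return x[:, :n_vid], x[:, n_vid:]

if __name__ == "__main__":
    B, T, P, L, D = 2, 8, 16, 12, 512  # batch, frames, patches/frame, caption length, width
    vid = torch.randn(B, T * P, D)
    txt = torch.randn(B, L, D)
    v, t = JointAttentionBlock(D)(vid, txt)
    print(v.shape, t.shape)  # torch.Size([2, 128, 512]) torch.Size([2, 12, 512])

Because every video token can attend to every caption token (and vice versa) in a single block, no separate cross-modal fusion network is needed, which is consistent with the abstract's claim of a simpler, attention-only design.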
Pages: 759-764 (6 pages)
Related Papers (50 total)
• [1] Bulat, Adrian; Perez-Rua, Juan-Manuel; Sudhakaran, Swathikiran; Martinez, Brais; Tzimiropoulos, Georgios. Space-time Mixing Attention for Video Transformer. Advances in Neural Information Processing Systems 34 (NeurIPS 2021), 2021.
• [2] Wang, D.; Canagarajah, N.; Bull, D. Space-time multiple description video coding. Visual Communications and Image Processing 2006, Pts. 1-2, 2006, Vol. 6077.
• [3] Xing, Fengchuang; Wang, Yuan-Gen; Wang, Hanpin; Li, Leida; Zhu, Guopu. StarVQA: Space-Time Attention for Video Quality Assessment. 2022 IEEE International Conference on Image Processing (ICIP), 2022: 2326-2330.
• [4] Voronin, V.; Frantc, V.; Marchuk, V.; Shrayfel, I.; Gapon, N.; Agaian, S. Video stabilization using space-time video completion. Mobile Multimedia/Image Processing, Security, and Applications 2016, 2016, Vol. 9869.
• [5] Bertasius, Gedas; Wang, Heng; Torresani, Lorenzo. Is Space-Time Attention All You Need for Video Understanding? International Conference on Machine Learning, Vol. 139, 2021.
• [6] Li, Yuanyuan; Huang, Yuan; Huang, Weijian; Yu, Junhao; Huang, Zheng. An Abstractive Summarization Model Based on Joint-Attention Mechanism and a Priori Knowledge. Applied Sciences, 2023, 13(7).
• [7] Bialynicka-Birula, Z.; Bialynicki-Birula, I. Space-time description of squeezing. Journal of the Optical Society of America B: Optical Physics, 1987, 4(10): 1621-1626.
• [8] Zheng, Minyan; Luo, Jianping. Space-Time Video Super-Resolution 3D Transformer. Multimedia Modeling (MMM 2023), Pt. II, 2023, 13834: 374-385.
• [9] Folse, H. J. Complementarity and space-time description. Bell's Theorem, Quantum Theory and Conceptions of the Universe, 1989, 37: 251-259.
• [10] Lin, S.; Stefanov, A.; Wang, Y. Joint source and space-time block coding for MIMO video communications. VTC2004-Fall: 2004 IEEE 60th Vehicular Technology Conference, 2004: 2508-2512.