A Comprehensive Survey of Recent Transformers in Image, Video and Diffusion Models

Cited by: 2

Authors:

Le, Dinh Phu Cuong [1,2]

Wang, Dong [1]

Le, Viet-Tuan [3]

Affiliations:

[1] Hunan Univ, Coll Comp Sci & Elect Engn, Changsha 410082, Peoples R China

[2] Yersin Univ Da Lat, Fac Informat Technol, Da Lat 66100, Vietnam

[3] Ho Chi Minh City Open Univ, Fac Informat Technol, Ho Chi Minh City 722000, Vietnam

Source:

CMC-COMPUTERS MATERIALS & CONTINUA | 2024 / Vol. 80 / No. 01

Funding:

Natural Science Foundation of Hunan Province; National Natural Science Foundation of China;

Keywords:

Transformer; vision transformer; self-attention; hierarchical transformer; diffusion models;

DOI:

10.32604/cmc.2024.050790

Chinese Library Classification:

TP [Automation technology, computer technology];

Discipline classification code:

0812;

Abstract:

Transformer models have emerged as dominant networks for various tasks in computer vision compared to Convolutional Neural Networks (CNNs). The transformers demonstrate the ability to model long-range dependencies by utilizing a self-attention mechanism. This study aims to provide a comprehensive survey of recent transformer-based approaches in image and video applications, as well as diffusion models. We begin by discussing existing surveys of vision transformers and comparing them to this work. Then, we review the main components of a vanilla transformer network, including the self-attention mechanism, feed-forward network, position encoding, etc. In the main part of this survey, we review recent transformer-based models in three categories: Transformer for downstream tasks, Vision Transformer for Generation, and Vision Transformer for Segmentation. We also provide a comprehensive overview of recent transformer models for video tasks and diffusion models. We compare the performance of various hierarchical transformer networks for multiple tasks on popular benchmark datasets. Finally, we explore some future research directions to further improve the field.

Pages: 37-60

Number of pages: 24
