A Comprehensive Survey of Recent Transformers in Image, Video and Diffusion Models

Cited by: 2
Authors
Le, Dinh Phu Cuong [1, 2]
Wang, Dong [1 ]
Le, Viet-Tuan [3 ]
Affiliations
[1] Hunan Univ, Coll Comp Sci & Elect Engn, Changsha 410082, Peoples R China
[2] Yersin Univ Da Lat, Fac Informat Technol, Da Lat 66100, Vietnam
[3] Ho Chi Minh City Open Univ, Fac Informat Technol, Ho Chi Minh City 722000, Vietnam
Source
CMC-COMPUTERS MATERIALS & CONTINUA | 2024 / Vol. 80 / Issue 01
Funding
Natural Science Foundation of Hunan Province; National Natural Science Foundation of China
Keywords
Transformer; vision transformer; self-attention; hierarchical transformer; diffusion models;
DOI
10.32604/cmc.2024.050790
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Transformer models have emerged as dominant networks for various computer vision tasks, compared to Convolutional Neural Networks (CNNs). Transformers can model long-range dependencies by using a self-attention mechanism. This study aims to provide a comprehensive survey of recent transformer-based approaches in image and video applications, as well as diffusion models. We begin by discussing existing surveys of vision transformers and comparing them to this work. Then, we review the main components of a vanilla transformer network, including the self-attention mechanism, feed-forward network, position encoding, etc. In the main part of this survey, we review recent transformer-based models in three categories: Transformer for downstream tasks, Vision Transformer for Generation, and Vision Transformer for Segmentation. We also provide a comprehensive overview of recent transformer models for video tasks and diffusion models. We compare the performance of various hierarchical transformer networks for multiple tasks on popular benchmark datasets. Finally, we explore some future research directions to further improve the field.
Pages: 37 - 60
Page count: 24
Related papers
50 in total
  • [41] A comprehensive survey of procedural video datasets
    Tan, Hui Li
    Zhu, Hongyuan
    Lim, Joo-Hwee
    Tan, Cheston
    COMPUTER VISION AND IMAGE UNDERSTANDING, 2021, 202
  • [42] Video Skimming: Taxonomy and Comprehensive Survey
    Vivekraj, V. K.
    Sen, Debashis
    Raman, Balasubramanian
    ACM COMPUTING SURVEYS, 2019, 52 (05)
  • [43] Video Frame Interpolation: A Comprehensive Survey
    Dong, Jiong
    Ota, Kaoru
    Dong, Mianxiong
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2023, 19 (02)
  • [44] A Comprehensive Survey of Image Steganography
    Kalaiarasi, G.
    Sudharani, B.
    Jonnalagadda, Sharon Christiana
    Battula, Harsha Vardhan
    Sanagala, Bhavana
    2ND INTERNATIONAL CONFERENCE ON SUSTAINABLE COMPUTING AND SMART SYSTEMS, ICSCSS 2024, 2024, : 1225 - 1229
  • [45] Comprehensive Analysis of Models and Operational Characteristics of Piezoelectric Transformers
    Wang, Le
    Burgos, Rolando P.
    2020 THIRTY-FIFTH ANNUAL IEEE APPLIED POWER ELECTRONICS CONFERENCE AND EXPOSITION (APEC 2020), 2020, : 1422 - 1429
  • [46] Bridging the metrics gap in image style transfer: A comprehensive survey of models and criteria
    Zhou, Xiaotong
    Zheng, Yuhui
    Yang, Junming
    NEUROCOMPUTING, 2025, 624
  • [47] Latent Diffusion Models for Image Watermarking: A Review of Recent Trends and Future Directions
    Hur, Hongjun
    Kang, Minjae
    Seo, Sanghyeok
    Hou, Jong-Uk
    ELECTRONICS, 2025, 14 (01)
  • [48] Recent Advances in LoRa: A Comprehensive Survey
    Sun, Zehua
    Yang, Huanqi
    Liu, Kai
    Yin, Zhimeng
    Li, Zhenjiang
    Xu, Weitao
    ACM TRANSACTIONS ON SENSOR NETWORKS, 2022, 18 (04)
  • [49] Comprehensive Survey of OLAP Models
    Kaur, Harkiran
    Kaur, Gursimran
    HARMONY SEARCH AND NATURE INSPIRED OPTIMIZATION ALGORITHMS, 2019, 741 : 415 - 422
  • [50] Recent advances in image and video retrieval
    O'Connor, NE
    Kompatsiaris, I
    IEE PROCEEDINGS-VISION IMAGE AND SIGNAL PROCESSING, 2005, 152 (06): 851 - 851