A Comprehensive Survey of Recent Transformers in Image, Video and Diffusion Models

Cited by: 2
Authors
Le, Dinh Phu Cuong [1, 2]
Wang, Dong [1]
Le, Viet-Tuan [3]
Affiliations
[1] Hunan Univ, Coll Comp Sci & Elect Engn, Changsha 410082, Peoples R China
[2] Yersin Univ Da Lat, Fac Informat Technol, Da Lat 66100, Vietnam
[3] Ho Chi Minh City Open Univ, Fac Informat Technol, Ho Chi Minh City 722000, Vietnam
Source
CMC-COMPUTERS MATERIALS & CONTINUA | 2024 / Vol. 80 / Issue 01
Funding
Natural Science Foundation of Hunan Province; National Natural Science Foundation of China
Keywords
Transformer; vision transformer; self-attention; hierarchical transformer; diffusion models
DOI
10.32604/cmc.2024.050790
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Transformer models have emerged as dominant networks for various tasks in computer vision compared to Convolutional Neural Networks (CNNs). The transformers demonstrate the ability to model long-range dependencies by utilizing a self-attention mechanism. This study aims to provide a comprehensive survey of recent transformer-based approaches in image and video applications, as well as diffusion models. We begin by discussing existing surveys of vision transformers and comparing them to this work. Then, we review the main components of a vanilla transformer network, including the self-attention mechanism, feed-forward network, position encoding, etc. In the main part of this survey, we review recent transformer-based models in three categories: Transformer for downstream tasks, Vision Transformer for Generation, and Vision Transformer for Segmentation. We also provide a comprehensive overview of recent transformer models for video tasks and diffusion models. We compare the performance of various hierarchical transformer networks for multiple tasks on popular benchmark datasets. Finally, we explore some future research directions to further improve the field.
Pages: 37-60
Page count: 24