A Comprehensive Survey of Recent Transformers in Image, Video and Diffusion Models

Cited by: 2
Authors
Le, Dinh Phu Cuong [1 ,2 ]
Wang, Dong [1 ]
Le, Viet-Tuan [3 ]
Affiliations
[1] Hunan Univ, Coll Comp Sci & Elect Engn, Changsha 410082, Peoples R China
[2] Yersin Univ Da Lat, Fac Informat Technol, Da Lat 66100, Vietnam
[3] Ho Chi Minh City Open Univ, Fac Informat Technol, Ho Chi Minh City 722000, Vietnam
Source
CMC-COMPUTERS MATERIALS & CONTINUA | 2024 / Vol. 80 / Issue 01
Funding
Natural Science Foundation of Hunan Province; National Natural Science Foundation of China
Keywords
Transformer; vision transformer; self-attention; hierarchical transformer; diffusion models;
DOI
10.32604/cmc.2024.050790
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Transformer models have emerged as dominant networks for various tasks in computer vision compared to Convolutional Neural Networks (CNNs). The transformers demonstrate the ability to model long-range dependencies by utilizing a self-attention mechanism. This study aims to provide a comprehensive survey of recent transformer-based approaches in image and video applications, as well as diffusion models. We begin by discussing existing surveys of vision transformers and comparing them to this work. Then, we review the main components of a vanilla transformer network, including the self-attention mechanism, feed-forward network, position encoding, etc. In the main part of this survey, we review recent transformer-based models in three categories: Transformer for downstream tasks, Vision Transformer for Generation, and Vision Transformer for Segmentation. We also provide a comprehensive overview of recent transformer models for video tasks and diffusion models. We compare the performance of various hierarchical transformer networks for multiple tasks on popular benchmark datasets. Finally, we explore some future research directions to further improve the field.
Pages: 37-60
Page count: 24
Related papers
50 in total
  • [31] Efficient CNNs and Transformers for Video Understanding and Image Synthesis
    Gall, Juergen
    PROCEEDINGS OF THE 2023 ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2023, 2023, : 670 - 670
  • [32] A comprehensive survey on applications of transformers for deep learning tasks
    Islam, Saidul
    Elmekki, Hanae
    Elsebai, Ahmed
    Bentahar, Jamal
    Drawel, Nagat
    Rjoub, Gaith
    Pedrycz, Witold
    EXPERT SYSTEMS WITH APPLICATIONS, 2024, 241
  • [33] Towards Transferable Adversarial Attacks on Image and Video Transformers
    Wei, Zhipeng
    Chen, Jingjing
    Goldblum, Micah
    Wu, Zuxuan
    Goldstein, Tom
    Jiang, Yu-Gang
    Davis, Larry S.
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2023, 32 : 6346 - 6358
  • [34] Adversarial attacks and defenses on text-to-image diffusion models: A survey
    Zhang, Chenyu
    Hu, Mingwang
    Li, Wenhui
    Wang, Lanjun
    INFORMATION FUSION, 2025, 114
  • [35] Exploring Image Transformations with Diffusion Models: A Survey of Applications and Implementation Code
    Arellano, Silvia
    Otero, Beatriz
    Tous, Ruben
    MACHINE LEARNING, OPTIMIZATION, AND DATA SCIENCE, LOD 2023, PT II, 2024, 14506 : 19 - 33
  • [36] Video Colorization with Pre-trained Text-to-Image Diffusion Models
    Liu, Hanyuan
    Xie, Minshan
    Xing, Jinbo
    Li, Chengze
    Wong, Tien-Tsin
    arXiv, 2023
  • [37] Ultrasound Image-to-Video Synthesis via Latent Dynamic Diffusion Models
    Chen, Tingxiu
    Shi, Yilei
    Zheng, Zixuan
    Yan, Bingcong
    Hu, Jingliang
    Zhu, Xiao Xiang
    Mou, Lichao
    MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION - MICCAI 2024, PT IV, 2024, 15004 : 764 - 774
  • [38] A survey on image and video stitching
    LYU W.
    ZHOU Z.
    CHEN L.
    ZHOU Y.
    Virtual Reality and Intelligent Hardware, 2019, 1 (01): : 55 - 83
  • [39] Image and video compression: A survey
    Clarke, RJ
    INTERNATIONAL JOURNAL OF IMAGING SYSTEMS AND TECHNOLOGY, 1999, 10 (01) : 20 - 32
  • [40] Image and Video Matting: A Survey
    Wang, Jue
    Cohen, Michael F.
    FOUNDATIONS AND TRENDS IN COMPUTER GRAPHICS AND VISION, 2007, 3 (02): : 97 - 180