A Comprehensive Survey of Recent Transformers in Image, Video and Diffusion Models

Cited by: 2

Authors:

Le, Dinh Phu Cuong ^{[1,2]}

Wang, Dong ^{[1]}

Le, Viet-Tuan ^{[3]}

Affiliations:

[1] Hunan Univ, Coll Comp Sci & Elect Engn, Changsha 410082, Peoples R China

[2] Yersin Univ Da Lat, Fac Informat Technol, Da Lat 66100, Vietnam

[3] Ho Chi Minh City Open Univ, Fac Informat Technol, Ho Chi Minh City 722000, Vietnam

Source:

CMC-COMPUTERS MATERIALS & CONTINUA | 2024, Vol. 80, No. 01

Funding:

Natural Science Foundation of Hunan Province; National Natural Science Foundation of China;

Keywords:

Transformer; vision transformer; self-attention; hierarchical transformer; diffusion models;

DOI:

10.32604/cmc.2024.050790

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Subject classification code:

0812 ;

Abstract:

Transformer models have emerged as the dominant networks for many computer vision tasks, overtaking Convolutional Neural Networks (CNNs). Transformers can model long-range dependencies through the self-attention mechanism. This study provides a comprehensive survey of recent transformer-based approaches in image and video applications, as well as in diffusion models. We begin by discussing existing surveys of vision transformers and comparing them to this work. We then review the main components of a vanilla transformer network, including the self-attention mechanism, the feed-forward network, and position encoding. In the main part of the survey, we review recent transformer-based models in three categories: Transformer for downstream tasks, Vision Transformer for Generation, and Vision Transformer for Segmentation. We also provide a comprehensive overview of recent transformer models for video tasks and diffusion models, and we compare the performance of various hierarchical transformer networks across multiple tasks on popular benchmark datasets. Finally, we explore future research directions that could advance the field.

Pages: 37-60

Page count: 24
