MixFormer: A Mixed CNN-Transformer Backbone for Medical Image Segmentation

Cited by: 0
Authors
Liu, Jun [1 ]
Li, Kunqi [1 ]
Huang, Chun [1 ]
Dong, Hua [1 ]
Song, Yusheng [2 ]
Li, Rihui [3 ,4 ]
Affiliations
[1] Nanchang Hangkong Univ, Dept Informat Engn, Nanchang 330063, Jiangxi, Peoples R China
[2] Peoples Hosp Ganzhou, Dept Intervent Radiol, Ganzhou 341000, Jiangxi, Peoples R China
[3] Univ Macau, Inst Collaborat Innovat, Ctr Cognit & Brain Sci, Macau, Peoples R China
[4] Univ Macau, Fac Sci & Technol, Dept Elect & Comp Engn, Macau, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Image segmentation; Transformers; Feature extraction; Semantics; Decoding; Computational modeling; Medical diagnostic imaging; Computer architecture; Computer vision; Convolutional neural networks; Medical image segmentation (SEG); mixed convolutional neural network (CNN)-Transformer backbone; mixed multibranch dilated attention (MMDA); multiscale spatial-aware fusion (MSAF); ATTENTION;
DOI
10.1109/TIM.2024.3497060
Chinese Library Classification (CLC)
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology];
Discipline Codes
0808; 0809;
Abstract
Transformers using self-attention mechanisms have recently advanced medical imaging by modeling long-range semantic dependencies, but they lack the ability of convolutional neural networks (CNNs) to capture local spatial details. This study introduces a novel segmentation network built on a mixed CNN-Transformer (MixFormer) feature extraction backbone to enhance medical image segmentation. The MixFormer network seamlessly integrates global information from the Transformer branch with local information from the CNN branch during downsampling. To comprehensively capture interscale context, we introduce a multiscale spatial-aware fusion (MSAF) module that enables effective interaction between coarse and fine feature representations. In addition, we propose a mixed multibranch dilated attention (MMDA) module to bridge the semantic gap between the encoding and decoding stages while emphasizing salient regions. Finally, we employ a CNN-based upsampling path to recover low-level features, substantially improving segmentation accuracy. Experimental validation on five widely used medical image datasets demonstrates the superior performance of MixFormer. On the Synapse dataset, our approach achieved a mean Dice similarity coefficient (DSC) of 82.64% and a mean Hausdorff distance (HD) of 12.67 mm. On the automated cardiac diagnosis challenge (ACDC) dataset, the DSC was 91.01%. On the international skin imaging collaboration (ISIC) 2018 dataset, the model achieved a mean intersection over union (mIoU) of 0.841, an accuracy of 0.958, a precision of 0.910, a recall of 0.934, and an F1 score of 0.913. On the Kvasir-SEG dataset, we recorded a mean Dice of 0.9247, an mIoU of 0.8615, a precision of 0.9181, and a recall of 0.9463. On the computer vision center (CVC)-ClinicDB dataset, the results were a mean Dice of 0.9441, an mIoU of 0.8922, a precision of 0.9437, and a recall of 0.9458. These findings underscore the superior segmentation performance of MixFormer compared with mainstream CNN- and Transformer-based segmentation networks.
Pages: 20
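The abstract describes the MixFormer encoder as a downsampling path that fuses a convolutional branch (local spatial detail) with a self-attention branch (long-range context). The sketch below is a minimal, hypothetical PyTorch illustration of that general idea, not the authors' released code: the class name MixedBlock, the branch layout, and the concatenation-plus-1x1-convolution fusion are assumptions made for clarity, and the paper's MSAF and MMDA modules are not reproduced here.

```python
# Minimal sketch of a mixed CNN-Transformer encoder block (illustrative only).
# Names and the fusion scheme are assumptions, not the MixFormer implementation.
import torch
import torch.nn as nn


class MixedBlock(nn.Module):
    """Fuses a local CNN branch with a global self-attention branch."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        # Local branch: 3x3 convolution captures fine spatial detail.
        self.local_branch = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.GELU(),
        )
        # Global branch: multi-head self-attention models long-range dependencies.
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        # 1x1 convolution mixes the two branches after concatenation.
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        local = self.local_branch(x)

        # Flatten the feature map to (B, H*W, C) tokens for self-attention.
        tokens = self.norm(x.flatten(2).transpose(1, 2))
        global_feat, _ = self.attn(tokens, tokens, tokens)
        global_feat = global_feat.transpose(1, 2).reshape(b, c, h, w)

        # Concatenate local and global features, then project back to C channels.
        return self.fuse(torch.cat([local, global_feat], dim=1))


if __name__ == "__main__":
    block = MixedBlock(channels=64)
    feat = torch.randn(1, 64, 56, 56)   # one downsampling stage's feature map
    print(block(feat).shape)            # -> torch.Size([1, 64, 56, 56])
```

In a full encoder, one such block would sit at each downsampling stage, with the paper's MSAF module fusing the resulting multiscale features and MMDA refining the skip connections to the decoder.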