MixFormer: A Mixed CNN-Transformer Backbone for Medical Image Segmentation

Authors
Liu, Jun [1]
Li, Kunqi [1]
Huang, Chun [1]
Dong, Hua [1]
Song, Yusheng [2]
Li, Rihui [3, 4]
Affiliations
[1] Nanchang Hangkong Univ, Dept Informat Engn, Nanchang 330063, Jiangxi, Peoples R China
[2] Peoples Hosp Ganzhou, Dept Intervent Radiol, Ganzhou 341000, Jiangxi, Peoples R China
[3] Univ Macau, Inst Collaborat Innovat, Ctr Cognit & Brain Sci, Macau, Peoples R China
[4] Univ Macau, Fac Sci & Technol, Dept Elect & Comp Engn, Macau, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Image segmentation; Transformers; Feature extraction; Semantics; Decoding; Computational modeling; Medical diagnostic imaging; Computer architecture; Computer vision; Convolutional neural networks; Medical image segmentation (SEG); mixed convolutional neural network (CNN)-Transformer backbone; mixed multibranch dilated attention (MMDA); multiscale spatial-aware fusion (MSAF); ATTENTION;
DOI
10.1109/TIM.2024.3497060
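Chinese Library Classification (CLC) number
TM [Electrical engineering]; TN [Electronics and communication technology]
Discipline classification codes
0808; 0809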
Abstract
Transformers using self-attention mechanisms have recently advanced medical imaging by modeling long-range semantic dependencies, though they lack the ability of convolutional neural networks (CNNs) to capture local spatial details. This study introduced a novel segmentation (SEG) network derived from a mixed CNN-Transformer (MixFormer) feature extraction backbone to enhance medical image segmentation. The MixFormer network seamlessly integrates global and local information from Transformer and CNN architectures during the downsampling process. To comprehensively capture the interscale perspective, we introduced a multiscale spatial-aware fusion (MSAF) module, enabling effective interaction between coarse and fine feature representations. In addition, we proposed a mixed multibranch dilated attention (MMDA) module to bridge the semantic gap between encoding and decoding stages while emphasizing specific regions. Finally, we implemented a CNN-based upsampling approach to recover low-level features, substantially improving segmentation accuracy. Experimental validations on prevalent medical image datasets demonstrated the superior performance of MixFormer. On the Synapse dataset, our approach achieved a mean Dice similarity coefficient (DSC) of 82.64% and a mean Hausdorff distance (HD) of 12.67 mm. On the automated cardiac diagnosis challenge (ACDC) dataset, the DSC was 91.01%. On the international skin imaging collaboration (ISIC) 2018 dataset, the model achieved a mean intersection over union (mIoU) of 0.841, an accuracy of 0.958, a precision of 0.910, a recall of 0.934, and an F1 score of 0.913. For the Kvasir-SEG dataset, we recorded a mean Dice of 0.9247, an mIoU of 0.8615, a precision of 0.9181, and a recall of 0.9463. On the computer vision center (CVC)-ClinicDB dataset, the results were a mean Dice of 0.9441, an mIoU of 0.8922, a precision of 0.9437, and a recall of 0.9458. 
These findings underscore the superior segmentation performance of MixFormer compared to most mainstream segmentation networks such as CNNs and other Transformer-based structures.
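The metrics quoted in the abstract (DSC, mIoU, accuracy, precision, recall, F1, and HD) can be illustrated with a minimal Python sketch. This is not the authors' evaluation code: the function names `segmentation_metrics` and `hausdorff_distance`, the flattened 0/1 mask representation, the zero-denominator convention, and the use of plain point coordinates without voxel spacing are all assumptions made for illustration. The record also does not say whether the reported HD is the full or the 95th-percentile variant; the sketch computes the full symmetric distance.

```python
import math
from typing import Dict, Sequence, Tuple

Point = Tuple[float, ...]


def segmentation_metrics(pred: Sequence[int], truth: Sequence[int]) -> Dict[str, float]:
    """Overlap metrics for two flattened binary masks (1 = foreground)."""
    if len(pred) != len(truth):
        raise ValueError("masks must contain the same number of pixels")

    # Confusion-matrix counts over all pixels.
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    tn = len(pred) - tp - fp - fn

    def ratio(num: float, den: float) -> float:
        # Assumed convention: an undefined ratio is reported as 0.0.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "dice": ratio(2 * tp, 2 * tp + fp + fn),
        "iou": ratio(tp, tp + fp + fn),
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
    }


def hausdorff_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric Hausdorff distance between two non-empty point sets."""
    if not a or not b:
        raise ValueError("both point sets must be non-empty")

    def directed(src: Sequence[Point], dst: Sequence[Point]) -> float:
        # Largest distance from a point in src to its nearest point in dst.
        return max(min(math.dist(p, q) for q in dst) for p in src)

    return max(directed(a, b), directed(b, a))


if __name__ == "__main__":
    prediction = [1, 1, 1, 0, 0, 0, 1, 0]
    reference = [1, 1, 0, 0, 0, 1, 1, 0]
    print(segmentation_metrics(prediction, reference))
    print(hausdorff_distance([(0, 0), (1, 0)], [(0, 0), (4, 0)]))
```

For a single binary mask, Dice and F1 are the same quantity, so dataset-level figures that differ between the two usually reflect how values are averaged over images or classes. To report HD in millimetres, point coordinates would need to be scaled by the scan's voxel spacing before calling `hausdorff_distance`.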
Pages: 20