Aggregating Global and Local Representations via Hybrid Transformer for Video Deraining

Cited by: 1
Authors
Mao, Deqian [1 ]
Gao, Shanshan [2 ]
Li, Zhenyu [1 ]
Dai, Honghao [1 ]
Zhang, Yunfeng [1 ,3 ]
Zhou, Yuanfeng
Affiliations
[1] Shandong Univ Finance & Econ, Sch Comp Sci & Technol, Jinan 250014, Peoples R China
[2] Shandong Univ Finance & Econ, Sch Comp Sci & Technol, Shandong China US Digital Media Int Cooperat Res C, Key Lab Digital Media Technol Shandong Prov, Jinan 250014, Peoples R China
[3] Shandong Univ, Sch Software, Jinan 250101, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Rain; Transformers; Feature extraction; Aggregates; Task analysis; Imaging; Image reconstruction; Video deraining; hybrid transformer; global and local representations; VDN-HT; REMOVAL; RAIN; LANGUAGE; VISION;
DOI
10.1109/TCSVT.2024.3372944
CLC Classification Number
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology];
Discipline Classification Code
0808 ; 0809 ;
Abstract
Although video deraining technology has achieved great success in recent years, extracting spatiotemporal feature representations across the spatial and temporal domains of successive frames, modeling them jointly, and restoring high-quality derained videos with rich details remain challenging tasks. In this paper, we make the first attempt to apply a hybrid Transformer to video rain removal and propose a novel video deraining network based on a hybrid Transformer (VDN-HT) that aggregates global and local representations to accomplish video deraining. In the feature extraction stage, we use a U-shaped structure built from serial Transformer blocks to extract shallow local features, deep global features, and global dependencies, and then adaptively aggregate them to obtain rainy-video features covering rain streaks of different directions and densities. To better model spatiotemporal relationships, the VDN-HT exploits the Transformer's long-range and relational modeling abilities to capture spatial features and temporal correlations between consecutive video frames and achieve multi-frame alignment. To ensure the global-local consistency of the reconstructed frames, we design a global-local reconstruction module that places a Transformer and a convolutional neural network (CNN) in parallel, aggregating global and local information to better reconstruct each frame. In addition, the proposed gating-based refinement module and color loss effectively retain detail and color information after rain streaks are removed. Extensive experiments on the NTURain, RainSynLight25 and RainSynHeavy25 datasets show that the VDN-HT can handle many types of rainy videos and outperforms previous methods.
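To make the parallel global-local idea concrete, the following is a minimal, hypothetical PyTorch sketch of a block that runs a self-attention (global) branch and a convolutional (local) branch in parallel and aggregates their outputs with a learned gate, in the spirit of the global-local reconstruction module described above. The class name GlobalLocalBlock, the layer sizes, and the gating fusion rule are illustrative assumptions, not the paper's actual VDN-HT implementation.

# Minimal, hypothetical sketch (not the authors' code): a parallel
# Transformer + CNN block that aggregates global and local representations.
# All names, layer choices, and the fusion rule are illustrative assumptions.
import torch
import torch.nn as nn


class GlobalLocalBlock(nn.Module):
    def __init__(self, channels: int = 64, heads: int = 4):
        super().__init__()
        # Global branch: multi-head self-attention over flattened spatial tokens.
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        # Local branch: small convolutional stack for fine spatial detail.
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )
        # Assumed fusion: a 1x1 convolution produces a per-pixel gate.
        self.gate = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # Global path: (B, C, H, W) -> (B, H*W, C) tokens for self-attention.
        tokens = self.norm(x.flatten(2).transpose(1, 2))
        glb, _ = self.attn(tokens, tokens, tokens)
        glb = glb.transpose(1, 2).reshape(b, c, h, w)
        # Local path: convolutions on the original spatial layout.
        loc = self.conv(x)
        # Gated aggregation of global and local representations.
        gate = torch.sigmoid(self.gate(torch.cat([glb, loc], dim=1)))
        return x + gate * glb + (1.0 - gate) * loc


if __name__ == "__main__":
    feat = torch.randn(1, 64, 32, 32)       # features of one video frame
    print(GlobalLocalBlock()(feat).shape)   # torch.Size([1, 64, 32, 32])

The residual connection and sigmoid gate are one simple way to let the network weight global context against local texture per pixel; the paper's own module may fuse the two branches differently.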
Pages: 7512-7522
Number of pages: 11