Conformer: Local Features Coupling Global Representations for Visual Recognition

Cited by: 486
Authors
Peng, Zhiliang [1 ]
Huang, Wei [1 ]
Gu, Shanzhi [3 ]
Xie, Lingxi [2 ]
Wang, Yaowei [3 ]
Jiao, Jianbin [1 ]
Ye, Qixiang [1 ,3 ]
Affiliations
[1] Univ Chinese Acad Sci, Beijing, Peoples R China
[2] Huawei Inc, Shenzhen, Peoples R China
[3] Peng Cheng Lab, Shenzhen, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
SCALE;
DOI
10.1109/ICCV48922.2021.00042
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Within a Convolutional Neural Network (CNN), convolution operations are good at extracting local features but have difficulty capturing global representations. Within a visual transformer, the cascaded self-attention modules can capture long-distance feature dependencies but unfortunately deteriorate local feature details. In this paper, we propose a hybrid network structure, termed Conformer, to take advantage of convolutional operations and self-attention mechanisms for enhanced representation learning. Conformer is rooted in the Feature Coupling Unit (FCU), which fuses local features and global representations under different resolutions in an interactive fashion. Conformer adopts a concurrent structure so that local features and global representations are retained to the maximum extent. Experiments show that Conformer, under comparable parameter complexity, outperforms the visual transformer (DeiT-B) by 2.3% on ImageNet. On MSCOCO, it outperforms ResNet-101 by 3.7% and 3.6% mAP for object detection and instance segmentation, respectively, demonstrating its great potential as a general backbone network. Code is available at github.com/pengzhiliang/Conformer.
Pages: 357-366
Page count: 10
Related papers
50 in total
  • [31] Learning Visual Object Categories with Global Descriptors and Local Features
    Pereira, Rui
    Lopes, Luis Seabra
    PROGRESS IN ARTIFICIAL INTELLIGENCE, PROCEEDINGS, 2009, 5816 : 225 - 236
  • [32] Binding global and local object features in visual working memory
    Ericson, Justin M.
    Beck, Melissa R.
    van Lamsweerde, Amanda E.
    ATTENTION PERCEPTION & PSYCHOPHYSICS, 2016, 78 (01) : 94 - 106
  • [33] Binding global and local object features in visual working memory
    Justin M. Ericson
    Melissa R. Beck
    Amanda E. van Lamsweerde
    Attention, Perception, & Psychophysics, 2016, 78 : 94 - 106
  • [34] Interpreting local visual features as a global shape requires awareness
    Schwarzkopf, D. Samuel
    Rees, Geraint
    PROCEEDINGS OF THE ROYAL SOCIETY B-BIOLOGICAL SCIENCES, 2011, 278 (1715) : 2207 - 2215
  • [35] A Combined Visual Tracker based on Global Appearance and Local Features
    Yang, Tianyang
    Jin, Lizuo
    Li, Yawei
    Cui, Tong
    2016 IEEE INTERNATIONAL CONFERENCE ON INFORMATION AND AUTOMATION (ICIA), 2016, : 602 - 607
  • [36] IMPROVING FACE RECOGNITION USING COMBINATION OF GLOBAL AND LOCAL FEATURES
    Nor'aini, A. J.
    Raveendran, P.
    2009 6TH INTERNATIONAL SYMPOSIUM ON MECHATRONICS AND ITS APPLICATIONS (ISMA), 2009, : 433 - +
  • [37] Global and Local Features Based Topic Model for Scene Recognition
    Li, Heping
    Wang, Fangyuan
    Zhang, Shuwu
    2011 IEEE INTERNATIONAL CONFERENCE ON SYSTEMS, MAN, AND CYBERNETICS (SMC), 2011, : 532 - 537
  • [38] Facial Expression Recognition from Global and a Combination of Local Features
    Praseeda, Lekshmi V.
    Sasikumar, M.
    IETE TECHNICAL REVIEW, 2009, 26 (01) : 41 - 46
  • [39] Face recognition using most discriminative local and global features
    Gao, Yong
    Wang, Yangsheng
    Feng, Xuetao
    Zhou, Xiaoxu
    18TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION, VOL 1, PROCEEDINGS, 2006, : 351 - +
  • [40] Gender Recognition Using Fusion of Local and Global Facial Features
    Mirza, Anwar M.
    Hussain, Muhammad
    Almuzaini, Huda
    Muhammad, Ghulam
    Aboalsamh, Hatim
    Bebis, George
    ADVANCES IN VISUAL COMPUTING, PT II, 2013, 8034 : 493 - 502