Leveraging Spatial-semantic Information in Object Detection and Segmentation

Cited by: 0
Authors
Guo Q.-Z. [1 ]
Yuan C. [2 ,3 ]
Affiliations
[1] Department of Computer Science and Technology, Tsinghua University, Beijing
[2] Shenzhen International Graduate School, Tsinghua University, Shenzhen
[3] Pengcheng Laboratory, Shenzhen
Source
Ruan Jian Xue Bao/Journal of Software | 2023, Vol. 34, No. 06
Keywords
attention mechanism; deep learning; feature fusion; image segmentation; object detection;
DOI
10.13328/j.cnki.jos.006509
Abstract
High-quality feature representation boosts performance in object detection and other computer vision tasks. Modern object detectors resort to versatile feature pyramids to enrich representation power, but neglect that pathways of different directions should use different fusing operations to meet their different needs for information flow. This study proposes separated spatial semantic fusion (SSSF), which uses a channel attention block (CAB) in the top-down pathway to pass semantic information, and a spatial attention block (SAB) with a bottleneck structure in the bottom-up pathway to pass precise location signals to the top level with fewer parameters and less computation than plain spatial attention without dimension reduction. SSSF is effective and generalizes well: it improves AP by more than 1.3% for object detection, outperforms plain addition as the fusing operation of the top-down path by about 0.8% for semantic segmentation, and boosts instance segmentation performance on all metrics for both bounding box AP and mask AP. © 2023 Chinese Academy of Sciences. All rights reserved.
Pages: 2776-2788
Page count: 12
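The following is a minimal PyTorch sketch of the design described in the abstract: an SE-style channel attention gate for the top-down pathway and a spatial attention gate with a channel-reducing bottleneck for the bottom-up pathway. The class names, reduction ratios, and fusion wiring are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ChannelAttentionBlock(nn.Module):
    """SE-style channel gate for the top-down pathway (passes semantics down)."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Global average pool -> per-channel weights in (0, 1) -> reweight.
        return x * self.fc(F.adaptive_avg_pool2d(x, 1))


class SpatialAttentionBlock(nn.Module):
    """Spatial gate with a 1x1 bottleneck for the bottom-up pathway.

    Reducing channels before the spatial convolution keeps parameters and
    computation low compared with spatial attention at full dimension.
    """

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        mid = max(channels // reduction, 1)
        self.gate = nn.Sequential(
            nn.Conv2d(channels, mid, kernel_size=1),      # bottleneck
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, 1, kernel_size=3, padding=1),  # 1-channel spatial map
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Per-pixel weights in (0, 1), broadcast over channels.
        return x * self.gate(x)


class SSSFFusion(nn.Module):
    """Fuses two adjacent pyramid levels with direction-specific attention
    (hypothetical wiring for illustration)."""

    def __init__(self, channels: int):
        super().__init__()
        self.cab = ChannelAttentionBlock(channels)
        self.sab = SpatialAttentionBlock(channels)

    def forward(self, low: torch.Tensor, high: torch.Tensor):
        # Top-down: channel-gate the upsampled high-level (semantic) feature.
        top_down = low + self.cab(
            F.interpolate(high, size=low.shape[-2:], mode="nearest"))
        # Bottom-up: spatially gate the low-level (localization) feature,
        # then downsample it to the high-level resolution.
        bottom_up = high + F.adaptive_max_pool2d(self.sab(low), high.shape[-2:])
        return top_down, bottom_up


if __name__ == "__main__":
    fuse = SSSFFusion(channels=256)
    p3 = torch.randn(1, 256, 64, 64)   # higher-resolution pyramid level
    p4 = torch.randn(1, 256, 32, 32)   # lower-resolution pyramid level
    td, bu = fuse(p3, p4)
    print(td.shape, bu.shape)  # (1, 256, 64, 64) and (1, 256, 32, 32)
```

Note the asymmetry: the top-down branch modulates channels because high-level features carry semantics in their channel responses, while the bottom-up branch modulates pixels because low-level features carry precise localization; the 1x1 bottleneck in the spatial gate is what keeps its cost below that of full-dimension spatial attention.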