Scene Text Recognition with Transformer using Multi-patches

Authors
Wang Y. [1]
Ha J.-E. [2]
Affiliations
[1] Graduate School of Automotive Engineering, Seoul National University of Science and Technology
[2] Department of Mechanical and Automotive Engineering, Seoul National University of Science and Technology
Keywords
Deep learning; Scene text recognition; Transformer
DOI
10.5302/J.ICROS.2022.22.0107
Abstract
In this paper, we explore the application of the Vision Transformer (ViT) to scene text recognition (STR). A popular research direction in computer vision, scene text recognition enables computers to read text in natural scenes, such as object labels, text descriptions, and road signs. At present, traditional convolutional neural network (CNN)-based models perform better. However, for scene text images with complex backgrounds and irregular text, such as curved text, diverse fonts, and distortions, the performance of CNN-based models is difficult to improve. With the adoption of transformers in computer vision, transformer-based model structures have also developed significantly. Although current transformer-based models can achieve performance similar to that of CNN-based structures, they are still at an early stage of application, and there is much room for research and improvement. We propose a multi-scale vertical rectangular patch model (MSVSTR) that makes the transformer-based feature extractor more suitable for text images. Because the patches are arranged in a single direction only, cropping the image into patches better matches the way text is distributed in a text image. In addition, to accommodate the varying number of characters in different texts and to extract more robust features, vertical rectangular patches of different scales are used to crop the image. Ablation experiments show that our structure performs better than similar transformer-based STR models, and further experiments show that it performs well on seven benchmarks. © ICROS 2022.
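The multi-scale vertical patch idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the image size (32 x 128), the patch widths (4, 8, 16), the assumption that each patch spans the full image height, and the function names `vertical_patches` and `multi_scale_vertical_patches` are all hypothetical choices made only for illustration.

```python
import numpy as np


def vertical_patches(image, patch_width):
    """Split an (H, W, C) image into non-overlapping full-height strips.

    Returns an array of shape (W // patch_width, H * patch_width * C),
    one flattened strip per row, ordered left to right.
    Assumption: each patch spans the full image height.
    """
    h, w, c = image.shape
    if w % patch_width != 0:
        raise ValueError("image width must be divisible by patch_width")
    n = w // patch_width
    # (H, W, C) -> (H, n, pw, C) -> (n, H, pw, C) -> (n, H * pw * C)
    strips = image.reshape(h, n, patch_width, c).transpose(1, 0, 2, 3)
    return strips.reshape(n, h * patch_width * c)


def multi_scale_vertical_patches(image, patch_widths=(4, 8, 16)):
    """Crop the same image with several strip widths (hypothetical scales)."""
    return [vertical_patches(image, pw) for pw in patch_widths]


# Hypothetical 32 x 128 RGB text image filled with random values.
image = np.random.rand(32, 128, 3)
tokens = multi_scale_vertical_patches(image)
shapes = [t.shape for t in tokens]
print(shapes)
```

Each scale yields a different number of patch tokens along the reading direction, so narrower strips give finer horizontal resolution for long words and wider strips cover more context per token. In a transformer, each flattened strip would typically be passed through a linear projection before entering the encoder; how the scales are combined is not stated in the abstract and is left out here.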
Pages: 862-867
Page count: 5